Pipeline Methods

A reproducible, open-source pipeline that ingests, conflates, scores, and tiles point-of-interest data from OpenStreetMap and Overture Maps.

Overview

The AnythingPOI pipeline fuses point-of-interest records from two complementary open datasets into a single, clean, deduplicated output. OSM provides community-verified, often hand-tagged data with strong geographic fidelity; Overture provides commercial-scale data with rich attribute coverage. Neither source alone is sufficient — OSM has coverage gaps in commercial POIs; Overture lacks the granular tagging and community verification of OSM.

The pipeline runs as a single Python process per region, with multiprocessing used for the OSM parse and confidence scoring stages. Runtime on a 16-core machine is approximately 2–4 hours for a full country.

[Pipeline diagram: OSM PBF (Geofabrik) + Overture (S3 / DuckDB) → Conflate (H3 + Jaro-Winkler) → Score (0.01–0.99) → GeoParquet, PMTiles]

1. Ingestion

OSM

OSM PBF files are downloaded from Geofabrik and parsed with PyOsmium using a streaming handler. The handler accepts any feature with a name or brand tag and rejects features whose primary tag appears on a non-POI blocklist (highways, boundaries, routes, waterways, landuse, etc.).

Way and relation geometries are resolved to centroids via the WKBFactory and Shapely. For large PBFs (>1.5 GB), the file is first clipped to the target bounding box using osmconvert, reducing parse time by 60–90%.

Each extracted record carries: osm_id, name, geometry type (node/way/relation), contact attributes, and a raw osm_category tag (e.g. amenity=restaurant).

Overture Maps

Overture places data is fetched via DuckDB with the httpfs extension, querying directly from the public S3 release path. The bounding-box filter is pushed down to the Parquet scan as a predicate, so only rows within the target region are pulled. For offline or high-throughput runs, pre-downloaded regional Parquet files are used instead.
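As an illustration of the bounding-box pushdown, the sketch below only builds the SQL text and does not run it. The bucket layout, the `bbox.xmin`-style columns, and the selected column names follow the public Overture places schema as best understood here and should be treated as assumptions; `release` is a placeholder for whichever release tag is in use, and `overture_places_query` is a hypothetical helper name.

```python
def overture_places_query(bbox, release):
    """Build a DuckDB SQL string selecting Overture places inside bbox.

    bbox is (min_lon, min_lat, max_lon, max_lat). `release` is a placeholder
    for the Overture release tag; it is not taken from the pipeline itself.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    # Assumed layout of the public release bucket; verify against Overture docs.
    path = (
        f"s3://overturemaps-us-west-2/release/{release}"
        "/theme=places/type=place/*"
    )
    # Filtering on the bbox struct lets the Parquet reader skip row groups
    # whose statistics fall entirely outside the region.
    return f"""
        SELECT id AS overture_id, names.primary AS name, categories, geometry
        FROM read_parquet('{path}', hive_partitioning = 1)
        WHERE bbox.xmin >= {min_lon} AND bbox.xmax <= {max_lon}
          AND bbox.ymin >= {min_lat} AND bbox.ymax <= {max_lat}
    """


query = overture_places_query((-76.0, 45.0, -75.0, 46.0), release="RELEASE_TAG")
```

The resulting string would typically be passed to a DuckDB connection after loading the httpfs extension and setting the S3 region.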

Each Overture record carries: overture_id, name, primary and extended category strings, brand, address components, website, phone, and geometry (always a point).

Filtering

Both sources are filtered to named features only. OSM features are additionally filtered through the non-POI blocklist. Address-only nodes (features with addr:* tags but no POI primary key) are excluded. After extraction, both DataFrames are deduplicated by source ID.
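The filter rules above can be sketched as a small predicate over an OSM tag dictionary. This is a simplified reading, not the pipeline's handler: the two key sets are illustrative subsets, "primary tag on the blocklist" is approximated as "any blocklisted key present", and the function names are hypothetical.

```python
# Illustrative subsets only; the real blocklist and POI key list are longer.
NON_POI_KEYS = {"highway", "boundary", "route", "waterway", "landuse"}
POI_KEYS = ("amenity", "shop", "tourism", "leisure", "office")


def is_poi_candidate(tags):
    """Return True if an OSM tag dict passes the filters described above."""
    # Must carry a name or brand tag.
    if not (tags.get("name") or tags.get("brand")):
        return False
    # Simplification: reject when any blocklisted key is present.
    if any(key in tags for key in NON_POI_KEYS):
        return False
    # Address-only features: addr:* tags but no POI primary key.
    has_addr = any(key.startswith("addr:") for key in tags)
    has_poi_key = any(key in tags for key in POI_KEYS)
    if has_addr and not has_poi_key:
        return False
    return True


def dedupe_by_source_id(records, id_field):
    """Keep the first record seen for each source ID, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        if rec[id_field] not in seen:
            seen.add(rec[id_field])
            unique.append(rec)
    return unique
```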

2. Conflation

Spatial Blocking

All records are assigned an H3 index at resolution 11 (average edge length ~25 m, average cell area ~2,150 m²). Comparing all pairs globally is O(n²) and impractical at millions of records, so candidates are compared only within the same H3 cell and its six immediate neighbours. This reduces the comparison space by four to five orders of magnitude while still pairing features that sit on opposite sides of a cell boundary. The seven-cell neighbourhood is roughly 130–150 m across at its widest, so no candidate pair is more than about 150 m apart.
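The blocking idea can be sketched independently of the H3 library. In the example below, `block_candidates` takes two callables, one mapping a record to a cell and one returning a cell's neighbourhood; a toy square grid stands in for H3 so the sketch runs with the standard library only. The grid, the cell size, and all function names are assumptions for illustration, not part of the pipeline.

```python
from collections import defaultdict


def block_candidates(osm, overture, cell_of, neighbours_of):
    """Yield (osm_record, overture_record) pairs in the same or adjacent cells."""
    # Index Overture records by cell once, then probe per OSM record.
    index = defaultdict(list)
    for rec in overture:
        index[cell_of(rec)].append(rec)
    for rec in osm:
        for cell in neighbours_of(cell_of(rec)):
            for candidate in index.get(cell, ()):
                yield rec, candidate


# Toy stand-in for H3: a square lat/lon grid (NOT hexagonal, NOT equal-area).
CELL_DEG = 0.0005


def square_cell(rec):
    return (int(rec["lat"] // CELL_DEG), int(rec["lon"] // CELL_DEG))


def square_neighbours(cell):
    row, col = cell
    # The cell itself plus its eight surrounding cells.
    return [(row + dr, col + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


osm = [{"id": "n1", "lat": 45.42151, "lon": -75.69721}]
overture = [
    {"id": "A", "lat": 45.42155, "lon": -75.69725},  # a few metres away
    {"id": "B", "lat": 45.50000, "lon": -75.60000},  # kilometres away
]
pairs = list(block_candidates(osm, overture, square_cell, square_neighbours))
```

With the h3 Python package, `cell_of` and `neighbours_of` would wrap the library's point-to-cell and ring functions (named `geo_to_h3` / `k_ring` in v3 and `latlng_to_cell` / `grid_disk` in v4) at resolution 11 with a ring size of 1.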

Candidate Matching

For each OSM record, Overture candidates in the same H3 neighbourhood are scored, and a candidate is eligible only if it passes all three of the following checks:

  1. Distance gate: Haversine distance <50 m (hard cutoff).
  2. Category agreement: Both records must map to the same Tier-1 category under the taxonomy crosswalk.
  3. Name similarity: Jaro–Winkler similarity ≥0.85 on normalised names (lowercase, punctuation stripped, common stopwords removed).
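The three checks above can be sketched with the standard library alone. The Jaro–Winkler function below is a plain textbook implementation (prefix scale 0.1, prefix capped at four characters); the stopword list, the record field names (`lat`, `lon`, `name`, `tier1`), and the function names are assumptions made for the example.

```python
import re
from math import asin, cos, radians, sin, sqrt

STOPWORDS = {"the", "and", "of"}  # hypothetical; the real list is not given


def normalise(name):
    """Lowercase, strip punctuation, drop stopwords."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(t for t in tokens if t not in STOPWORDS)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a sphere of mean Earth radius."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371008.8 * asin(sqrt(a))


def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == ch:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            out_of_order += s1[i] != s2[k]
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scale=0.1):
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * scale * (1 - j)


def gate(osm, ov, max_dist_m=50.0, min_name_sim=0.85):
    """Return the name similarity if all three checks pass, else None."""
    if haversine_m(osm["lat"], osm["lon"], ov["lat"], ov["lon"]) >= max_dist_m:
        return None
    if osm["tier1"] != ov["tier1"]:
        return None
    sim = jaro_winkler(normalise(osm["name"]), normalise(ov["name"]))
    return sim if sim >= min_name_sim else None
```

Returning the similarity rather than a boolean lets a later step rank the surviving candidates; using name similarity as that ranking score is itself an assumption, since the composite score is not spelled out here.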

The highest-scoring Overture candidate that satisfies all three criteria is taken as the match. Each Overture record can match at most one OSM record. Unmatched records from either source pass through as single-source POIs.
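One simple way to enforce the one-to-one constraint is a greedy pass over candidate pairs sorted by score. How the pipeline actually resolves competing claims on the same Overture record is not stated, so the strategy below, along with the `assign_matches` name and the tuple layout, is an assumption.

```python
def assign_matches(scored_pairs):
    """Greedy one-to-one assignment.

    scored_pairs: iterable of (osm_id, overture_id, score) tuples that have
    already passed the distance, category, and name checks.
    Returns a dict mapping osm_id -> overture_id.
    """
    matches = {}
    used_overture = set()
    # Best-scoring pairs claim their records first.
    for osm_id, ov_id, _score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if osm_id in matches or ov_id in used_overture:
            continue
        matches[osm_id] = ov_id
        used_overture.add(ov_id)
    return matches


pairs = [("n1", "A", 0.95), ("n2", "A", 0.90), ("n2", "B", 0.88)]
result = assign_matches(pairs)
```

Here `n2` loses its preferred candidate `A` to the higher-scoring `n1` and falls back to `B`. Records that appear in no pair remain single-source POIs.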

Attribute Merging

For matched pairs, attributes are merged field by field. OSM values are preferred for geometry and community-maintained fields (opening_hours, wikidata). For commercial fields (website, phone, brand, address components), the Overture value is used when the OSM value is absent.
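Read literally, the merge rule amounts to a per-field coalesce with OSM first and Overture as the fallback. The sketch below implements that reading; the flat field names (`street`, `housenumber`, `postcode`) and the treatment of empty strings as absent are assumptions.

```python
OSM_PREFERRED = ("geometry", "opening_hours", "wikidata")
# Hypothetical flat names for the commercial and address fields.
COMMERCIAL = ("website", "phone", "brand", "street", "housenumber", "postcode")


def first_present(*values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def merge_attributes(osm, overture):
    """Merge a matched OSM/Overture pair into one attribute dict."""
    merged = {}
    for field in OSM_PREFERRED + COMMERCIAL:
        # OSM wins when it has a value; Overture fills the gaps.
        merged[field] = first_present(osm.get(field), overture.get(field))
    return merged


osm = {"geometry": (45.0, -75.0), "opening_hours": "Mo-Fr 09:00-17:00", "phone": ""}
overture = {"geometry": (45.0001, -75.0001), "phone": "+1 613 555 0100", "website": "https://example.com"}
merged = merge_attributes(osm, overture)
```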

3. Confidence Scoring

Each POI receives a confidence score in [0.01, 0.99] quantifying the reliability of its identity and data quality. The score starts from a base of 0.5 and adds or subtracts a fixed delta for each signal present:

| Signal | Direction | Delta | Rationale |
|---|---|---|---|
| Dual-source (conflated) | + | +0.05 | Independent corroboration from two distinct datasets |
| Wikidata QID present | + | +0.15 | Encyclopaedic entity verification; strongest single signal |
| Website present | + | +0.03 | Indicates an established, findable entity |
| Phone present | + | +0.02 | Operational contact information |
| Street address present | + | +0.02 | Geocodable physical presence |
| Opening hours present | + | +0.02 | Implies active business with maintained data |
| OSM way/relation | + | +0.03 | Mapped polygon implies deliberate, verified feature |
| Digit-only name | − | −0.15 | Strong indicator of placeholder or malformed data |
| Very short name (≤2 chars) | − | −0.10 | Likely incomplete data |
| Name is URL | − | −0.10 | Data quality issue |
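The additive scheme in the table can be sketched as a single function. The deltas and the [0.01, 0.99] bounds come from the text above; the record field names, the URL test, and the choice to let penalties stack are assumptions.

```python
import re

BASE = 0.5
URL_RE = re.compile(r"^(https?://|www\.)", re.IGNORECASE)


def confidence_score(poi):
    """Apply the signal deltas to a POI dict and clamp to [0.01, 0.99]."""
    name = (poi.get("name") or "").strip()
    score = BASE
    # Positive signals.
    if poi.get("osm_id") and poi.get("overture_id"):
        score += 0.05  # dual-source (conflated)
    if poi.get("wikidata"):
        score += 0.15
    if poi.get("website"):
        score += 0.03
    if poi.get("phone"):
        score += 0.02
    if poi.get("street"):
        score += 0.02
    if poi.get("opening_hours"):
        score += 0.02
    if poi.get("osm_type") in ("way", "relation"):
        score += 0.03
    # Negative signals; assumed to stack when several apply.
    if name.isdigit():
        score -= 0.15
    if len(name) <= 2:
        score -= 0.10
    if URL_RE.match(name):
        score -= 0.10
    return round(min(0.99, max(0.01, score)), 2)


full = {
    "name": "City Hospital", "osm_id": 1, "overture_id": "x", "osm_type": "way",
    "wikidata": "Q1", "website": "https://example.org", "phone": "+1 555 0100",
    "street": "Main St", "opening_hours": "24/7",
}
```

Under these assumptions, a record with every positive signal reaches 0.82, and a single-source record named "7" drops to 0.25 because both name penalties apply.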

Additionally, for conflated records, individual attribute match signals are recorded as boolean flags: web_match, phone_match, st_match, house_match, pc_match, wiki_match, cuisine_match. These drive the match quality breakdown visible in the explorer.

Observed score distribution (Aurora v0.1): Mean 0.73, median 0.73, IQR 0.68–0.79. Only 0.2% of records score below 0.5; 15% score ≥0.8. Wikidata QID is the dominant positive signal, present in 91–97% of conflated pairs across regions.

4. Taxonomy

The taxonomy is a two-level classification system mapping raw source tags to a consistent hierarchy:

  • 17 Tier-1 categories (e.g. Food & Beverage, Healthcare, Transportation)
  • 196 Tier-2 subcategories (e.g. Restaurant, Hospital, Bus Stop)
  • 577 OSM rules mapping key=value tag pairs to Tier-2
  • 1,539 Overture rules mapping Overture category strings to Tier-2

Classification rules are applied in priority order. OSM features are matched against an ordered list of tag keys (amenity, shop, tourism, leisure, office, …); the first matching key–value pair determines the Tier-2 class. Features that match no rule are classified as Other / Uncategorised and recorded in the unclassified report for iterative taxonomy improvement.
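The first-match rule for OSM tags can be sketched as a lookup over an ordered key list. The two rules shown are illustrative stand-ins for the 577 OSM rules, the key order is truncated to the keys named above, and the function and constant names are hypothetical.

```python
# Truncated priority list; the real list continues beyond these keys.
KEY_PRIORITY = ("amenity", "shop", "tourism", "leisure", "office")

# Illustrative rules only: (key, value) -> (Tier-1, Tier-2).
OSM_RULES = {
    ("amenity", "restaurant"): ("Food & Beverage", "Restaurant"),
    ("amenity", "hospital"): ("Healthcare", "Hospital"),
}

OTHER = ("Other / Uncategorised", None)


def classify_osm(tags):
    """Return (tier1, tier2) from the first key-value pair that has a rule."""
    for key in KEY_PRIORITY:
        value = tags.get(key)
        if value is not None and (key, value) in OSM_RULES:
            return OSM_RULES[(key, value)]
    # No rule matched; such features would go to the unclassified report.
    return OTHER
```

Because keys are checked in priority order, a feature tagged both `amenity=restaurant` and `tourism=attraction` is classified by its amenity tag.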


5. Output Formats

Cloud-Optimised GeoParquet

The primary analytical output is a set of 17 GeoParquet files, one per Tier-1 category (excluding Other / Uncategorised), written with Snappy compression and sorted by H3 cell for spatial locality. Each file contains 35 columns: geometry (WKB, EPSG:4326), source IDs, name, tier classification, contact attributes, address components, conflation metadata, and the confidence score.

Files are compatible with GeoPandas, DuckDB, QGIS, and any GeoParquet-aware reader. The geometry column is annotated with GeoParquet 1.0 metadata.

PMTiles

A single PMTiles v3 archive is generated per region for web map delivery. POIs are tiled into four zoom-band source layers to balance data density and load performance:

| Layer | Zoom range | Contents |
|---|---|---|
| national | z4–z6 | Highest-confidence POIs only (≥0.7); major landmarks, airports, hospitals |
| provincial | z7–z10 | Regional POIs; confidence-filtered at each zoom |
| city | z11–z13 | Full city-scale data; all categories |
| street | z14–z16 | Complete dataset; all attributes included |
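The zoom bands can be expressed as a small lookup, which is how a client might decide which source layer to style at a given zoom. Only the national-layer threshold (≥0.7) is stated above, so the sketch leaves the provincial filter out; the function names are hypothetical, and behaviour outside z4–z16 is an assumption.

```python
# (layer name, min zoom, max zoom), taken from the table above.
LAYERS = (
    ("national", 4, 6),
    ("provincial", 7, 10),
    ("city", 11, 13),
    ("street", 14, 16),
)


def layer_for_zoom(zoom):
    """Return the source layer covering an integer zoom level, or None."""
    for name, min_zoom, max_zoom in LAYERS:
        if min_zoom <= zoom <= max_zoom:
            return name
    return None  # outside z4-z16; not specified in the text


def in_national_layer(poi, threshold=0.7):
    """National layer keeps only POIs at or above the stated threshold."""
    return poi["confidence_score"] >= threshold
```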

Each tile feature carries: id, name, tier1, tier2, sources, confidence_score, website, phone, brand, and opening_hours.

Limitations

  • Static snapshot: Each release is a point-in-time snapshot. OSM and Overture are live datasets; the fused output reflects the state at pipeline run time.
  • Conflation recall: The 50 m distance gate and 0.85 name similarity threshold are conservative. Real-world duplicates with transliterated names, address-only matches, or large-polygon centroids may not be caught.
  • Taxonomy coverage: Features not matched by any rule fall into Other / Uncategorised. These are tracked in the per-region unclassified_report.json for iterative improvement.
  • Geometry type: All output records are points. For OSM ways/relations, the centroid is used. This means a large park or shopping centre is represented as a single point, not its footprint.
  • Address completeness: Address fields reflect what was present in source data. Coverage varies significantly by country and category (e.g. Parks & Nature has ~42% street coverage vs. ~97% for Healthcare in Canada).
  • Language: Names are stored as-is from source data. No transliteration or translation is applied.