Technical documentation
A reproducible, open-source pipeline that ingests, conflates, scores, and tiles point-of-interest data from OpenStreetMap and Overture Maps.
The AnythingPOI pipeline fuses point-of-interest records from two complementary open datasets into a single, clean, deduplicated output. OSM provides community-verified, often hand-tagged data with strong geographic fidelity; Overture provides commercial-scale data with rich attribute coverage. Neither source alone is sufficient — OSM has coverage gaps in commercial POIs; Overture lacks the granular tagging and community verification of OSM.
The pipeline runs as a single Python process per region, with multiprocessing used for the OSM parse and confidence scoring stages. Runtime on a 16-core machine is approximately 2–4 hours for a full country.
OSM PBF files are downloaded from Geofabrik and parsed with PyOsmium using a streaming handler. The handler accepts any feature with a name or brand tag and rejects features whose primary tag appears on a non-POI blocklist (highways, boundaries, routes, waterways, landuse, etc.).
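The acceptance rule can be sketched as a pure function over a feature's tag dictionary. This is a minimal illustration rather than the pipeline's handler: the blocklist holds only the five keys named above, the function name `is_poi_candidate` is hypothetical, and rejecting a feature when any blocklisted key is present is an assumption about how "primary tag" is resolved.

```python
# Keys named in the text; the real blocklist is longer ("etc.").
NON_POI_KEYS = frozenset({"highway", "boundary", "route", "waterway", "landuse"})


def is_poi_candidate(tags: dict) -> bool:
    """Return True if a tag dict passes the name/brand and blocklist checks."""
    has_label = bool(tags.get("name") or tags.get("brand"))
    if not has_label:
        return False
    # Assumption: a feature is rejected if any blocklisted key is present.
    return NON_POI_KEYS.isdisjoint(tags)
```

In a PyOsmium handler, a check of this kind would typically run on the tags of each node, way, and relation before any geometry is built.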
Way and relation geometries are resolved to centroids via PyOsmium's WKBFactory and Shapely. For large PBFs (>1.5 GB), the file is first clipped to the target bounding box using osmconvert, reducing parse time by 60–90%.
Each extracted record carries: osm_id, name, geometry type (node/way/relation), contact attributes, and a raw osm_category tag (e.g. amenity=restaurant).
Overture places data is fetched via DuckDB with the httpfs extension, querying directly from the public S3 release path. The bounding-box filter is pushed down to the Parquet scan, so only rows within the target region are pulled. For offline or high-throughput runs, pre-downloaded regional Parquet files are used instead.
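A bounding-box query of this kind might be assembled as follows. This is a sketch under stated assumptions: `build_places_query` is a hypothetical helper, `parquet_path` is a placeholder for whichever release path or local file is in use, and the column names (`id`, `names`, `categories`, `geometry`, and the `bbox` struct with `xmin`/`xmax`/`ymin`/`ymax`) should be checked against the Overture release being queried.

```python
def build_places_query(parquet_path, west, south, east, north):
    """Build a DuckDB SQL string selecting places inside a bounding box."""
    # Coerce to float so only numeric literals are interpolated into the SQL.
    west, south, east, north = (float(v) for v in (west, south, east, north))
    return (
        "SELECT id, names, categories, geometry\n"
        f"FROM read_parquet('{parquet_path}', hive_partitioning = 1)\n"
        # Filtering on the bbox struct lets DuckDB skip row groups outside the region.
        f"WHERE bbox.xmin >= {west} AND bbox.xmax <= {east}\n"
        f"  AND bbox.ymin >= {south} AND bbox.ymax <= {north}"
    )
```

With the `duckdb` Python package, the resulting string would be passed to `con.execute(...)` on a connection where `INSTALL httpfs; LOAD httpfs;` has already been run.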
Each Overture record carries: overture_id, name, primary and extended category strings, brand, address components, website, phone, and geometry (always a point).
Both sources are filtered to named features only. OSM features are additionally filtered through the non-POI blocklist. Address-only nodes (features with addr:* tags but no POI primary key) are excluded. After extraction, both DataFrames are deduplicated by source ID.
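The address-only check and the deduplication step can be illustrated without any DataFrame library. In this sketch the helper names are hypothetical, the list of POI primary keys is an assumed subset, and keeping the first record seen for each ID is an assumption about how duplicates are resolved.

```python
# Assumed subset of POI primary keys; the pipeline's full list is not given here.
POI_PRIMARY_KEYS = ("amenity", "shop", "tourism", "leisure", "office")


def is_address_only(tags: dict) -> bool:
    """True if the feature has addr:* tags but no POI primary key."""
    has_addr = any(key.startswith("addr:") for key in tags)
    has_primary = any(key in tags for key in POI_PRIMARY_KEYS)
    return has_addr and not has_primary


def dedupe_by_id(records: list, id_field: str) -> list:
    """Keep the first record seen for each source ID, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = record[id_field]
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```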
All records are assigned an H3 index at resolution 11 (~25 m edge length, ~2,150 m² average cell area). Comparing all pairs globally is O(n²) and impractical at millions of records, so candidates are compared only within the same H3 cell and its six immediate neighbours. This reduces the comparison space by four to five orders of magnitude. Any two features within roughly one cell width (~40 m) of each other are guaranteed to be compared; more distant pairs are compared only when they happen to fall in the same or adjacent cells.
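The blocking step can be sketched generically, with the cell index and neighbourhood lookup injected as plain values and a callable. This is an illustration, not the pipeline's code: `candidate_pairs` is a hypothetical name, and in practice the cell would come from something like `h3.latlng_to_cell(lat, lng, 11)` and the neighbourhood from `h3.grid_disk(cell, 1)` (h3-py v4 names), which returns the cell itself plus its ring of neighbours.

```python
from collections import defaultdict


def candidate_pairs(osm, overture, neighbourhood):
    """Yield (osm_id, overture_id) pairs that share a cell neighbourhood.

    osm and overture are iterables of (record_id, cell) tuples;
    neighbourhood(cell) returns the cell plus its immediate neighbours.
    """
    # Index Overture records by cell so each lookup is O(1).
    by_cell = defaultdict(list)
    for overture_id, cell in overture:
        by_cell[cell].append(overture_id)
    for osm_id, cell in osm:
        for near in neighbourhood(cell):
            for overture_id in by_cell.get(near, ()):
                yield osm_id, overture_id
```

Only the pairs yielded here need to be scored, which is where the reduction relative to an all-pairs comparison comes from.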
For each OSM record, Overture candidates in the same H3 neighbourhood are ranked by a composite score:
The highest-scoring Overture candidate above all three thresholds is taken as the match. Each Overture record can match at most one OSM record. Unmatched records from either source pass through as single-source POIs.
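One way to enforce the one-to-one constraint is a greedy assignment over scored candidate pairs. The sketch below assumes the pairs have already passed the thresholds, and that contention for the same Overture record is resolved globally in favour of the highest score; the text does not specify the tie-breaking rule, so both the function name `assign_matches` and that policy are assumptions.

```python
def assign_matches(scored_pairs):
    """Greedy one-to-one assignment from (osm_id, overture_id, score) triples."""
    matches = {}
    used_overture = set()
    # Highest scores first; ties are broken by the IDs for determinism.
    ordered = sorted(scored_pairs, key=lambda p: (-p[2], p[0], p[1]))
    for osm_id, overture_id, _score in ordered:
        if osm_id in matches or overture_id in used_overture:
            continue  # one side of this pair is already matched
        matches[osm_id] = overture_id
        used_overture.add(overture_id)
    return matches
```

Records that never appear in the returned mapping would pass through as single-source POIs.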
For matched pairs, attributes are merged with OSM values preferred for geographic fields (geometry, opening_hours, wikidata) and Overture values filling commercial fields (website, phone, brand, address components) when the OSM values are absent.
Each POI receives a confidence score in [0.01, 0.99] quantifying the reliability of its identity and data quality. The score is additive from a base of 0.5 with positive and negative signals:
| Signal | Direction | Delta | Rationale |
|---|---|---|---|
| Dual-source (conflated) | + | +0.05 | Independent corroboration from two distinct datasets |
| Wikidata QID present | + | +0.15 | Encyclopaedic entity verification; strongest single signal |
| Website present | + | +0.03 | Indicates an established, findable entity |
| Phone present | + | +0.02 | Operational contact information |
| Street address present | + | +0.02 | Geocodable physical presence |
| Opening hours present | + | +0.02 | Implies active business with maintained data |
| OSM way/relation | + | +0.03 | Mapped polygon implies deliberate, verified feature |
| Digit-only name | − | −0.15 | Strong indicator of placeholder or malformed data |
| Very short name (≤2 chars) | − | −0.10 | Likely incomplete data |
| Name is URL | − | −0.10 | Data quality issue |
Additionally, for conflated records, individual attribute match signals are recorded as boolean flags: web_match, phone_match, st_match, house_match, pc_match, wiki_match, cuisine_match. These drive the match quality breakdown visible in the explorer.
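The additive scoring in the table above can be sketched as a single function. The deltas, base, and clamp range come from the table and the surrounding text; the record field names (`osm_type`, `street`, and so on), the URL test, and the choice to apply the digit-only and short-name penalties together are assumptions made for illustration.

```python
import re

BASE_SCORE = 0.5
# Assumed definition of "name is URL": starts with a scheme or "www.".
URL_PATTERN = re.compile(r"^(https?://|www\.)", re.IGNORECASE)


def confidence_score(poi: dict) -> float:
    """Additive confidence score, clamped to [0.01, 0.99]."""
    name = (poi.get("name") or "").strip()
    score = BASE_SCORE
    # Positive signals
    if poi.get("osm_id") and poi.get("overture_id"):
        score += 0.05  # dual-source (conflated)
    if poi.get("wikidata"):
        score += 0.15
    if poi.get("website"):
        score += 0.03
    if poi.get("phone"):
        score += 0.02
    if poi.get("street"):
        score += 0.02
    if poi.get("opening_hours"):
        score += 0.02
    if poi.get("osm_type") in ("way", "relation"):
        score += 0.03
    # Negative signals
    if name.isdigit():
        score -= 0.15
    if len(name) <= 2:
        score -= 0.10
    if URL_PATTERN.match(name):
        score -= 0.10
    # Round away float noise; every delta is a multiple of 0.01.
    return round(min(0.99, max(0.01, score)), 2)
```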
The taxonomy is a two-level classification system that maps raw source tags (key=value pairs) to a consistent Tier-1 / Tier-2 hierarchy.
Classification rules are applied in priority order. OSM features are matched against an ordered list of tag keys (amenity, shop, tourism, leisure, office, …); the first matching key–value pair determines the Tier-2 class. Features that match no rule are classified as Other / Uncategorised and recorded in the unclassified report for iterative taxonomy improvement.
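Priority-ordered classification can be illustrated with a small lookup. The key order below is taken from the text, while the rule table and its tier names are hypothetical examples, not the pipeline's actual taxonomy.

```python
# Ordered tag keys from the text; the full list is longer.
KEY_PRIORITY = ("amenity", "shop", "tourism", "leisure", "office")

# Hypothetical rules: (key, value) -> (Tier-1, Tier-2).
RULES = {
    ("amenity", "restaurant"): ("Food & Drink", "Restaurant"),
    ("shop", "bakery"): ("Retail", "Bakery"),
    ("tourism", "museum"): ("Culture", "Museum"),
}

FALLBACK = ("Other", "Uncategorised")


def classify(tags: dict) -> tuple:
    """Return (tier1, tier2) from the first key-value pair that has a rule."""
    for key in KEY_PRIORITY:
        value = tags.get(key)
        if value is not None and (key, value) in RULES:
            return RULES[(key, value)]
    return FALLBACK
```

Because keys are checked in order, a feature tagged with both `amenity=restaurant` and `tourism=museum` resolves to the amenity rule in this sketch.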
The primary analytical output is a set of 18 GeoParquet files, one per Tier-1 category (excluding Other/Uncategorised), written with Snappy compression and sorted by H3 cell for spatial locality. Each file contains 35 columns: geometry (WKB, EPSG:4326), source IDs, name, tier classification, contact attributes, address components, conflation metadata, and the confidence score.
Files are compatible with GeoPandas, DuckDB, QGIS, and any GeoParquet-aware reader. The geometry column is annotated with GeoParquet 1.0 metadata.
A single PMTiles v3 archive is generated per region for web map delivery. POIs are tiled into four zoom-band source layers to balance data density and load performance:
| Layer | Zoom range | Contents |
|---|---|---|
| national | z4–z6 | Highest-confidence POIs only (≥0.7); major landmarks, airports, hospitals |
| provincial | z7–z10 | Regional POIs; confidence-filtered at each zoom |
| city | z11–z13 | Full city-scale data; all categories |
| street | z14–z16 | Complete dataset; all attributes included |
Each tile feature carries: id, name, tier1, tier2, sources, confidence_score, website, phone, brand, and opening_hours.
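The zoom bands in the table above translate directly into a lookup. The layer names, zoom ranges, and the 0.7 cut-off for the national layer come from the table; the function names are hypothetical, and the per-zoom filtering applied in the provincial layer is not specified, so it is left out.

```python
# (layer name, min zoom, max zoom) from the zoom-band table.
LAYERS = (
    ("national", 4, 6),
    ("provincial", 7, 10),
    ("city", 11, 13),
    ("street", 14, 16),
)

NATIONAL_MIN_CONFIDENCE = 0.7


def layer_for_zoom(zoom: int):
    """Return the source layer covering a zoom level, or None if out of range."""
    for name, min_zoom, max_zoom in LAYERS:
        if min_zoom <= zoom <= max_zoom:
            return name
    return None


def in_national_layer(poi: dict) -> bool:
    """True if a POI meets the confidence cut-off for the national layer."""
    return poi["confidence_score"] >= NATIONAL_MIN_CONFIDENCE
```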
Features that match no classification rule are written to unclassified_report.json for iterative improvement.