
Working with OpenStreetMap at Scale

OpenStreetMap (OSM) is the world's largest open geographic database. Processing its full planet file for machine learning and spatial analytics requires specialised parsing strategies, efficient data structures, and cloud-scale infrastructure.

OpenStreetMap (OSM) is the world’s largest collaborative geographic database, maintained by over 10 million contributors and containing data on roads, buildings, land use, points of interest, waterways, and thousands of other feature types worldwide. The full planet file — a snapshot of all OSM data — weighs approximately 75 GB compressed (PBF format) and represents one of the most valuable open datasets in geospatial AI research.

Working with OSM at scale — beyond city or country extracts — demands a different set of tools and architectural patterns than standard GIS workflows.

OSM Data Model

Understanding the OSM data model is essential before writing a single line of parsing code. OSM has three primitive types:

Nodes — single coordinate points with optional tags (key-value attributes). Used for point features (trees, post boxes, shop entrances) and as building blocks for ways.

Ways — ordered sequences of node references. Closed ways (where the last node equals the first) represent polygons (buildings, parks, lakes). Open ways represent linear features (roads, rivers, walls).

Relations — ordered collections of nodes, ways, and other relations, with roles. Used for complex multipolygon features, turn restrictions, public transit routes, and administrative boundaries.

All three types can carry tags — arbitrary key-value pairs that encode feature semantics. highway=motorway, amenity=cafe, building=residential are examples. There are thousands of documented tag combinations on the OSM wiki.
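The three primitives can be sketched as plain data structures. This is an illustrative model, not an OSM library API; the class and field names are my own:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: int
    lat: float
    lon: float
    tags: dict = field(default_factory=dict)

@dataclass
class Way:
    id: int
    node_refs: list  # ordered node IDs, not coordinates
    tags: dict = field(default_factory=dict)

    def is_closed(self) -> bool:
        # a closed way (first ref == last ref) represents a polygon
        return len(self.node_refs) > 1 and self.node_refs[0] == self.node_refs[-1]

@dataclass
class Relation:
    id: int
    members: list  # (type, ref, role) triples
    tags: dict = field(default_factory=dict)

park = Way(1, [10, 11, 12, 10], {'leisure': 'park'})      # closed: polygon
road = Way(2, [10, 13, 14], {'highway': 'residential'})   # open: linear
```

Note that a way stores node IDs rather than coordinates, which is why resolving geometry at scale is non-trivial (see "Referential integrity" below).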

Parsing Strategies

Osmium is the canonical low-level OSM parsing library: libosmium in C++, with pyosmium providing Python bindings and osmium-tool a command-line interface. It exposes an event-driven API in which you define handlers for nodes, ways, and relations, making it memory-efficient for streaming the full planet:

import osmium

class CafeHandler(osmium.SimpleHandler):
    def __init__(self):
        super().__init__()
        self.cafes = []

    def node(self, n):
        # Called once per node as the PBF streams; nothing is held in
        # memory beyond what the handler itself accumulates.
        if n.tags.get('amenity') == 'cafe':
            self.cafes.append({
                'lat': n.location.lat,
                'lng': n.location.lon,
                'name': n.tags.get('name', ''),
            })

h = CafeHandler()
h.apply_file('planet-latest.osm.pbf')

osmnx wraps the Overpass API for interactive, city-scale downloads. It’s not suitable for planet-scale processing but excellent for rapid prototyping and network analysis on specific regions.

DuckDB + Overture — the Overture Maps Foundation distributes OSM-derived data as GeoParquet files on S3, queryable directly with DuckDB. For many use cases, this is faster and simpler than parsing raw OSM:

INSTALL spatial; LOAD spatial;
INSTALL httpfs; LOAD httpfs;  -- required for reading directly from S3
SELECT id, geometry, names
FROM read_parquet('s3://overturemaps-us-west-2/release/2024-07-22/theme=places/**')
WHERE ST_Within(geometry, ST_GeomFromText('POLYGON((...))'));

Scale Challenges

Memory — the full planet loaded naively into Python dictionaries would require hundreds of GB of RAM. The two strategies are: (1) stream with Osmium without materialising the full graph, or (2) load into a database (PostGIS, DuckDB) and query in chunks.

Referential integrity — ways reference node IDs; relations reference way IDs. To resolve a way’s geometry, you need the coordinates of each referenced node. At planet scale, this requires either multiple passes or an indexed node cache.
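The two-pass approach can be sketched in pure Python: the first pass caches node coordinates, the second resolves each way's geometry from its node references. This is a dependency-free stand-in for what an Osmium location index does, with toy data in place of a real planet file:

```python
# Toy OSM-like input: nodes and ways as already-parsed records.
nodes = [(10, 51.50, -0.12), (11, 51.51, -0.11), (12, 51.52, -0.10)]
ways = [(1, [10, 11, 12])]  # (way_id, ordered node refs)

# Pass 1: build an indexed node cache (id -> coordinate).
node_cache = {nid: (lat, lon) for nid, lat, lon in nodes}

# Pass 2: resolve each way's geometry by looking up its refs.
def resolve_geometry(way, cache):
    way_id, refs = way
    missing = [r for r in refs if r not in cache]
    if missing:
        raise KeyError(f"way {way_id} references unknown nodes {missing}")
    return [cache[r] for r in refs]

geom = resolve_geometry(ways[0], node_cache)
```

At planet scale the cache holds billions of entries, which is why pyosmium offers on-disk location indexes rather than an in-memory dict.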

Tag normalisation — OSM has no enforced schema. shop=coffee and amenity=cafe both describe coffee shops but appear under different keys. Building a useful feature dataset requires careful tag mapping and handling of tag inconsistencies across regions and contributor communities.
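A common approach is an explicit mapping from raw key=value pairs to canonical categories. A minimal sketch, where the category names and mappings are illustrative rather than any standard taxonomy:

```python
# Map raw OSM key=value pairs to canonical feature categories.
TAG_MAP = {
    ('amenity', 'cafe'): 'coffee_shop',
    ('shop', 'coffee'): 'coffee_shop',     # different key, same concept
    ('amenity', 'restaurant'): 'restaurant',
    ('shop', 'supermarket'): 'grocery',
}

def normalise(tags: dict) -> set:
    """Return the set of canonical categories a feature's tags match."""
    return {cat for (key, val), cat in TAG_MAP.items() if tags.get(key) == val}

normalise({'amenity': 'cafe', 'name': 'Bar Italia'})  # → {'coffee_shop'}
normalise({'shop': 'coffee'})                          # → {'coffee_shop'}
```

In practice the mapping table runs to thousands of entries and must be versioned alongside the pipeline, since OSM tagging conventions drift over time and vary by region.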

Regional variation — OSM completeness varies dramatically. Dense urban areas in Europe and North America are extremely detailed; many rural and Global South regions have sparse coverage. Models trained on OSM features must account for this data quality gradient.

OSM in GeoAI Pipelines

For machine learning applications, OSM is typically processed into one of two representations:

Feature vectors per location — for each H3 cell or spatial unit, count the number of OSM features of each type (roads by class, amenities by category, building counts). This creates a tabular feature matrix suitable for standard ML.
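The counting step can be sketched with a stand-in cell function. In practice the cell would be an H3 index; here a coarse lat/lng rounding grid keeps the example dependency-free, and the input is assumed to have already passed through tag normalisation:

```python
from collections import Counter

def cell_id(lat: float, lng: float) -> tuple:
    # Stand-in for an H3 cell index: snap to a 0.01-degree grid.
    return (round(lat, 2), round(lng, 2))

features = [  # (lat, lng, canonical category)
    (51.501, -0.124, 'coffee_shop'),
    (51.502, -0.123, 'restaurant'),
    (51.513, -0.098, 'coffee_shop'),
]

# One Counter per cell: these become rows of the tabular feature matrix.
matrix = {}
for lat, lng, cat in features:
    matrix.setdefault(cell_id(lat, lng), Counter())[cat] += 1
```

The resulting dict-of-Counters converts directly to a dense matrix (cells as rows, categories as columns) for standard ML libraries.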

Graph structure — represent the road network or spatial adjacency directly as a graph, with OSM features as node and edge attributes. This is the input format for spatial GNNs.
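A dependency-free sketch of the graph representation: road segments (consecutive node pairs within a way) become edges, shared nodes become vertices, and tags become edge attributes. In practice a library such as osmnx or a GNN framework would build this structure:

```python
# Build an adjacency-list road graph from OSM-style ways.
ways = [
    (1, [10, 11, 12], {'highway': 'residential'}),
    (2, [11, 13], {'highway': 'primary'}),
]

graph = {}  # node id -> list of (neighbour id, edge attributes)
for way_id, refs, tags in ways:
    for a, b in zip(refs, refs[1:]):
        graph.setdefault(a, []).append((b, tags))
        graph.setdefault(b, []).append((a, tags))  # roads are undirected here

# Node 11 lies on both ways, so it becomes a junction of degree 3.
```

One-way streets and turn restrictions (encoded as relations) would require a directed graph with extra edge filtering, which is omitted here.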

My MORPHEME pipeline processes the full OSM planet to extract H3 cell feature matrices for 24 cities, using a multi-stage Osmium pipeline that runs in parallel across regional extracts before aggregating to a global feature store. WalkGrid similarly fuses 49 datasets — including OSM, Ordnance Survey, and Earth Observation — into an H3 cell graph for active travel routing.

Last updated 24 April 2026
