The hardest part of geospatial data engineering is rarely the analysis. It is the cleaning. Shapefiles with inconsistent encodings, GeoJSON with self-intersecting polygons, CSVs with lat/lon columns swapped, coordinate systems that differ between datasets — these are the daily realities of working with real spatial data.
GeoPandas and Shapely form the backbone of the Python geospatial stack. GeoPandas extends Pandas with a geometry column and spatial operations; Shapely provides the underlying geometry types and predicates. Together they cover the full ETL lifecycle: extract from any source format, transform geometries and attributes, and load to any target.
Setting Up
pip install geopandas shapely pyproj fiona
# For PostGIS output:
pip install sqlalchemy geoalchemy2 psycopg2-binary
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, Polygon, MultiPolygon
from shapely.ops import unary_union
from shapely.validation import make_valid
import pyproj
Extracting: Reading Any Spatial Format
GeoPandas reads most spatial formats via GDAL under the hood (through Fiona or, in newer versions, pyogrio):
# Shapefile
gdf_parishes = gpd.read_file("data/parishes.shp")
# GeoJSON
gdf_points = gpd.read_file("data/gps_events.geojson")
# GeoPackage (preferred over Shapefile for most purposes)
gdf_oa = gpd.read_file("data/output_areas.gpkg", layer="oa_2021")
# From a URL (useful for open data portals)
gdf_lsoa = gpd.read_file(
    "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/"
    "services/LSOA_Dec_2021_Boundaries_Generalised_Clipped/FeatureServer/0/query?"
    "where=1%3D1&outFields=*&f=geojson"
)
# From a CSV with lat/lon columns
df_raw = pd.read_csv("data/incidents.csv")
gdf_incidents = gpd.GeoDataFrame(
    df_raw,
    geometry=gpd.points_from_xy(df_raw.longitude, df_raw.latitude),
    crs="EPSG:4326"  # Always declare the CRS explicitly
)
Always check what you have loaded:
print(gdf_parishes.crs) # e.g. EPSG:27700 (British National Grid)
print(gdf_parishes.shape) # (rows, columns)
print(gdf_parishes.geom_type.value_counts()) # Polygon, MultiPolygon, etc.
print(gdf_parishes.geometry.is_valid.all()) # Check for broken geometries
Transforming: CRS Reprojection
Mixing coordinate reference systems is the most common source of spatial join failures. Always reproject to a common CRS before any spatial operation:
# Check mismatched CRS
print(gdf_parishes.crs) # EPSG:27700 (British National Grid, metres)
print(gdf_points.crs) # EPSG:4326 (WGS84, degrees)
# Reproject points to match polygons
gdf_points_bng = gdf_points.to_crs("EPSG:27700")
# Or reproject to a metric CRS for accurate area/distance calculations
gdf_wgs84 = gdf_parishes.to_crs("EPSG:4326") # For web mapping output
A rule of thumb: use EPSG:27700 (British National Grid) or a local UTM zone for any distance/area calculations in Great Britain. Use EPSG:4326 (WGS84) for web output and data interchange. Never compute areas or distances in degrees.
# Calculate area in hectares (BNG is in metres, so divide by 10,000)
gdf_parishes_bng = gdf_parishes.to_crs("EPSG:27700")
gdf_parishes_bng["area_ha"] = gdf_parishes_bng.geometry.area / 10_000
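The pitfall is easy to demonstrate on a toy polygon (hypothetical data, not from the datasets above): computing `.area` in EPSG:4326 returns a meaningless number in "square degrees", while the same geometry reprojected to British National Grid gives a real area in square metres.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# A roughly 1 km x 0.6 km rectangle near Greenwich, defined in WGS84 degrees
square = Polygon([(0.0, 51.0), (0.009, 51.0), (0.009, 51.009), (0.0, 51.009)])
gdf = gpd.GeoDataFrame({"id": [1]}, geometry=[square], crs="EPSG:4326")

# Area in a geographic CRS is "square degrees" -- meaningless, and GeoPandas warns
area_deg = gdf.geometry.area.iloc[0]   # ~8.1e-05

# Reproject to a metric CRS first to get a real area in square metres
area_m2 = gdf.to_crs("EPSG:27700").geometry.area.iloc[0]

print(f"degrees^2: {area_deg:.2e}, m^2: {area_m2:,.0f}")
```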
Geometry Validation and Repair
Invalid geometries — self-intersections, duplicate vertices, unclosed rings — silently break spatial operations downstream. Fix them before transforming:
# Identify invalid geometries
invalid_mask = ~gdf_parishes.geometry.is_valid
print(f"{invalid_mask.sum()} invalid geometries found")
# Inspect why they're invalid
from shapely.validation import explain_validity
for idx, geom in gdf_parishes[invalid_mask].geometry.items():
    print(f"Row {idx}: {explain_validity(geom)}")
# Repair with make_valid (Shapely 1.8+)
gdf_parishes["geometry"] = gdf_parishes.geometry.apply(
    lambda g: make_valid(g) if not g.is_valid else g
)
# make_valid can return GeometryCollections — flatten to Polygon/MultiPolygon
def to_multipolygon(geom):
    from shapely.geometry import GeometryCollection
    if isinstance(geom, (Polygon, MultiPolygon)):
        return geom
    if isinstance(geom, GeometryCollection):
        polys = [g for g in geom.geoms if isinstance(g, (Polygon, MultiPolygon))]
        if polys:
            return unary_union(polys)
    return None
gdf_parishes["geometry"] = gdf_parishes.geometry.apply(to_multipolygon)
gdf_parishes = gdf_parishes[gdf_parishes.geometry.notna()].copy()
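To see the repair flow in isolation, here is a minimal sketch on a synthetic "bowtie" polygon, the classic self-intersection case. The geometry is a toy example, not from the parishes data; the exact `explain_validity` message and the output geometry type can vary slightly between GEOS versions.

```python
from shapely.geometry import Polygon
from shapely.validation import make_valid, explain_validity

# A "bowtie": the edges (0,0)-(2,2) and (2,0)-(0,2) cross at (1,1),
# so the ring self-intersects and the polygon is invalid
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
print(bowtie.is_valid)            # False
print(explain_validity(bowtie))   # e.g. "Self-intersection[1 1]"

# make_valid splits it at the crossing point into two triangles
repaired = make_valid(bowtie)
print(repaired.geom_type)         # typically MultiPolygon
print(repaired.is_valid, repaired.area)
```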
Spatial Join: Point-in-Polygon Enrichment
The most common spatial ETL operation: assign each point the attributes of the polygon it falls within.
# Spatial join — assigns parish attributes to each point
gdf_enriched = gpd.sjoin(
    gdf_points_bng,       # left: points to enrich
    gdf_parishes_bng,     # right: polygons with attributes
    how="left",           # keep all points; unmatched get NaN
    predicate="within"    # alternatives include 'intersects' and 'covered_by'
)
# Clean up the index columns added by sjoin
gdf_enriched = gdf_enriched.drop(columns=["index_right"])
print(gdf_enriched[["event_id", "geometry", "PARISH_NAME", "LAD23CD"]].head())
For large datasets, spatial joins can be slow. GeoPandas uses an STR-tree index internally, which helps, but for millions of points against thousands of polygons you may want to pre-filter by bounding box:
# Pre-filter points to bounding box of study area
bbox = gdf_parishes_bng.total_bounds # (minx, miny, maxx, maxy)
gdf_points_clip = gdf_points_bng.cx[bbox[0]:bbox[2], bbox[1]:bbox[3]]
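The join pattern is easiest to verify on synthetic data. The sketch below uses two made-up "parishes" and three events (all names and coordinates are hypothetical): points falling inside a polygon pick up its attributes, and the point outside both gets NaN because of `how="left"`.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two toy "parishes" side by side, in a metric CRS
parishes = gpd.GeoDataFrame(
    {"PARISH_NAME": ["Ashford", "Bexley"]},
    geometry=[
        Polygon([(0, 0), (10, 0), (10, 10), (0, 10)]),
        Polygon([(10, 0), (20, 0), (20, 10), (10, 10)]),
    ],
    crs="EPSG:27700",
)

# Three events: one in each parish, one outside both
points = gpd.GeoDataFrame(
    {"event_id": [1, 2, 3]},
    geometry=[Point(5, 5), Point(15, 5), Point(30, 5)],
    crs="EPSG:27700",
)

enriched = gpd.sjoin(points, parishes, how="left", predicate="within")
print(enriched[["event_id", "PARISH_NAME"]])
# event 3 falls outside both polygons, so its PARISH_NAME is NaN
```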
Dissolve and Aggregate
Dissolving polygons by an attribute (e.g. merging parishes into districts) is a common transformation:
# Dissolve parishes into local authority districts
gdf_lads = gdf_parishes_bng.dissolve(
    by="LAD23CD",
    aggfunc={
        "PARISH_NAME": "count",  # count of parishes per LAD
        "area_ha": "sum",        # total area
        "population": "sum"      # sum a numeric attribute
    }
).rename(columns={"PARISH_NAME": "parish_count"})
gdf_lads = gdf_lads.reset_index()
print(gdf_lads.shape)
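A self-contained sketch of the same pattern, on four toy parishes in two hypothetical districts, shows what dissolve does: geometries sharing a `LAD23CD` are unioned into one, and the attribute columns are aggregated per group.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Four unit-square "parishes" in two districts (codes are made up)
parishes = gpd.GeoDataFrame(
    {
        "PARISH_NAME": ["P1", "P2", "P3", "P4"],
        "LAD23CD": ["E0001", "E0001", "E0002", "E0002"],
        "population": [100, 200, 300, 400],
    },
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
        Polygon([(2, 0), (3, 0), (3, 1), (2, 1)]),
        Polygon([(3, 0), (4, 0), (4, 1), (3, 1)]),
    ],
    crs="EPSG:27700",
)

lads = parishes.dissolve(
    by="LAD23CD",
    aggfunc={"PARISH_NAME": "count", "population": "sum"},
).rename(columns={"PARISH_NAME": "parish_count"}).reset_index()

# Two rows, each geometry a merged 2x1 rectangle with area 2.0
print(lads[["LAD23CD", "parish_count", "population"]])
```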
Loading: Writing to Multiple Targets
# GeoPackage (lossless, no encoding issues, single file)
gdf_enriched.to_file("output/enriched_events.gpkg", driver="GPKG")
# GeoJSON (for web mapping)
gdf_enriched.to_crs("EPSG:4326").to_file(
    "output/enriched_events.geojson", driver="GeoJSON"
)
# GeoParquet (columnar, efficient for data engineering pipelines)
gdf_enriched.to_parquet("output/enriched_events.parquet")
# PostGIS via SQLAlchemy + GeoAlchemy2
from sqlalchemy import create_engine
engine = create_engine(
    "postgresql+psycopg2://user:password@localhost:5432/spatial_db"
)
gdf_enriched.to_postgis(
    "enriched_events",
    engine,
    if_exists="replace",
    index=False
)
A Complete Pipeline Function
Putting it all together into a reusable ETL function:
def run_spatial_etl(
    points_path: str,
    polygons_path: str,
    output_path: str,
    join_crs: str = "EPSG:27700"
) -> gpd.GeoDataFrame:
    print("Loading data...")
    gdf_points = gpd.read_file(points_path)
    gdf_polygons = gpd.read_file(polygons_path)

    print("Repairing geometries...")
    gdf_polygons["geometry"] = gdf_polygons.geometry.apply(
        lambda g: make_valid(g) if not g.is_valid else g
    )

    print(f"Reprojecting to {join_crs}...")
    gdf_points = gdf_points.to_crs(join_crs)
    gdf_polygons = gdf_polygons.to_crs(join_crs)

    print("Running spatial join...")
    gdf_result = gpd.sjoin(gdf_points, gdf_polygons, how="left", predicate="within")
    gdf_result = gdf_result.drop(columns=["index_right"])

    print(f"Writing output to {output_path}...")
    gdf_result.to_crs("EPSG:4326").to_file(output_path, driver="GeoJSON")
    print(f"Done. {len(gdf_result)} records written.")
    return gdf_result
result = run_spatial_etl(
    "data/gps_events.geojson",
    "data/output_areas.gpkg",
    "output/enriched_events.geojson"
)
Performance Tips for Large Datasets
When processing datasets with millions of features:
- Use Dask-GeoPandas for parallelism: `pip install dask-geopandas` — it partitions GeoDataFrames across CPU cores
- Pre-clip to your study area with `.cx[]` indexing before heavy operations
- Use GeoParquet as an intermediate format — it is an order of magnitude faster to read than GeoJSON for large files
- Spatial indexes are built automatically in GeoPandas 0.10+ for `sjoin`, but access `gdf.sindex` explicitly to warm the index up if running multiple joins against the same dataset
- Stream large CSVs with `pd.read_csv(chunksize=100_000)` and build the GeoDataFrame incrementally
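The chunked-CSV approach can be sketched as follows. This is a toy example using an in-memory CSV (the `event_id`/`longitude`/`latitude` columns mirror the incidents file from earlier; the values are made up) with a tiny chunk size so the loop is visible — in practice you would read from a file path with `chunksize=100_000`.

```python
import io
import pandas as pd
import geopandas as gpd

# In-memory stand-in for a large CSV on disk
csv_data = io.StringIO(
    "event_id,longitude,latitude\n"
    "1,-0.1,51.5\n"
    "2,-0.2,51.6\n"
    "3,-0.3,51.7\n"
    "4,-0.4,51.8\n"
)

chunks = []
for chunk in pd.read_csv(csv_data, chunksize=2):  # tiny chunksize for the demo
    # Convert each chunk to a GeoDataFrame as it streams in
    chunks.append(
        gpd.GeoDataFrame(
            chunk,
            geometry=gpd.points_from_xy(chunk.longitude, chunk.latitude),
            crs="EPSG:4326",
        )
    )

# Concatenating GeoDataFrames with a shared CRS yields a GeoDataFrame
gdf = pd.concat(chunks, ignore_index=True)
print(len(gdf), gdf.crs)
```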
GeoPandas and Shapely will carry you through the large majority of geospatial ETL tasks you will encounter in a data engineering or geospatial engineering role. The patterns here — read, validate, reproject, join, aggregate, write — repeat across almost every spatial pipeline, regardless of the specific data or domain.