
Spatial ETL Pipelines with GeoPandas and Shapely

Building robust, reproducible geospatial data transformation workflows entirely in Python


"Most geospatial data arrives dirty: mismatched projections, broken geometries, inconsistent attribute schemas. GeoPandas and Shapely give you a powerful, Pythonic toolkit for cleaning, transforming, and enriching spatial data before it reaches your database or analysis pipeline."

The hardest part of geospatial data engineering is rarely the analysis. It is the cleaning. Shapefiles with inconsistent encodings, GeoJSON with self-intersecting polygons, CSVs with lat/lon columns swapped, coordinate systems that differ between datasets — these are the daily realities of working with real spatial data.

GeoPandas and Shapely form the backbone of the Python geospatial stack. GeoPandas extends Pandas with a geometry column and spatial operations; Shapely provides the underlying geometry types and predicates. Together they cover the full ETL lifecycle: extract from any source format, transform geometries and attributes, and load to any target.

Setting Up

pip install geopandas shapely pyproj fiona
# For PostGIS output:
pip install sqlalchemy geoalchemy2 psycopg2-binary

import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, Polygon, MultiPolygon
from shapely.ops import unary_union
from shapely.validation import make_valid
import pyproj

Extracting: Reading Any Spatial Format

GeoPandas reads most spatial formats via GDAL/Fiona under the hood:

# Shapefile
gdf_parishes = gpd.read_file("data/parishes.shp")

# GeoJSON
gdf_points = gpd.read_file("data/gps_events.geojson")

# GeoPackage (preferred over Shapefile for most purposes)
gdf_oa = gpd.read_file("data/output_areas.gpkg", layer="oa_2021")

# From a URL (useful for open data portals)
gdf_lsoa = gpd.read_file(
    "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/"
    "services/LSOA_Dec_2021_Boundaries_Generalised_Clipped/FeatureServer/0/query?"
    "where=1%3D1&outFields=*&f=geojson"
)

# From a CSV with lat/lon columns
df_raw = pd.read_csv("data/incidents.csv")
gdf_incidents = gpd.GeoDataFrame(
    df_raw,
    geometry=gpd.points_from_xy(df_raw.longitude, df_raw.latitude),
    crs="EPSG:4326"  # Always declare the CRS explicitly
)

Always check what you have loaded:

print(gdf_parishes.crs)          # e.g. EPSG:27700 (British National Grid)
print(gdf_parishes.shape)        # (rows, columns)
print(gdf_parishes.geom_type.value_counts())  # Polygon, MultiPolygon, etc.
print(gdf_parishes.geometry.is_valid.all())   # Check for broken geometries

Transforming: CRS Reprojection

Mixing coordinate reference systems is the most common source of spatial join failures. Always reproject to a common CRS before any spatial operation:

# Check mismatched CRS
print(gdf_parishes.crs)   # EPSG:27700 (British National Grid, metres)
print(gdf_points.crs)     # EPSG:4326 (WGS84, degrees)

# Reproject points to match polygons
gdf_points_bng = gdf_points.to_crs("EPSG:27700")

# Or produce a WGS84 copy for web mapping output
gdf_wgs84 = gdf_parishes.to_crs("EPSG:4326")

A rule of thumb: use EPSG:27700 (British National Grid) or a local UTM zone for any distance/area calculations in Great Britain. Use EPSG:4326 (WGS84) for web output and data interchange. Never compute areas or distances in degrees.
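To see why degrees mislead: Shapely's planar .area in EPSG:4326 is measured in square degrees, which have no physical meaning. A minimal sketch, using two hypothetical 1° × 1° cells:

```python
from shapely.geometry import box

# Two 1-degree by 1-degree cells: one at the equator, one at 60N
cell_equator = box(0, 0, 1, 1)
cell_north = box(0, 60, 1, 61)

# Planar area in EPSG:4326 is in square degrees -- identical here,
# even though the real ground area at 60N is roughly half
print(cell_equator.area, cell_north.area)  # 1.0 1.0
```

A degree of longitude shrinks with latitude (by cos(latitude)), so equal "square degree" areas correspond to very different ground areas. Reprojecting to a metric CRS first makes .area and .distance physically meaningful.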

# Calculate area in hectares (BNG is in metres, so divide by 10,000)
gdf_parishes_bng = gdf_parishes.to_crs("EPSG:27700")
gdf_parishes_bng["area_ha"] = gdf_parishes_bng.geometry.area / 10_000

Geometry Validation and Repair

Invalid geometries — self-intersections, duplicate vertices, unclosed rings — silently break spatial operations downstream. Fix them before transforming:

# Identify invalid geometries
invalid_mask = ~gdf_parishes.geometry.is_valid
print(f"{invalid_mask.sum()} invalid geometries found")

# Inspect why they're invalid
from shapely.validation import explain_validity
for idx, geom in gdf_parishes[invalid_mask].geometry.items():
    print(f"Row {idx}: {explain_validity(geom)}")

# Repair with make_valid (Shapely 1.8+)
gdf_parishes["geometry"] = gdf_parishes.geometry.apply(
    lambda g: make_valid(g) if not g.is_valid else g
)

# make_valid can return GeometryCollections — flatten to Polygon/MultiPolygon
def to_multipolygon(geom):
    from shapely.geometry import GeometryCollection
    if isinstance(geom, (Polygon, MultiPolygon)):
        return geom
    if isinstance(geom, GeometryCollection):
        polys = [g for g in geom.geoms if isinstance(g, (Polygon, MultiPolygon))]
        if polys:
            return unary_union(polys)
    return None

gdf_parishes["geometry"] = gdf_parishes.geometry.apply(to_multipolygon)
gdf_parishes = gdf_parishes[gdf_parishes.geometry.notna()].copy()
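To see what explain_validity and make_valid actually do, here is a Shapely-only sketch using a hypothetical "bowtie" polygon whose ring crosses itself:

```python
from shapely.geometry import Polygon
from shapely.validation import explain_validity, make_valid

# A "bowtie" ring that crosses itself at (1, 1) -- a classic invalid polygon
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
print(bowtie.is_valid)           # False
print(explain_validity(bowtie))  # reports the self-intersection

# make_valid splits the bowtie at the crossing point
repaired = make_valid(bowtie)
print(repaired.is_valid, repaired.geom_type)  # True, typically MultiPolygon
```

The repaired result is the two triangles on either side of the crossing, which is why the flattening step above is needed: make_valid preserves the input's area but may change its geometry type.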

Spatial Join: Point-in-Polygon Enrichment

The most common spatial ETL operation: assign each point the attributes of the polygon it falls within.

# Spatial join — assigns parish attributes to each point
gdf_enriched = gpd.sjoin(
    gdf_points_bng,       # left: points to enrich
    gdf_parishes_bng,     # right: polygons with attributes
    how="left",           # keep all points; unmatched get NaN
    predicate="within"    # alternatives: 'intersects', 'contains', 'crosses'
)

# Clean up the index columns added by sjoin
gdf_enriched = gdf_enriched.drop(columns=["index_right"])

print(gdf_enriched[["event_id", "geometry", "PARISH_NAME", "LAD23CD"]].head())

For large datasets, spatial joins can be slow. GeoPandas uses an STR-tree index internally, which helps, but for millions of points against thousands of polygons you may want to pre-filter by bounding box:

# Pre-filter points to bounding box of study area
bbox = gdf_parishes_bng.total_bounds  # (minx, miny, maxx, maxy)
gdf_points_clip = gdf_points_bng.cx[bbox[0]:bbox[2], bbox[1]:bbox[3]]
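Under the hood, sjoin builds an STR-tree over one side of the join. A minimal Shapely 2.x sketch of the same idea, with toy boxes standing in for the parish polygons (assumed data, not the article's):

```python
from shapely.geometry import Point, box
from shapely.strtree import STRtree

# Toy polygons standing in for parishes
polygons = [box(0, 0, 10, 10), box(10, 0, 20, 10)]
tree = STRtree(polygons)  # bulk-loads an STR-packed R-tree

pt = Point(15, 5)
# In Shapely 2.x, query() returns integer positions into the input list;
# the predicate filters candidates found by the bounding-box search
hits = tree.query(pt, predicate="within")
print([int(i) for i in hits])  # [1]
```

The bounding-box search prunes most candidates cheaply, and the exact predicate runs only on the survivors, which is why indexed joins scale far better than pairwise comparison.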

Dissolve and Aggregate

Dissolving polygons by an attribute (e.g. merging parishes into districts) is a common transformation:

# Dissolve parishes into local authority districts
gdf_lads = gdf_parishes_bng.dissolve(
    by="LAD23CD",
    aggfunc={
        "PARISH_NAME": "count",      # count of parishes per LAD
        "area_ha": "sum",            # total area
        "population": "sum"          # sum a numeric attribute
    }
).rename(columns={"PARISH_NAME": "parish_count"})

gdf_lads = gdf_lads.reset_index()
print(gdf_lads.shape)

Loading: Writing to Multiple Targets

# GeoPackage (lossless, no encoding issues, single file)
gdf_enriched.to_file("output/enriched_events.gpkg", driver="GPKG")

# GeoJSON (for web mapping)
gdf_enriched.to_crs("EPSG:4326").to_file(
    "output/enriched_events.geojson", driver="GeoJSON"
)

# GeoParquet (columnar, efficient for data engineering pipelines)
gdf_enriched.to_parquet("output/enriched_events.parquet")

# PostGIS via SQLAlchemy + GeoAlchemy2
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://user:password@localhost:5432/spatial_db"
)
gdf_enriched.to_postgis(
    "enriched_events",
    engine,
    if_exists="replace",
    index=False
)

A Complete Pipeline Function

Putting it all together into a reusable ETL function:

def run_spatial_etl(
    points_path: str,
    polygons_path: str,
    output_path: str,
    join_crs: str = "EPSG:27700"
) -> gpd.GeoDataFrame:
    """Read points and polygons, repair and reproject geometries,
    spatially join, and write the enriched result as WGS84 GeoJSON."""

    print("Loading data...")
    gdf_points = gpd.read_file(points_path)
    gdf_polygons = gpd.read_file(polygons_path)

    print("Repairing geometries...")
    gdf_polygons["geometry"] = gdf_polygons.geometry.apply(
        lambda g: make_valid(g) if not g.is_valid else g
    )

    print(f"Reprojecting to {join_crs}...")
    gdf_points = gdf_points.to_crs(join_crs)
    gdf_polygons = gdf_polygons.to_crs(join_crs)

    print("Running spatial join...")
    gdf_result = gpd.sjoin(gdf_points, gdf_polygons, how="left", predicate="within")
    gdf_result = gdf_result.drop(columns=["index_right"])

    print(f"Writing output to {output_path}...")
    gdf_result.to_crs("EPSG:4326").to_file(output_path, driver="GeoJSON")

    print(f"Done. {len(gdf_result)} records written.")
    return gdf_result

result = run_spatial_etl(
    "data/gps_events.geojson",
    "data/output_areas.gpkg",
    "output/enriched_events.geojson"
)

Performance Tips for Large Datasets

When processing datasets with millions of features:

  1. Use Dask-GeoPandas for parallelism: pip install dask-geopandas — it partitions GeoDataFrames across CPU cores
  2. Pre-clip to your study area with .cx[] indexing before heavy operations
  3. Use GeoParquet as an intermediate format — it is an order of magnitude faster to read than GeoJSON for large files
  4. GeoPandas builds its spatial index lazily for sjoin; access gdf.sindex explicitly to warm it up if you are running multiple joins against the same dataset
  5. Stream large CSVs with pd.read_csv(chunksize=100_000) and build the GeoDataFrame incrementally
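Tip 5 can be sketched as follows, using a synthetic in-memory CSV with hypothetical columns in place of a large file on disk. In a real pipeline you would wrap the concatenated result in gpd.GeoDataFrame with an explicit CRS:

```python
import io
import pandas as pd
from shapely.geometry import Point

# In-memory CSV standing in for a large incidents file (hypothetical columns)
csv_text = "event_id,longitude,latitude\n" + "\n".join(
    f"{i},{-1.5 + i * 0.001:.3f},53.800" for i in range(250)
)

chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=100):
    # Build geometries chunk by chunk so peak memory stays bounded
    chunk["geometry"] = [
        Point(x, y) for x, y in zip(chunk["longitude"], chunk["latitude"])
    ]
    chunks.append(chunk)

df_all = pd.concat(chunks, ignore_index=True)
print(len(df_all))  # 250
```

Because each chunk is processed and released before the next is read, peak memory is governed by the chunk size rather than the file size.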

GeoPandas and Shapely will carry you through the large majority of geospatial ETL tasks you will encounter in a data engineering or geospatial engineering role. The patterns here — read, validate, reproject, join, aggregate, write — repeat across almost every spatial pipeline, regardless of the specific data or domain.

Dr James Williams
Research Fellow

Researching the intersection of place, maps, and technology.


Posts on this blog are refined using AI. All ideas, research, and technical content originate with the author; AI assists with drafting and editing.