The hardest part of geospatial data engineering is rarely the analysis. It is the cleaning. Shapefiles with inconsistent encodings, GeoJSON with self-intersecting polygons, CSVs with lat/lon columns swapped, coordinate systems that differ between datasets — these are the daily realities of working with real spatial data.
GeoPandas and Shapely form the backbone of the Python geospatial stack. GeoPandas extends Pandas with a geometry column and spatial operations; Shapely provides the underlying geometry types and predicates. Together they cover the full ETL lifecycle: extract from any source format, transform geometries and attributes, and load to any target.
Setting Up
pip install geopandas shapely pyproj fiona
# For PostGIS output:
pip install sqlalchemy geoalchemy2 psycopg2-binary
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, Polygon, MultiPolygon
from shapely.ops import unary_union
from shapely.validation import make_valid
import pyproj
Extracting: Reading Any Spatial Format
GeoPandas reads most spatial formats via GDAL under the hood (through Fiona or, in newer versions, pyogrio):
# Shapefile
gdf_parishes = gpd.read_file("data/parishes.shp")
# GeoJSON
gdf_points = gpd.read_file("data/gps_events.geojson")
# GeoPackage (preferred over Shapefile for most purposes)
gdf_oa = gpd.read_file("data/output_areas.gpkg", layer="oa_2021")
# From a URL (useful for open data portals)
gdf_lsoa = gpd.read_file(
    "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/"
    "services/LSOA_Dec_2021_Boundaries_Generalised_Clipped/FeatureServer/0/query?"
    "where=1%3D1&outFields=*&f=geojson"
)
# From a CSV with lat/lon columns
df_raw = pd.read_csv("data/incidents.csv")
gdf_incidents = gpd.GeoDataFrame(
    df_raw,
    geometry=gpd.points_from_xy(df_raw.longitude, df_raw.latitude),
    crs="EPSG:4326"  # Always declare the CRS explicitly
)
Always check what you have loaded:
print(gdf_parishes.crs) # e.g. EPSG:27700 (British National Grid)
print(gdf_parishes.shape) # (rows, columns)
print(gdf_parishes.geom_type.value_counts()) # Polygon, MultiPolygon, etc.
print(gdf_parishes.geometry.is_valid.all()) # Check for broken geometries
Transforming: CRS Reprojection
Mixing coordinate reference systems is the most common source of spatial join failures. Always reproject to a common CRS before any spatial operation:
# Check mismatched CRS
print(gdf_parishes.crs) # EPSG:27700 (British National Grid, metres)
print(gdf_points.crs) # EPSG:4326 (WGS84, degrees)
# Reproject points to match polygons
gdf_points_bng = gdf_points.to_crs("EPSG:27700")
# Or reproject to a metric CRS for accurate area/distance calculations
gdf_wgs84 = gdf_parishes.to_crs("EPSG:4326") # For web mapping output
A rule of thumb: use EPSG:27700 (British National Grid) or a local UTM zone for any distance/area calculations in Great Britain. Use EPSG:4326 (WGS84) for web output and data interchange. Never compute areas or distances in degrees.
# Calculate area in hectares (BNG is in metres, so divide by 10,000)
gdf_parishes_bng = gdf_parishes.to_crs("EPSG:27700")
gdf_parishes_bng["area_ha"] = gdf_parishes_bng.geometry.area / 10_000
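The pitfall is easy to demonstrate on a toy polygon (hypothetical data, not from the datasets above): computing `.area` in EPSG:4326 returns a meaningless number in "square degrees", while the same geometry reprojected to British National Grid gives a real area in square metres.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# A roughly 1 km x 0.6 km rectangle near Greenwich, defined in WGS84 degrees
square = Polygon([(0.0, 51.0), (0.009, 51.0), (0.009, 51.009), (0.0, 51.009)])
gdf = gpd.GeoDataFrame({"id": [1]}, geometry=[square], crs="EPSG:4326")

# Area in a geographic CRS is "square degrees" -- meaningless, and GeoPandas warns
area_deg = gdf.geometry.area.iloc[0]   # ~8.1e-05

# Reproject to a metric CRS first to get a real area in square metres
area_m2 = gdf.to_crs("EPSG:27700").geometry.area.iloc[0]

print(f"degrees^2: {area_deg:.2e}, m^2: {area_m2:,.0f}")
```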
Geometry Validation and Repair
Invalid geometries — self-intersections, duplicate vertices, unclosed rings — silently break spatial operations downstream. Fix them before transforming:
# Identify invalid geometries
invalid_mask = ~gdf_parishes.geometry.is_valid
print(f"{invalid_mask.sum()} invalid geometries found")
# Inspect why they're invalid
from shapely.validation import explain_validity
for idx, geom in gdf_parishes[invalid_mask].geometry.items():
    print(f"Row {idx}: {explain_validity(geom)}")
# Repair with make_valid (Shapely 1.8+)
gdf_parishes["geometry"] = gdf_parishes.geometry.apply(
    lambda g: make_valid(g) if not g.is_valid else g
)
# make_valid can return GeometryCollections — flatten to Polygon/MultiPolygon
def to_multipolygon(geom):
    from shapely.geometry import GeometryCollection
    if isinstance(geom, (Polygon, MultiPolygon)):
        return geom
    if isinstance(geom, GeometryCollection):
        polys = [g for g in geom.geoms if isinstance(g, (Polygon, MultiPolygon))]
        if polys:
            return unary_union(polys)
    return None
gdf_parishes["geometry"] = gdf_parishes.geometry.apply(to_multipolygon)
gdf_parishes = gdf_parishes[gdf_parishes.geometry.notna()].copy()
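To see the repair flow in isolation, here is a minimal sketch on a synthetic "bowtie" polygon, the classic self-intersection case. The geometry is a toy example, not from the parishes data; the exact `explain_validity` message and the output geometry type can vary slightly between GEOS versions.

```python
from shapely.geometry import Polygon
from shapely.validation import make_valid, explain_validity

# A "bowtie": the edges (0,0)-(2,2) and (2,0)-(0,2) cross at (1,1),
# so the ring self-intersects and the polygon is invalid
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
print(bowtie.is_valid)            # False
print(explain_validity(bowtie))   # e.g. "Self-intersection[1 1]"

# make_valid splits it at the crossing point into two triangles
repaired = make_valid(bowtie)
print(repaired.geom_type)         # typically MultiPolygon
print(repaired.is_valid, repaired.area)
```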
Spatial Join: Point-in-Polygon Enrichment
The most common spatial ETL operation: assign each point the attributes of the polygon it falls within.
# Spatial join — assigns parish attributes to each point
gdf_enriched = gpd.sjoin(
    gdf_points_bng,       # left: points to enrich
    gdf_parishes_bng,     # right: polygons with attributes
    how="left",           # keep all points; unmatched get NaN
    predicate="within"    # alternatives include 'intersects' and 'covered_by'
)
# Clean up the index columns added by sjoin
gdf_enriched = gdf_enriched.drop(columns=["index_right"])
print(gdf_enriched[["event_id", "geometry", "PARISH_NAME", "LAD23CD"]].head())
For large datasets, spatial joins can be slow. GeoPandas uses an STR-tree index internally, which helps, but for millions of points against thousands of polygons you may want to pre-filter by bounding box:
# Pre-filter points to bounding box of study area
bbox = gdf_parishes_bng.total_bounds # (minx, miny, maxx, maxy)
gdf_points_clip = gdf_points_bng.cx[bbox[0]:bbox[2], bbox[1]:bbox[3]]
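The join pattern is easiest to verify on synthetic data. The sketch below uses two made-up "parishes" and three events (all names and coordinates are hypothetical): points falling inside a polygon pick up its attributes, and the point outside both gets NaN because of `how="left"`.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two toy "parishes" side by side, in a metric CRS
parishes = gpd.GeoDataFrame(
    {"PARISH_NAME": ["Ashford", "Bexley"]},
    geometry=[
        Polygon([(0, 0), (10, 0), (10, 10), (0, 10)]),
        Polygon([(10, 0), (20, 0), (20, 10), (10, 10)]),
    ],
    crs="EPSG:27700",
)

# Three events: one in each parish, one outside both
points = gpd.GeoDataFrame(
    {"event_id": [1, 2, 3]},
    geometry=[Point(5, 5), Point(15, 5), Point(30, 5)],
    crs="EPSG:27700",
)

enriched = gpd.sjoin(points, parishes, how="left", predicate="within")
print(enriched[["event_id", "PARISH_NAME"]])
# event 3 falls outside both polygons, so its PARISH_NAME is NaN
```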
Dissolve and Aggregate
Dissolving polygons by an attribute (e.g. merging parishes into districts) is a common transformation:
# Dissolve parishes into local authority districts
gdf_lads = gdf_parishes_bng.dissolve(
    by="LAD23CD",
    aggfunc={
        "PARISH_NAME": "count",  # count of parishes per LAD
        "area_ha": "sum",        # total area
        "population": "sum"      # sum a numeric attribute
    }
).rename(columns={"PARISH_NAME": "parish_count"})
gdf_lads = gdf_lads.reset_index()
print(gdf_lads.shape)
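A self-contained sketch of the same pattern, on four toy parishes in two hypothetical districts, shows what dissolve does: geometries sharing a `LAD23CD` are unioned into one, and the attribute columns are aggregated per group.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Four unit-square "parishes" in two districts (codes are made up)
parishes = gpd.GeoDataFrame(
    {
        "PARISH_NAME": ["P1", "P2", "P3", "P4"],
        "LAD23CD": ["E0001", "E0001", "E0002", "E0002"],
        "population": [100, 200, 300, 400],
    },
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
        Polygon([(2, 0), (3, 0), (3, 1), (2, 1)]),
        Polygon([(3, 0), (4, 0), (4, 1), (3, 1)]),
    ],
    crs="EPSG:27700",
)

lads = parishes.dissolve(
    by="LAD23CD",
    aggfunc={"PARISH_NAME": "count", "population": "sum"},
).rename(columns={"PARISH_NAME": "parish_count"}).reset_index()

# Two rows, each geometry a merged 2x1 rectangle with area 2.0
print(lads[["LAD23CD", "parish_count", "population"]])
```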
Loading: Writing to Multiple Targets
# GeoPackage (lossless, no encoding issues, single file)
gdf_enriched.to_file("output/enriched_events.gpkg", driver="GPKG")
# GeoJSON (for web mapping)
gdf_enriched.to_crs("EPSG:4326").to_file(
    "output/enriched_events.geojson", driver="GeoJSON"
)
# GeoParquet (columnar, efficient for data engineering pipelines)
gdf_enriched.to_parquet("output/enriched_events.parquet")
# PostGIS via SQLAlchemy + GeoAlchemy2
from sqlalchemy import create_engine
engine = create_engine(
    "postgresql+psycopg2://user:password@localhost:5432/spatial_db"
)
gdf_enriched.to_postgis(
    "enriched_events",
    engine,
    if_exists="replace",
    index=False
)
A Complete Pipeline Function
Putting it all together into a reusable ETL function:
def run_spatial_etl(
    points_path: str,
    polygons_path: str,
    output_path: str,
    join_crs: str = "EPSG:27700"
) -> gpd.GeoDataFrame:
    print("Loading data...")
    gdf_points = gpd.read_file(points_path)
    gdf_polygons = gpd.read_file(polygons_path)

    print("Repairing geometries...")
    gdf_polygons["geometry"] = gdf_polygons.geometry.apply(
        lambda g: make_valid(g) if not g.is_valid else g
    )

    print(f"Reprojecting to {join_crs}...")
    gdf_points = gdf_points.to_crs(join_crs)
    gdf_polygons = gdf_polygons.to_crs(join_crs)

    print("Running spatial join...")
    gdf_result = gpd.sjoin(gdf_points, gdf_polygons, how="left", predicate="within")
    gdf_result = gdf_result.drop(columns=["index_right"])

    print(f"Writing output to {output_path}...")
    gdf_result.to_crs("EPSG:4326").to_file(output_path, driver="GeoJSON")
    print(f"Done. {len(gdf_result)} records written.")
    return gdf_result
result = run_spatial_etl(
    "data/gps_events.geojson",
    "data/output_areas.gpkg",
    "output/enriched_events.geojson"
)
Performance Tips for Large Datasets
When processing datasets with millions of features:
- Use Dask-GeoPandas for parallelism: `pip install dask-geopandas` — it partitions GeoDataFrames across CPU cores
- Pre-clip to your study area with `.cx[]` indexing before heavy operations
- Use GeoParquet as an intermediate format — it is an order of magnitude faster to read than GeoJSON for large files
- Spatial indexes are built automatically in GeoPandas 0.10+ for `sjoin`, but access `gdf.sindex` explicitly to warm the index up if running multiple joins against the same dataset
- Stream large CSVs with `pd.read_csv(chunksize=100_000)` and build the GeoDataFrame incrementally
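The chunked-CSV approach can be sketched as follows. This is a toy example using an in-memory CSV (the `event_id`/`longitude`/`latitude` columns mirror the incidents file from earlier; the values are made up) with a tiny chunk size so the loop is visible — in practice you would read from a file path with `chunksize=100_000`.

```python
import io
import pandas as pd
import geopandas as gpd

# In-memory stand-in for a large CSV on disk
csv_data = io.StringIO(
    "event_id,longitude,latitude\n"
    "1,-0.1,51.5\n"
    "2,-0.2,51.6\n"
    "3,-0.3,51.7\n"
    "4,-0.4,51.8\n"
)

chunks = []
for chunk in pd.read_csv(csv_data, chunksize=2):  # tiny chunksize for the demo
    # Convert each chunk to a GeoDataFrame as it streams in
    chunks.append(
        gpd.GeoDataFrame(
            chunk,
            geometry=gpd.points_from_xy(chunk.longitude, chunk.latitude),
            crs="EPSG:4326",
        )
    )

# Concatenating GeoDataFrames with a shared CRS yields a GeoDataFrame
gdf = pd.concat(chunks, ignore_index=True)
print(len(gdf), gdf.crs)
```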
GeoPandas and Shapely will carry you through the large majority of geospatial ETL tasks you will encounter in a data engineering or geospatial engineering role. The patterns here — read, validate, reproject, join, aggregate, write — repeat across almost every spatial pipeline, regardless of the specific data or domain.