Setting up a PostGIS instance to do spatial analysis is, at minimum, a twenty-minute affair: install PostgreSQL, install PostGIS, configure the server, create a database, load your data, write your queries. For exploratory analysis or research pipelines that need to run on a laptop without a running server, this overhead is often unjustifiable.
DuckDB offers a different model. It is an in-process OLAP database — meaning it runs embedded in your Python process, your Rust binary, or your R session, with no network socket, no background service, and no configuration files. Load a file, write SQL, get results. The spatial extension adds full geometry support on top of this.
Installing the Spatial Extension
The spatial extension is bundled with DuckDB but needs to be explicitly loaded. In any DuckDB session:
INSTALL spatial;
LOAD spatial;
This is a one-time install per machine. After that, LOAD spatial; at the start of each session is sufficient. The extension adds support for the familiar spatial functions from PostGIS: ST_Within, ST_Distance, ST_Intersects, ST_Area, and many more, including geometry type constructors.
Reading Spatial Data from Parquet
One of DuckDB’s strongest features is its native Parquet support. You can query a Parquet file on disk (or in S3) without importing it first:
-- Query a Parquet file of GPS points directly
SELECT
id,
lat,
lng,
ST_Point(lng, lat) AS geom
FROM read_parquet('gps_points.parquet')
LIMIT 5;
For a file containing 5 million GPS records, this query returns in under a second on a modern laptop. DuckDB reads only the columns it needs (columnar projection pushdown) and uses vectorised execution throughout.
Spatial Filtering with ST_Within
Now the interesting part. Suppose you want to filter all GPS points to those within a specific administrative boundary — say, the City of Nottingham polygon. You can define the polygon inline as WKT:
LOAD spatial;
-- Filter points inside a bounding polygon
WITH boundary AS (
SELECT ST_GeomFromText(
'POLYGON((-1.25 52.88, -1.08 52.88, -1.08 53.02, -1.25 53.02, -1.25 52.88))'
) AS geom
)
SELECT
p.id,
p.lat,
p.lng,
ST_Distance(
ST_Point(p.lng, p.lat)::GEOMETRY,
ST_Centroid(b.geom)
) AS dist_to_centre
FROM read_parquet('gps_points.parquet') p
CROSS JOIN boundary b
WHERE ST_Within(ST_Point(p.lng, p.lat)::GEOMETRY, b.geom)
ORDER BY dist_to_centre
LIMIT 20;
This query does a spatial filter (ST_Within) and computes distance to the centroid of the boundary, all in a single pass over the Parquet file. On 5 million records it typically runs in 2–4 seconds on a standard laptop — without any spatial index.
For repeated queries over the same dataset, materialising it into a DuckDB table with a spatial index improves subsequent query times considerably:
-- Materialise to a DuckDB table for repeated analysis
CREATE TABLE points AS
SELECT id, lat, lng, ST_Point(lng, lat)::GEOMETRY AS geom
FROM read_parquet('gps_points.parquet');
CREATE INDEX points_spatial ON points USING RTREE (geom);
Using the Python API
DuckDB’s Python package integrates cleanly with Pandas and Polars, making it natural to use in Jupyter notebooks or analysis scripts:
import duckdb
import pandas as pd
con = duckdb.connect() # in-memory, or pass a file path to persist
con.execute("INSTALL spatial; LOAD spatial;")
# Query directly to a Pandas DataFrame
result_df = con.execute("""
SELECT
id,
lat,
lng,
ST_AsText(ST_Point(lng, lat)) AS wkt
FROM read_parquet('gps_points.parquet')
WHERE lat BETWEEN 52.88 AND 53.02
AND lng BETWEEN -1.25 AND -1.08
""").df()
print(result_df.head())
You can also expose Pandas (or GeoPandas) DataFrames to DuckDB queries by registering them as virtual tables:
# Load an existing GeoDataFrame's attribute data into DuckDB
import geopandas as gpd
boundaries_gdf = gpd.read_file("oa_boundaries.geojson")
# DuckDB cannot scan shapely geometry objects, so encode the geometry to WKB
# bytes first (hence ST_GeomFromWKB below); to_wkb() returns a plain DataFrame
boundaries_df = boundaries_gdf.to_wkb()
# Register the DataFrame as a virtual table
con.register("boundaries", boundaries_df)
result = con.execute("""
SELECT
p.id,
b."OA21CD" AS area_code
FROM read_parquet('gps_points.parquet') p
JOIN boundaries b
ON ST_Within(
ST_Point(p.lng, p.lat)::GEOMETRY,
ST_GeomFromWKB(b.geometry)
)
""").df()
This pattern — DuckDB as the processing engine, GeoPandas for data loading, Pandas for downstream analysis — avoids the PostGIS infrastructure requirement entirely while maintaining SQL expressiveness.
Reading GeoJSON and Shapefiles
The spatial extension can also read GeoJSON and shapefiles directly:
-- Read a GeoJSON file as a table
SELECT *
FROM ST_Read('output_areas.geojson')
LIMIT 5;
-- Read a Shapefile; note ST_Area returns the units of the layer's CRS,
-- so this is m² only if the data uses a metre-based projected CRS
SELECT name, ST_Area(geom) AS area_m2
FROM ST_Read('parishes.shp');
ST_Read uses GDAL under the hood, so any GDAL-supported format works. This is useful for ad-hoc analysis where you do not want to import data into a dedicated spatial database first.
Aggregate Spatial Functions
Aggregation is where DuckDB's analytical heritage becomes clear. Here, GPS points are binned into H3 hexagons using the community H3 extension:
-- Count points per H3 hex (resolution 8) using H3 extension
INSTALL h3 FROM community;
LOAD h3;
SELECT
h3_latlng_to_cell(lat, lng, 8) AS h3_cell,
COUNT(*) AS point_count,
AVG(speed) AS avg_speed
FROM read_parquet('gps_points.parquet')
GROUP BY h3_cell
ORDER BY point_count DESC
LIMIT 20;
Because the H3 community extension integrates hex indexing natively into SQL, there is no need for a Python wrapper or a per-row UDF.
DuckDB vs PostGIS: When to Use Each
DuckDB’s spatial capabilities are genuinely impressive, but they are not a replacement for PostGIS in every scenario:
| Scenario | DuckDB | PostGIS |
|---|---|---|
| Exploratory analysis on a laptop | ✅ Excellent | ❌ Overkill |
| Sharing a query with collaborators | ✅ Just a file | ❌ Needs server access |
| Concurrent read/write applications | ❌ Single writer | ✅ Full MVCC |
| Multi-TB datasets with spatial indexes | ❌ Limited | ✅ Mature |
| Serverless / Lambda deployment | ✅ Embed directly | ❌ Hard |
| Complex topology operations | ❌ Limited | ✅ Full GEOS |
For research, pipelines, and analytical workflows where you control the data and need results fast without infrastructure overhead, DuckDB is often the right tool. For production applications with concurrent users and write-heavy workloads, PostGIS remains the standard.
A Practical Research Workflow
My current workflow for exploratory spatial analysis looks like this:
- Download data in Parquet or GeoJSON from a source (OS Open Data, CDRC, ONS Geoportal)
- Open a DuckDB session in Python or from the CLI
- Run INSTALL spatial; LOAD spatial; once
- Write SQL queries to filter, join, and aggregate — iterating rapidly without waiting for a server
- Export results to GeoJSON or CSV for visualisation in QGIS or a web map
The whole workflow now takes minutes rather than the half-hour that a full PostGIS setup required. For research contexts where data exploration and iteration speed matter more than concurrent access, this is a meaningful change.