
DuckDB as an In-Process Spatial Analytics Engine

Running complex spatial SQL against millions of records—no server, no infrastructure, just a file and a query

"DuckDB is one of the most interesting developments in data infrastructure in years. It is an in-process analytical database that runs entirely within your application. With the spatial extension, it becomes a powerful tool for geospatial analytics at a fraction of the usual complexity."

Setting up a PostGIS instance to do spatial analysis is, at minimum, a twenty-minute affair: install PostgreSQL, install PostGIS, configure the server, create a database, load your data, write your queries. For exploratory analysis or research pipelines that need to run on a laptop without a running server, this overhead is often unjustifiable.

DuckDB offers a different model. It is an in-process OLAP database — meaning it runs embedded in your Python process, your Rust binary, or your R session, with no network socket, no background service, and no configuration files. Load a file, write SQL, get results. The spatial extension adds full geometry support on top of this.

Installing the Spatial Extension

The spatial extension is bundled with DuckDB but needs to be explicitly loaded. In any DuckDB session:

INSTALL spatial;
LOAD spatial;

This is a one-time install per machine. After that, LOAD spatial; at the start of each session is sufficient. The extension adds support for the familiar spatial functions from PostGIS: ST_Within, ST_Distance, ST_Intersects, ST_Area, and many more, including geometry type constructors.

Reading Spatial Data from Parquet

One of DuckDB’s strongest features is its native Parquet support. You can query a Parquet file on disk (or in S3) without importing it first:

-- Query a Parquet file of GPS points directly
SELECT
    id,
    lat,
    lng,
    ST_Point(lng, lat) AS geom
FROM read_parquet('gps_points.parquet')
LIMIT 5;

For a file containing 5 million GPS records, this query returns in under a second on a modern laptop. DuckDB reads only the columns it needs (columnar projection pushdown) and uses vectorised execution throughout.

Spatial Filtering with ST_Within

Now the interesting part. Suppose you want to filter all GPS points to those within a specific administrative boundary — say, the City of Nottingham polygon. You can define the polygon inline as WKT:

LOAD spatial;

-- Filter points inside a bounding polygon
WITH boundary AS (
    SELECT ST_GeomFromText(
        'POLYGON((-1.25 52.88, -1.08 52.88, -1.08 53.02, -1.25 53.02, -1.25 52.88))'
    ) AS geom
)
SELECT
    p.id,
    p.lat,
    p.lng,
    ST_Distance(
        ST_Point(p.lng, p.lat)::GEOMETRY,
        ST_Centroid(b.geom)
    ) AS dist_to_centre
FROM read_parquet('gps_points.parquet') p
CROSS JOIN boundary b
WHERE ST_Within(ST_Point(p.lng, p.lat)::GEOMETRY, b.geom)
ORDER BY dist_to_centre
LIMIT 20;

This query does a spatial filter (ST_Within) and computes distance to the centroid of the boundary, all in a single pass over the Parquet file. On 5 million records it typically runs in 2–4 seconds on a standard laptop — without any spatial index.

For repeated queries over the same dataset, materialising it into a DuckDB table with a spatial index improves subsequent query times considerably:

-- Materialise to a DuckDB table for repeated analysis
CREATE TABLE points AS
    SELECT id, lat, lng, ST_Point(lng, lat)::GEOMETRY AS geom
    FROM read_parquet('gps_points.parquet');

CREATE INDEX points_spatial ON points USING RTREE (geom);

Using the Python API

DuckDB’s Python package integrates cleanly with Pandas and Polars, making it natural to use in Jupyter notebooks or analysis scripts:

import duckdb
import pandas as pd

con = duckdb.connect()  # in-memory, or pass a file path to persist

con.execute("INSTALL spatial; LOAD spatial;")

# Query directly to a Pandas DataFrame
result_df = con.execute("""
    SELECT
        id,
        lat,
        lng,
        ST_AsText(ST_Point(lng, lat)) AS wkt
    FROM read_parquet('gps_points.parquet')
    WHERE lat BETWEEN 52.88 AND 53.02
      AND lng BETWEEN -1.25 AND -1.08
""").df()

print(result_df.head())

You can also expose existing Pandas (or GeoPandas) DataFrames to DuckDB queries, either implicitly by referencing the Python variable name in SQL or explicitly with con.register:

# Load an existing GeoDataFrame's attribute data into DuckDB
import geopandas as gpd

boundaries_gdf = gpd.read_file("oa_boundaries.geojson")
# Register the GeoDataFrame as a virtual table
con.register("boundaries", boundaries_gdf)

result = con.execute("""
    SELECT
        p.id,
        b."OA21CD" AS area_code
    FROM read_parquet('gps_points.parquet') p
    JOIN boundaries b
        ON ST_Within(
            ST_Point(p.lng, p.lat)::GEOMETRY,
            ST_GeomFromWKB(b.geometry)
        )
""").df()

This pattern — DuckDB as the processing engine, GeoPandas for data loading, Pandas for downstream analysis — avoids the PostGIS infrastructure requirement entirely while maintaining SQL expressiveness.

Reading GeoJSON and Shapefiles

The spatial extension can also read GeoJSON and shapefiles directly:

-- Read a GeoJSON file as a table
SELECT *
FROM ST_Read('output_areas.geojson')
LIMIT 5;

-- Read a Shapefile
SELECT name, ST_Area(geom) AS area_m2
FROM ST_Read('parishes.shp');

ST_Read uses GDAL under the hood, so any GDAL-supported format works. This is useful for ad-hoc analysis where you do not want to import data into a dedicated spatial database first.

Aggregate Spatial Functions

Aggregations over spatial bins are where the analytical power becomes clear. Here, points are binned into H3 hexagons using the community H3 extension:

-- Count points per H3 hex (resolution 8) using H3 extension
INSTALL h3 FROM community;
LOAD h3;

SELECT
    h3_latlng_to_cell(lat, lng, 8) AS h3_cell,
    COUNT(*)                        AS point_count,
    AVG(speed)                      AS avg_speed
FROM read_parquet('gps_points.parquet')
GROUP BY h3_cell
ORDER BY point_count DESC
LIMIT 20;

Because H3 is available as a community extension, the hexagonal indexing happens natively in SQL — no Python wrapper needed.

DuckDB vs PostGIS: When to Use Each

DuckDB’s spatial capabilities are genuinely impressive, but they are not a replacement for PostGIS in every scenario:

| Scenario | DuckDB | PostGIS |
| --- | --- | --- |
| Exploratory analysis on a laptop | ✅ Excellent | ❌ Overkill |
| Sharing a query with collaborators | ✅ Just a file | ❌ Needs server access |
| Concurrent read/write applications | ❌ Single writer | ✅ Full MVCC |
| Multi-TB datasets with spatial indexes | ❌ Limited | ✅ Mature |
| Serverless / Lambda deployment | ✅ Embed directly | ❌ Hard |
| Complex topology operations | ❌ Limited | ✅ Full GEOS |

For research, pipelines, and analytical workflows where you control the data and need results fast without infrastructure overhead, DuckDB is often the right tool. For production applications with concurrent users and write-heavy workloads, PostGIS remains the standard.

A Practical Research Workflow

My current workflow for exploratory spatial analysis looks like this:

  1. Download data in Parquet or GeoJSON from a source (OS Open Data, CDRC, ONS Geoportal)
  2. Open a DuckDB session in Python or from the CLI
  3. INSTALL spatial; LOAD spatial; once
  4. Write SQL queries to filter, join, and aggregate — iterating rapidly without waiting for a server
  5. Export results to GeoJSON or CSV for visualisation in QGIS or a web map

Steps 1–4 now take minutes rather than the half-hour that a full PostGIS setup required. For research contexts where data exploration and iteration speed are more important than concurrent access, this is a meaningful change.

Dr James Williams
Research Fellow

Researching the intersection of place, maps, and technology.

Posts on this blog are refined using AI. All ideas, research, and technical content originate with the author; AI assists with drafting and editing.