
Cloud-Native Geospatial: GeoParquet, COGs, and STAC

The modern formats and cataloguing standards that are reshaping how spatial data is stored, accessed, and shared


The geospatial industry is in the middle of a format revolution. GeoJSON files copied to S3 buckets, shapefiles emailed between teams, and WMS services that time out under load are giving way to a new stack built around cloud-native principles: data that can be accessed in parts, indexed without downloading, and queried directly from object storage.

The traditional geospatial data model was designed around files: you download the whole file, open it in a desktop GIS, and work with it locally. This works fine for a parish boundary shapefile. It breaks down when you are trying to query a 200 GB satellite imagery archive or access a billion-row GPS dataset from a Jupyter notebook.

Cloud-native geospatial formats solve this by making data range-readable: you can fetch specific bytes from a remote file rather than downloading the whole thing. Combined with a discovery layer that tells you what data exists and where, the result is a system where analysis happens at the data rather than the data being moved to the analysis.
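Range-readability is easiest to see as a remote seek-and-read. Over HTTP it is the Range header (e.g. `Range: bytes=1000-1999`); the local-file sketch below shows the same access pattern with seek and read, with made-up data standing in for a remote file:

```python
import os
import tempfile

# Stand-in for a large remote file: 4096 bytes of repeating values
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(range(256)) * 16)
    path = f.name

# Over HTTP this would be:
#   requests.get(url, headers={"Range": "bytes=1000-1999"})
# Locally, the same access pattern is a seek followed by a bounded read:
with open(path, "rb") as f:
    f.seek(1000)          # jump to the byte offset we care about
    chunk = f.read(1000)  # read only 1000 bytes, not the whole file

print(len(chunk))  # 1000
os.unlink(path)
```

Cloud-native formats are designed so that the *useful* byte ranges (an index, a tile, a column chunk) sit at offsets the client can compute or look up cheaply.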

This post covers the three core components of the modern cloud-native geo stack.

GeoParquet: Cloud-Native Vector Data

GeoParquet is an open standard that encodes vector geometry into the Apache Parquet columnar format. Parquet is already the default format for large-scale data engineering (used by Spark, DuckDB, BigQuery, and Snowflake). GeoParquet extends it with geometry encoding conventions and CRS metadata.

Why Parquet for Geospatial?

Traditional vector formats have significant limitations at scale:

Format       Max file size   Column pruning   Predicate pushdown   Cloud-native
Shapefile    ~2 GB           No               No                   No
GeoJSON      Unlimited       No               No                   No
GeoPackage   Unlimited       No               No                   No
GeoParquet   Unlimited       Yes              Yes                  Yes

Column pruning means reading only the columns you request without scanning the full file. Predicate pushdown means filtering rows at the storage layer before they reach your process. For a 10-column, 50M-row dataset, these features reduce I/O by orders of magnitude.

Reading and Writing GeoParquet

GeoPandas 0.13+ supports GeoParquet natively:

import geopandas as gpd

# Write
gdf = gpd.read_file("parishes.gpkg")
gdf.to_parquet("parishes.parquet")

# Read back
gdf_back = gpd.read_parquet("parishes.parquet")

# Read directly from S3 (no download)
gdf_s3 = gpd.read_parquet(
    "s3://my-bucket/spatial-data/output_areas_2021.parquet"
)

Partial Reads with PyArrow

For very large files, you can apply filters at read time without loading the full dataset:

import pyarrow.dataset as ds
import geopandas as gpd

# Filter rows by attribute before loading geometries
dataset = ds.dataset("s3://my-bucket/gps_points.parquet")
table = dataset.to_table(
    filter=ds.field("timestamp") > "2026-01-01",  # assumes ISO-8601 string timestamps
    columns=["id", "lat", "lng", "geometry", "speed"]
)
gdf = gpd.GeoDataFrame.from_arrow(table)  # requires GeoPandas 1.0+

This is the cloud-native advantage: the reader inspects the file's footer and row-group statistics, then fetches only the row groups that can match your filter, and only the columns you need. S3 itself does no filtering; it simply serves the byte ranges the client asks for.

Cloud Optimized GeoTIFF (COG): Cloud-Native Raster

A Cloud Optimized GeoTIFF is a regular GeoTIFF with its internal structure rearranged so that HTTP range requests can efficiently retrieve overviews (lower-resolution thumbnails) and spatial tiles without reading the full file.

The structure of a COG looks like this:

┌─────────────────────────────────────────────┐
│  Header (offsets to all data locations)      │
├─────────────────────────────────────────────┤
│  Overview level 4 (coarsest, e.g. 1:250k)   │
│  Overview level 3                            │
│  Overview level 2                            │
│  Overview level 1                            │
├─────────────────────────────────────────────┤
│  Full resolution tiles (512×512 blocks)      │
└─────────────────────────────────────────────┘

A web map viewer can request just the overview for a zoomed-out view, or just the tiles covering the visible viewport — without downloading the multi-GB full-resolution file.
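Which overview a viewer fetches follows from simple arithmetic: the ratio of the full-resolution width to the viewport width picks the decimation level. A sketch (the level list mirrors the diagram above; the function name is ours, not a library API):

```python
def pick_overview(full_width_px: int, viewport_width_px: int,
                  levels=(2, 4, 8, 16, 32)) -> int:
    """Return the decimation factor whose downsampled width best fits
    the viewport without dropping below full viewport resolution."""
    scale = full_width_px / viewport_width_px  # e.g. 10980 / 512 ≈ 21.4
    chosen = 1  # fall back to full resolution for zoomed-in views
    for level in levels:
        if level <= scale:
            chosen = level
        else:
            break
    return chosen

# A 10,980-px-wide Sentinel-2 band shown in a 512-px viewport:
print(pick_overview(10980, 512))  # -> 16
```

GDAL does this selection automatically when you request a decimated read; the sketch just makes the arithmetic explicit.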

Creating a COG with rio-cogeo

pip install rio-cogeo

from rio_cogeo.cogeo import cog_translate
from rio_cogeo.profiles import cog_profiles

# Convert a standard GeoTIFF to COG
cog_translate(
    "raw/satellite_scene.tif",
    "output/satellite_scene_cog.tif",
    cog_profiles.get("deflate"),      # compression profile
    overview_resampling="average",    # antialiasing for downsampled overviews
    overview_level=5,                 # build 5 overview levels (2x to 32x)
    quiet=False
)

Validate the output:

rio cogeo validate output/satellite_scene_cog.tif

Reading a COG Spatially with Rasterio

Once a COG is on S3 (or any HTTP server), you can read just the region you need:

import rasterio
from rasterio.windows import from_bounds

with rasterio.open("s3://my-bucket/rasters/scene_cog.tif") as src:
    # Define a bounding box in the file's CRS (this example assumes the
    # COG is stored in EPSG:4326; reproject the bbox first if it is not)
    bbox = (-1.25, 52.88, -1.07, 53.02)  # Nottingham area in WGS84

    # Convert bounding box to pixel window
    window = from_bounds(*bbox, transform=src.transform)

    # Read only the windowed region — fetches only the relevant tiles
    data = src.read(window=window)
    print(f"Read array shape: {data.shape}")  # (bands, rows, cols)

Rasterio, via GDAL’s /vsicurl/ and /vsis3/ virtual file systems, handles the range requests transparently. From your code’s perspective it looks like a local file, but only the bytes you need are transferred.
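Under the hood, `from_bounds` is plain affine arithmetic. A simplified sketch for a north-up raster (ignoring rotation terms, which rasterio's real implementation handles; the function and parameter names are ours):

```python
def window_from_bounds(minx, miny, maxx, maxy,
                       origin_x, origin_y, pixel_size):
    """Map a bounding box to (row_off, col_off, height, width) pixel
    offsets for a north-up raster whose top-left corner is
    (origin_x, origin_y). Mirrors rasterio.windows.from_bounds for a
    simple, unrotated affine transform."""
    col_off = (minx - origin_x) / pixel_size
    row_off = (origin_y - maxy) / pixel_size   # rows count downwards
    width = (maxx - minx) / pixel_size
    height = (maxy - miny) / pixel_size
    return (round(row_off), round(col_off), round(height), round(width))

# A 1 km x 1 km box in a 10 m raster whose top-left is at (500000, 5900000):
print(window_from_bounds(500100, 5898000, 501100, 5899000,
                         500000, 5900000, 10))  # -> (100, 10, 100, 100)
```

Once the pixel window is known, the COG's tile layout tells the reader exactly which byte ranges to request.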

STAC: Discovery for Cloud-Native Data

The SpatioTemporal Asset Catalog (STAC) specification defines a standard JSON schema for describing geospatial assets — satellite images, point clouds, climate datasets — so that they are discoverable, searchable, and interoperable.

A STAC Item represents a single spatiotemporal asset:

{
  "type": "Feature",
  "stac_version": "1.0.0",
  "id": "S2B_MSIL2A_20260312T110609_N0510",
  "geometry": {
    "type": "Polygon",
    "coordinates": [[[...bounding polygon...]]]
  },
  "bbox": [-2.1, 52.4, 0.5, 53.8],
  "properties": {
    "datetime": "2026-03-12T11:06:09Z",
    "platform": "sentinel-2b",
    "eo:cloud_cover": 12.4
  },
  "assets": {
    "B04": {
      "href": "s3://sentinel-cogs/sentinel-s2-l2a-cogs/52/U/CB/2026/3/..._B04.tif",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "title": "Red band (10m)"
    },
    "B08": {
      "href": "s3://sentinel-cogs/sentinel-s2-l2a-cogs/52/U/CB/2026/3/..._B08.tif",
      "type": "image/tiff; application=geotiff; profile=cloud-optimized",
      "title": "Near-Infrared band (10m)"
    }
  }
}
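Because a STAC Item is ordinary GeoJSON, it needs no special tooling to inspect; the standard json module is enough. A sketch over a trimmed stand-in item (the hrefs here are placeholders, not real URLs):

```python
import json

# Trimmed stand-in for the Item shown above
item_json = """{
  "type": "Feature",
  "id": "S2B_MSIL2A_20260312T110609_N0510",
  "properties": {"datetime": "2026-03-12T11:06:09Z",
                 "eo:cloud_cover": 12.4},
  "assets": {"B04": {"href": "s3://example-bucket/B04.tif"},
             "B08": {"href": "s3://example-bucket/B08.tif"}}
}"""

item = json.loads(item_json)
cloud = item["properties"]["eo:cloud_cover"]          # searchable metadata
band_hrefs = {k: v["href"] for k, v in item["assets"].items()}  # data links
print(item["id"], cloud, sorted(band_hrefs))
```

In practice you would use pystac, which validates the schema and resolves links, but the point stands: the catalogue layer is just well-structured JSON.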

Searching a STAC API

Major providers — Microsoft Planetary Computer, Element84 Earth Search, and others — expose STAC APIs. The pystac-client library makes searching them straightforward:

pip install pystac-client planetary-computer rioxarray

import planetary_computer
import pystac_client

# Connect to Microsoft Planetary Computer STAC API. The modifier signs
# asset URLs with a SAS token so the data is readable without credentials.
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

# Search for Sentinel-2 scenes over Nottingham in 2026 with low cloud cover
results = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-1.25, 52.88, -1.07, 53.02],
    datetime="2026-01-01/2026-03-01",
    query={"eo:cloud_cover": {"lt": 20}}
)

items = list(results.items())
print(f"Found {len(items)} scenes")

for item in items[:3]:
    print(f"  {item.id}")
    print(f"  Date: {item.datetime}")
    print(f"  Cloud cover: {item.properties.get('eo:cloud_cover')}%")
    print(f"  B04 URL: {item.assets['B04'].href}")
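Note that STAC APIs do not promise any particular ordering, so "the best scene" usually means an explicit sort on eo:cloud_cover. A sketch over plain dicts shaped like item properties (the helper name is ours):

```python
def clearest(items):
    """Return items sorted by eo:cloud_cover, lowest first; items
    missing the property sort last."""
    return sorted(
        items,
        key=lambda it: it["properties"].get("eo:cloud_cover", float("inf")),
    )

# Toy stand-ins for STAC items
scenes = [
    {"id": "a", "properties": {"eo:cloud_cover": 18.0}},
    {"id": "b", "properties": {"eo:cloud_cover": 3.5}},
    {"id": "c", "properties": {}},  # no cloud metadata: goes last
]
print([s["id"] for s in clearest(scenes)])  # -> ['b', 'a', 'c']
```

With real pystac Items the same idea is `sorted(items, key=lambda i: i.properties.get("eo:cloud_cover", float("inf")))`, or a `sortby` parameter on the search itself where the API supports it.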

Computing NDVI Directly from STAC (No Download)

The whole stack comes together when you combine STAC discovery with COG range reading to compute a vegetation index over a specific area without downloading anything:

import rioxarray

# Get the best scene
item = items[0]
red_url = item.assets["B04"].href   # 10m Red band COG on S3
nir_url = item.assets["B08"].href   # 10m NIR band COG on S3

# Read only the bounding box — range requests fetch only the relevant
# tiles. The COGs are stored in UTM, so tell clip_box the bbox is WGS84.
bbox = [-1.25, 52.88, -1.07, 53.02]

red = rioxarray.open_rasterio(red_url, masked=True).rio.clip_box(*bbox, crs="EPSG:4326")
nir = rioxarray.open_rasterio(nir_url, masked=True).rio.clip_box(*bbox, crs="EPSG:4326")

# Calculate NDVI (masked=True already turned nodata pixels into NaN)
ndvi = (nir - red) / (nir + red)

print(f"Mean NDVI: {float(ndvi.mean()):.3f}")
ndvi.squeeze().plot(cmap="RdYlGn", vmin=-0.2, vmax=0.8)

This computation runs on a Nottingham-area crop of two Sentinel-2 bands, fetched as HTTP range requests from the public Planetary Computer storage. Total data transferred: a few megabytes from a 1 GB scene file.
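The index itself is worth a sanity check: with made-up reflectance values, dense vegetation (high NIR, low red) should score near 1 and water should go negative. A small NumPy sketch:

```python
import numpy as np

# Toy reflectance values (illustrative, not real Sentinel-2 data):
# columns are [dense vegetation, bare soil, water]
nir = np.array([0.45, 0.30, 0.02])
red = np.array([0.05, 0.25, 0.08])

ndvi = (nir - red) / (nir + red)
print(np.round(ndvi, 2))  # vegetation ~0.8, soil ~0.09, water negative
```

The same formula applied band-wise to the clipped rioxarray DataArrays above is what produces the NDVI map.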

The Broader Ecosystem

The cloud-native geo stack continues to grow:

  • FlatGeobuf — a binary, streamable vector format with a spatial index baked in; useful for direct web delivery
  • PMTiles — a single-file archive of vector or raster map tiles, readable via range requests from S3 or GitHub Pages
  • Zarr — cloud-native multidimensional arrays, increasingly used for climate model outputs and sensor data cubes
  • Overture Maps — Microsoft, Meta, Amazon, and TomTom’s collaborative open map dataset, released in GeoParquet

The common thread is range-readability: structure the file so that a client can fetch the bytes it needs without downloading everything. This removes the traditional bottleneck of moving data to compute, replacing it with compute at the data.

For geospatial engineers working with satellite imagery, large vector datasets, or sensor archives, fluency with this stack — STAC for discovery, COGs for raster access, GeoParquet for vector data — is becoming a baseline expectation at companies like Planet, Mapbox, Esri’s cloud teams, and national mapping agencies.

Dr James Williams
Research Fellow

Researching the intersection of place, maps, and technology.


Posts on this blog are refined using AI. All ideas, research, and technical content originate with the author; AI assists with drafting and editing.