Ongoing 2026–present research

Topodex

Contextual geocoding and spatial coherence validation for conflict and human rights text corpora.

Tags: Python, NLP, Nominatim, Pleiades, OpenHistoricalMap
9 geocoders
7 scoring weights
6 failure modes
MIT open source

Place names in text are not coordinates. Between a name and a map pin lies a sequence of decisions — which country, which administrative level, which historical moment — that standard geocoding tools make silently, without awareness of document context. For most applications, silent errors are a minor nuisance. For research on conflict, forced displacement, and modern slavery, they corrupt the evidentiary record.

Topodex is a Python library developed at the University of Nottingham that adds two layers of contextual intelligence to open geocoders, designed specifically for evidence-dense conflict and human rights corpora. Rather than replacing established gazetteer infrastructure, it wraps nine open geocoding sources with a disambiguation engine that reasons from document context, and a post-resolution coherence validator that checks whether the full set of locations in a document is geographically plausible.

Every geocoding decision is fully auditable: chosen coordinate, ranked alternatives, per-component score breakdown, failure-mode risk flags, disambiguation profile, and coherence check result.

Figure: The Topodex dashboard, with toponym input, document context, temporal range, and LLM-synthesised candidate toggle, ready for resolution.

Why Conflict Geocoding Fails

Much of the evidence base for research on slavery, mass violence, and forced displacement is textual: NGO field reports, survivor testimonies, news wire dispatches, displacement tracking records, and open-source intelligence summaries. These sources are dense with place names — villages, transit routes, detention sites, mining camps — that are essential for understanding the geographic structure of exploitation in conflict.

Standard geocoding tools treat place names as simple lookup keys. They return the most “important” result matching a string — typically the largest settlement with that name, resolved against the current political map. This approach fails systematically in conflict corpora:

  • Locality names are ambiguous. Dozens of villages may share a name across a country. The geocoder returns the most prominent one, which is almost never the right one.
  • Administrative boundaries shift during conflict. Areas change hands, are renamed, or are absorbed into new structures. The current map does not reflect the geography as it was when the document was written.
  • Names are transcribed inconsistently. Arabic, Burmese, Amharic, and Lingala place names are romanised differently across sources, producing degraded or failed matches.
  • Small sites are invisible. Forced labour sites, transit camps, and rural detention locations are precisely the places that standard geocoders rank lowest — overwhelmed by the nearest major city.

For research that depends on spatial analysis — mapping exploitation indicators, tracking displacement patterns, identifying concentration of violence — silent geocoding errors are not a minor inconvenience. They corrupt analysis and undermine policy-relevant conclusions.

Layer 1: Contextual Disambiguation

Before querying any geocoder, Topodex builds a disambiguation profile for each toponym using the surrounding document text and temporal context. This profile captures the implied administrative level, temporal context, co-occurring place names, language and transliteration signals, and structured failure-mode risk flags.
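
One way to picture the profile is as a small record type. The sketch below is illustrative only: the class name and field names are hypothetical and are not taken from the Topodex codebase. They simply mirror the five kinds of signal listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DisambiguationProfile:
    """Hypothetical container for the context gathered around one toponym."""

    toponym: str
    # Implied administrative level, e.g. "village" or "district" (assumed vocabulary).
    implied_admin_level: Optional[str] = None
    # Temporal context as an inclusive (first_year, last_year) range.
    temporal_range: Optional[Tuple[int, int]] = None
    # Other place names found in the same document.
    co_occurring_places: List[str] = field(default_factory=list)
    # Language and transliteration signals, e.g. ISO 639-1 codes (assumed encoding).
    language_hints: List[str] = field(default_factory=list)
    # Failure-mode risk flags such as "homonym_dominance".
    risk_flags: List[str] = field(default_factory=list)


profile = DisambiguationProfile(
    toponym="Prijedor",
    temporal_range=(1992, 1995),
    co_occurring_places=["Kozarac", "Sanski Most"],
)
```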

The profile drives a seven-component weighted scoring system that reranks each geocoder’s candidate list, replacing generic popularity with a context-aware relevance measure:

| Component | Default weight | Role |
|---|---|---|
| admin_level_match | 0.25 | Asymmetric decay: over-precision penalised more than under-precision |
| country_match | 0.20 | Tiered scoring; boosted or penalised by implied region from profile |
| proximity_to_coentities | 0.18 | Mean distance to already-resolved co-entities in the document |
| temporal_plausibility | 0.12 | Checks Wikidata inception/dissolved and OSM start/end dates |
| name_similarity | 0.10 | Fuzzy string matching via rapidfuzz; floored to prevent collapse |
| language_match | 0.08 | Candidate country against language-to-country map from profile |
| source_importance | 0.07 | Geocoder’s own popularity score as a soft prior |

Figure: The GeoAnchor ranking engine reference, listing the seven components with their default weights and descriptions, as documented in the GeoAnchor reference guide.

A corroboration bonus (+0.04 for two independent sources, +0.06 for three or more) rewards candidates confirmed across multiple geocoders. Risk flags detected in the disambiguation profile dynamically adjust the weight budget before scoring: a homonym_dominance flag boosts proximity and name similarity while suppressing popularity; a temporal_anachronism flag doubles the weight on temporal plausibility.
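
The scoring step can be sketched as follows. This is a minimal illustration rather than Topodex's actual implementation: the function names are hypothetical, and the homonym_dominance multipliers (1.5 and 0.5) and the renormalisation of the weight budget are assumptions. Only the default weights, the doubling of temporal plausibility, and the bonus values come from the description above.

```python
DEFAULT_WEIGHTS = {
    "admin_level_match": 0.25,
    "country_match": 0.20,
    "proximity_to_coentities": 0.18,
    "temporal_plausibility": 0.12,
    "name_similarity": 0.10,
    "language_match": 0.08,
    "source_importance": 0.07,
}


def adjust_weights(weights, risk_flags):
    """Reallocate the weight budget according to detected risk flags."""
    adjusted = dict(weights)
    if "temporal_anachronism" in risk_flags:
        # From the text: the weight on temporal plausibility is doubled.
        adjusted["temporal_plausibility"] *= 2.0
    if "homonym_dominance" in risk_flags:
        # Assumed multipliers: the text gives only the direction of each change.
        adjusted["proximity_to_coentities"] *= 1.5
        adjusted["name_similarity"] *= 1.5
        adjusted["source_importance"] *= 0.5
    # Assumption: renormalise so the adjusted weights still sum to 1.
    total = sum(adjusted.values())
    return {name: value / total for name, value in adjusted.items()}


def corroboration_bonus(n_sources):
    """Bonus for candidates confirmed by several independent geocoders."""
    if n_sources >= 3:
        return 0.06
    if n_sources == 2:
        return 0.04
    return 0.0


def score_candidate(components, n_sources, risk_flags=()):
    """Weighted sum of per-component scores in [0, 1], plus the corroboration bonus."""
    weights = adjust_weights(DEFAULT_WEIGHTS, risk_flags)
    base = sum(weights[name] * components[name] for name in weights)
    return base + corroboration_bonus(n_sources)


# Illustrative component scores for one candidate (made-up values).
example = {name: 0.8 for name in DEFAULT_WEIGHTS}
example["source_importance"] = 0.3
flagged_score = score_candidate(example, n_sources=2, risk_flags={"homonym_dominance"})
```

Renormalising after the flag adjustments keeps scores comparable across toponyms with different flag sets. Whether Topodex does this is not stated in the text.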

Layer 2: Spatial Coherence Validation

After all toponyms in a document are resolved, Topodex checks whether the resolved set makes geographic sense. Flags are raised when:

  • Resolved locations span implausibly large distances given the document’s apparent geographic scope
  • One location is a statistical outlier relative to the cluster — a strong signal of a wrong-country match
  • An LLM-based review detects contextual contradictions between co-occurring place names

The output is a Spatial Coherence Score (0–1) for the document as a whole, plus itemised issues for researcher review.
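
The first two checks can be sketched with great-circle distances alone. The code below is a simplified stand-in under stated assumptions, not the Topodex validator: the function names, the 500 km span limit, the outlier factor of 5, and the way issues reduce the score are all hypothetical, and the LLM-based review is omitted. The coordinates are approximate, and one of them is deliberately wrong.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def coherence_report(points, max_span_km=500.0, outlier_factor=5.0):
    """Return (score, issues) for a dict mapping toponym -> (lat, lon)."""
    names = list(points)
    issues = []
    if len(names) < 2:
        return 1.0, issues
    # Median distance from each location to every other location.
    typical = {
        n: median(haversine_km(points[n], points[m]) for m in names if m != n)
        for n in names
    }
    baseline = max(median(typical.values()), 1.0)
    outliers = [n for n in names if typical[n] > outlier_factor * baseline]
    for n in outliers:
        issues.append(f"outlier: {n} is typically {typical[n]:.0f} km from the other locations")
    # Span check on the remaining cluster only, so one outlier is not penalised twice.
    cluster = [n for n in names if n not in outliers]
    span = max(
        (haversine_km(points[a], points[b]) for a in cluster for b in cluster),
        default=0.0,
    )
    score = 1.0 - len(outliers) / len(names)  # assumed scoring rule
    if span > max_span_km:
        issues.append(f"span: locations extend over {span:.0f} km (limit {max_span_km:.0f} km)")
        score *= 0.5  # assumed penalty
    return score, issues


resolved = {
    "Prijedor": (44.98, 16.71),
    "Kozarac": (44.97, 16.84),
    "Sanski Most": (44.77, 16.67),
    "Omarska (misresolved)": (40.0, -75.0),  # deliberately wrong-country coordinates
}
score, issues = coherence_report(resolved)
```

Using the median rather than the mean keeps a single distant point from inflating the baseline it is compared against.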

Nine Open Geocoder Adapters

Topodex integrates nine open geocoding sources requiring no proprietary API access:

| Geocoder | Coverage strength |
|---|---|
| Nominatim (OpenStreetMap) | Global, high density in Europe and North America |
| GeoNames | Global, 11M+ feature names, extensive cross-language coverage |
| Wikidata SPARQL | Global, historical place data with inception/dissolution dates |
| OSM Overpass | Native-language name tags; Arabic, Cyrillic, Burmese script support |
| OpenHistoricalMap | Historical boundaries with explicit start/end dates |
| NGA GEOnet Names Server | 6M+ entries; strong coverage of conflict regions in Africa, MENA, Southeast Asia |
| Overture Maps | High-quality place layer for regional offline use |
| Pleiades | Ancient and classical world gazetteer |
| Who’s on First | Open administrative gazetteer with stable identifiers |

Figure: The datasets catalogue in the Topodex interface, listing all nine geocoder backends with their coverage strengths and integration types.

Failure Mode Taxonomy

Topodex formally classifies six systematic failure modes that recur across conflict and forced labour corpora. Each toponym is assessed against all six, and detected flags are surfaced in the audit output with evidence and a suggested corrective action:

| Failure mode | Description | Typical context |
|---|---|---|
| Proximity Collapse | A regional city dominates results, obscuring the actual small locality | Forced labour sites, rural transit camps near major urban centres |
| Temporal Anachronism | Present-day geography returned for a historical document context | Darfuri localities after administrative reorganisation; colonial-era corpora |
| Homonym Dominance | A place name common to another country outranks the contextually correct result | Shared names across the DRC/CAR/South Sudan tri-border area |
| Spatial Incoherence | Resolved locations are geographically incompatible with the document’s implied scope | Detected post-batch; catches mis-geocoded records before they corrupt spatial analysis |
| Admin Level Mismatch | Geocoder returns a region or state when a specific village or camp was named | Conflates a governorate-level entity with an actual site of exploitation |
| Transliteration Drift | Inconsistent romanisation causes a failed or degraded match | Affects virtually all corpora covering Sub-Saharan Africa, MENA, and Southeast Asia |
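
Transliteration Drift is the failure mode that the floored name_similarity component is meant to absorb, and a small sketch shows the idea. The source names rapidfuzz for fuzzy matching; to stay within the standard library, this example substitutes difflib.SequenceMatcher, which is a different algorithm. The function names, the normalisation steps, and the floor value of 0.2 are assumptions for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def normalise(name):
    """Strip diacritics, case, spaces, and punctuation before comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in without_marks.casefold() if ch.isalnum())


def name_similarity(query, candidate, floor=0.2):
    """Similarity in [floor, 1]; the floor stops one bad romanisation zeroing the score."""
    ratio = SequenceMatcher(None, normalise(query), normalise(candidate)).ratio()
    return max(floor, ratio)


same_place = name_similarity("Goražde", "Gorazde")      # diacritics removed before matching
variant = name_similarity("Al-Fashir", "El Fasher")     # two romanisations of one name
unrelated = name_similarity("abc", "xyz")               # falls back to the floor
```

The floor matters because name similarity is only one of seven components: a candidate with a poor string match but strong proximity and country evidence should stay in contention rather than be eliminated outright.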

Auditable by Design

For research that may contribute to litigation, truth commissions, or policy briefs, the provenance of a geographic claim matters. Every resolved toponym exposes a complete audit trail. No geocoding step is a black box:

TOPONYM: Prijedor
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓ CHOSEN: 44.98°N, 16.71°E (Prijedor, Republika Srpska, Bosnia)
  Source: Nominatim OSM + Wikidata (corroborated)

SCORE BREAKDOWN:
  admin_level_match:       0.850
  country_match:           1.000
  proximity_to_coentities: 0.923   (anchored to Kozarac, Sanski Most)
  temporal_plausibility:   1.000   (Wikidata: entity existed 1992–1995)
  name_similarity:         0.941
  language_match:          1.000
  source_importance:       0.712
  corroboration_bonus:     0.040   (2 independent sources)
  weighted_total:          0.921

RISK FLAGS:
  homonym_dominance — MITIGATED
    Evidence: Croatian Prijedor ranked #3, score 1.2/10
    Action: Context weight boosted Bosnian municipal centre

SPATIAL COHERENCE:
  Document score: 0.87/1.0
  Prijedor consistent with co-located toponyms
  No outliers detected

This transparency is not a convenience feature — it is the mechanism by which researchers retain control over computational geographic claims, and through which those claims can be challenged, corrected, and reused.

Figure: Resolving "Cambridge": ranked candidates with score breakdown, a detected Homonym Dominance flag, and the chosen result pinned on the map.

Application Domains

Conflict event verification — ACLED and similar conflict event databases assign coordinates to events extracted from source texts. Topodex verifies those assignments against the source document and flags likely errors before they propagate into spatial analysis. A native ACLED adapter is included.

Survivor and testimony analysis — Where oral histories and survivor accounts name places, Topodex slots in as the geocoding layer downstream of named entity recognition (NER), converting extracted place names to verified coordinates with full auditability.

Forced labour site mapping — NGO reports and displacement records mention labour sites, transit camps, and trafficking hubs — precisely the small, rural locations that standard geocoders rank lowest. Topodex resolves to the actual small locality rather than the nearest city, enabling spatial clustering of exploitation evidence.

Historical corpus geocoding — Topodex’s temporal context parameter handles historical geography directly, using Wikidata’s historical place data and OpenHistoricalMap alongside modern gazetteer sources. Tested on ICTY judgments (Bosnia, 1992–1995) and 19th-century colonial administrative texts.
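
A temporal check of this kind can be reduced to an interval overlap between the document's date range and a candidate's period of existence. The sketch below is an assumption about how such a check might work, not the Topodex implementation: the function name, the year-level granularity, the fractional-overlap score, and the treatment of missing dates as open-ended are all hypothetical.

```python
def temporal_plausibility(doc_range, entity_start=None, entity_end=None):
    """Share of the document's year range during which the candidate existed.

    doc_range is an inclusive (first_year, last_year) pair. entity_start and
    entity_end are years, or None when the gazetteer records no date.
    """
    doc_start, doc_end = doc_range
    start = entity_start if entity_start is not None else float("-inf")
    end = entity_end if entity_end is not None else float("inf")
    overlap_years = min(doc_end, end) - max(doc_start, start) + 1
    if overlap_years <= 0:
        return 0.0
    return min(1.0, overlap_years / (doc_end - doc_start + 1))


undated = temporal_plausibility((1992, 1995))                       # no dates: assumed plausible
founded_late = temporal_plausibility((1992, 1995), entity_start=1994)
dissolved_early = temporal_plausibility((1992, 1995), entity_end=1990)
```

Treating missing dates as open-ended avoids penalising the many gazetteer entries that carry no inception or dissolution date at all.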

Evaluation

Topodex is evaluated against an annotated benchmark of 42 toponyms across 12 conflict zones, drawn from ICTY judgments (Bosnia, 1992–1995), ReliefWeb humanitarian situation reports (Sudan, DRC, Myanmar, Syria), and ACLED event data.

The evaluation framework reports Mean Reciprocal Rank (MRR), Precision@k, Failure Mode Recall across all six categories, and the Spatial Coherence Score. Baselines include Nominatim alone, mordecai3, CLIFF-CLAVIN, Edinburgh Geoparser, and GeoTxt. Ablation experiments report the marginal contribution of each scoring component.
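
For reference, the two ranking metrics can be computed as below. This sketch assumes one gold candidate per toponym, in which case Precision@k is read as the share of toponyms whose gold candidate appears in the top k. That reading, the function names, and the candidate identifiers are assumptions, not details of the Topodex evaluation code.

```python
def reciprocal_rank(ranked_ids, gold_id):
    """1 / rank of the gold candidate, or 0.0 if it was not returned."""
    for rank, candidate_id in enumerate(ranked_ids, start=1):
        if candidate_id == gold_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results):
    """results is a list of (ranked_ids, gold_id) pairs, one per toponym."""
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in results) / len(results)


def precision_at_k(results, k):
    """Fraction of toponyms whose gold candidate is among the top k."""
    hits = sum(1 for ranked, gold in results if gold in ranked[:k])
    return hits / len(results)


# Toy benchmark: gold ranked first, gold ranked second, gold missing.
toy_results = [
    (["a", "b", "c"], "a"),
    (["x", "y", "z"], "y"),
    (["p", "q"], "r"),
]
mrr = mean_reciprocal_rank(toy_results)   # (1 + 1/2 + 0) / 3 = 0.5
p_at_1 = precision_at_k(toy_results, 1)   # 1 of 3 toponyms
p_at_2 = precision_at_k(toy_results, 2)   # 2 of 3 toponyms
```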

The corpus expansion target is 300–500 annotated toponyms across Bosnia, Sudan, DRC, Myanmar, Haiti, and historical English-language sources.

Adaptive Scoring

Topodex supports learned weight optimisation using a Bayesian Personalised Ranking (BPR) objective over annotated candidate pairs. Weights can be fitted to a specific conflict zone’s corpus, exposing whether hand-tuned defaults generalise across domains or require regional calibration.
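
The BPR objective for a linear scorer can be written in a few lines. Each training pair holds the component scores of the annotated (preferred) candidate and of a competing candidate, and the loss is the negative log-sigmoid of their score margin. The sketch below is a plain gradient-descent illustration under assumptions: the function names, learning rate, and two-component feature vectors are hypothetical, and regularisation and any constraint keeping weights non-negative or normalised are left out.

```python
import math


def bpr_loss(weights, pairs):
    """Mean of -log(sigmoid(margin)) over (preferred, other) feature-vector pairs."""
    total = 0.0
    for preferred, other in pairs:
        margin = sum(w * (p - o) for w, p, o in zip(weights, preferred, other))
        total += math.log1p(math.exp(-margin))
    return total / len(pairs)


def bpr_step(weights, pairs, lr=0.1):
    """One gradient-descent step on the BPR loss."""
    grads = [0.0] * len(weights)
    for preferred, other in pairs:
        diff = [p - o for p, o in zip(preferred, other)]
        margin = sum(w * d for w, d in zip(weights, diff))
        # Derivative of -log(sigmoid(margin)) with respect to margin.
        coeff = -1.0 / (1.0 + math.exp(margin))
        for i, d in enumerate(diff):
            grads[i] += coeff * d / len(pairs)
    return [w - lr * g for w, g in zip(weights, grads)]


# One annotated pair: the correct candidate wins on the first component only.
pairs = [([1.0, 0.0], [0.0, 1.0])]
w0 = [0.5, 0.5]
w1 = bpr_step(w0, pairs)  # shifts weight towards the first component
```

Because the loss depends only on score differences within a pair, it fits the ranking task directly: what matters is that the correct candidate outranks its competitors, not the absolute score it receives.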

Researcher corrections made through the review interface are stored and incorporated into the training pool, enabling an online feedback loop: scoring weights adapt passively as researchers use and correct the system. Preliminary simulations suggest that 20–30 corrections are sufficient to adapt weights to a new conflict zone.

Research Significance

Geocoding is often treated as a solved problem — a lookup, not a research decision. In conflict and human rights research, that assumption is wrong, and its consequences are not trivial. A misresolved village displaces a forced labour site by 50 kilometres. A wrong-country match inverts a displacement pattern. A temporal anachronism places survivors in a geography that did not exist when their testimony was recorded.

Topodex addresses these failures not by replacing geocoders but by making their outputs accountable to document context. The result is a system that is simultaneously more accurate and more transparent — one where every geographic claim can be traced, challenged, and corrected by the researchers who depend on it.

The five research contributions — a formal failure mode taxonomy, context-aware candidate reranking, spatial coherence validation, a full audit chain, and multi-source corroboration — each address a distinct gap in the existing geocoding literature, and together constitute a framework specifically designed for the evidentiary standards of conflict and human rights scholarship.