Place names in text are not coordinates. Between a name and a map pin lies a sequence of decisions — which country, which administrative level, which historical moment — that standard geocoding tools make silently, without awareness of document context. For most applications, silent errors are a minor nuisance. For research on conflict, forced displacement, and modern slavery, they corrupt the evidentiary record.

Topodex is a Python library developed at the University of Nottingham that adds two layers of contextual intelligence to open geocoders, designed specifically for evidence-dense conflict and human rights corpora. Rather than replacing established gazetteer infrastructure, it wraps nine open geocoding sources with a disambiguation engine that reasons from document context, and a post-resolution coherence validator that checks whether the full set of locations in a document is geographically plausible.

Every geocoding decision is fully auditable: chosen coordinate, ranked alternatives, per-component score breakdown, failure-mode risk flags, disambiguation profile, and coherence check result.

Figure: The Topodex dashboard: toponym input, document context, temporal range, and LLM-synthesised candidate toggle, ready for resolution.

Why Conflict Geocoding Fails

Much of the evidence base for research on slavery, mass violence, and forced displacement is textual: NGO field reports, survivor testimonies, news wire dispatches, displacement tracking records, and open-source intelligence summaries. These sources are dense with place names — villages, transit routes, detention sites, mining camps — that are essential for understanding the geographic structure of exploitation in conflict.

Standard geocoding tools treat place names as simple lookup keys. They return the most “important” result matching a string — typically the largest settlement with that name, resolved against the current political map. This approach fails systematically in conflict corpora:

Locality names are ambiguous. Dozens of villages may share a name across a country. The geocoder returns the most prominent one, which is almost never the right one.
Administrative boundaries shift during conflict. Areas change hands, are renamed, or are absorbed into new structures. The current map does not reflect the geography as it was when the document was written.
Names are transcribed inconsistently. Arabic, Burmese, Amharic, and Lingala place names are romanised differently across sources, producing degraded or failed matches.
Small sites are invisible. Forced labour sites, transit camps, and rural detention locations are precisely the places that standard geocoders rank lowest — overwhelmed by the nearest major city.

For research that depends on spatial analysis — mapping exploitation indicators, tracking displacement patterns, identifying concentrations of violence — silent geocoding errors are not a minor inconvenience. They corrupt analysis and undermine policy-relevant conclusions.

Layer 1: Contextual Disambiguation

Before querying any geocoder, Topodex builds a disambiguation profile for each toponym using the surrounding document text and temporal context. This profile captures the implied administrative level, temporal context, co-occurring place names, language and transliteration signals, and structured failure-mode risk flags.

The profile drives a seven-component weighted scoring system that reranks each geocoder’s candidate list, replacing generic popularity with a context-aware relevance measure:

Component	Default Weight	Role
`admin_level_match`	0.25	Asymmetric decay: over-precision penalised more than under-precision
`country_match`	0.20	Tiered scoring; boosted or penalised by implied region from profile
`proximity_to_coentities`	0.18	Mean distance to already-resolved co-entities in the document
`temporal_plausibility`	0.12	Checks Wikidata inception/dissolved and OSM start/end dates
`name_similarity`	0.10	Fuzzy string matching via rapidfuzz; floored to prevent collapse
`language_match`	0.08	Candidate country against language-to-country map from profile
`source_importance`	0.07	Geocoder’s own popularity score as a soft prior

Figure: The GeoAnchor ranking engine reference: seven components, default weights, and descriptions, as documented in the GeoAnchor reference guide.

A corroboration bonus (+0.04 for two independent sources, +0.06 for three or more) rewards candidates confirmed across multiple geocoders. Risk flags detected in the disambiguation profile dynamically adjust the weight budget before scoring: a `homonym_dominance` flag boosts proximity and name similarity while suppressing popularity; a `temporal_anachronism` flag doubles the weight on temporal plausibility.
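The scoring arithmetic described above can be illustrated with a minimal sketch. The default weights and bonus values are taken from the table and paragraph above; everything else is an assumption made for illustration and is not Topodex's actual API: the function names, the multipliers used for `homonym_dominance`, and the choice to renormalise the adjusted weights so that they still sum to 1.

```python
# Default weights as listed in the component table.
DEFAULT_WEIGHTS = {
    "admin_level_match": 0.25,
    "country_match": 0.20,
    "proximity_to_coentities": 0.18,
    "temporal_plausibility": 0.12,
    "name_similarity": 0.10,
    "language_match": 0.08,
    "source_importance": 0.07,
}


def adjust_weights(weights, flags):
    """Apply risk-flag multipliers, then renormalise to a total of 1.

    Renormalisation is an assumption: the text only says that flags
    "adjust the weight budget" before scoring.
    """
    adjusted = dict(weights)
    if "temporal_anachronism" in flags:
        # The text states that this flag doubles the temporal weight.
        adjusted["temporal_plausibility"] *= 2.0
    if "homonym_dominance" in flags:
        # Hypothetical multipliers; the text gives direction, not size.
        adjusted["proximity_to_coentities"] *= 1.5
        adjusted["name_similarity"] *= 1.5
        adjusted["source_importance"] *= 0.5
    total = sum(adjusted.values())
    return {name: value / total for name, value in adjusted.items()}


def corroboration_bonus(n_sources):
    """+0.04 for two independent sources, +0.06 for three or more."""
    if n_sources >= 3:
        return 0.06
    if n_sources == 2:
        return 0.04
    return 0.0


def score_candidate(components, n_sources, flags=()):
    """Weighted sum of per-component scores (each in 0..1) plus the bonus."""
    weights = adjust_weights(DEFAULT_WEIGHTS, set(flags))
    base = sum(weights[name] * components[name] for name in weights)
    return base + corroboration_bonus(n_sources)
```

Note that under renormalisation a "doubled" weight ends up slightly less than double its default share, because the other six weights shrink proportionally. Whether the real system renormalises, or caps the final score at 1, is not stated in the text.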

Layer 2: Spatial Coherence Validation

After all toponyms in a document are resolved, Topodex checks whether the resolved set makes geographic sense. Flags are raised when:

Resolved locations span implausibly large distances given the document’s apparent geographic scope
One location is a statistical outlier relative to the cluster — a strong signal of a wrong-country match
An LLM-based review detects contextual contradictions between co-occurring place names

The output is a Spatial Coherence Score (0–1) for the document as a whole, plus itemised issues for researcher review.
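As a rough illustration of the outlier check, the sketch below flags any resolved location whose median great-circle distance to the other locations is far larger than is typical for the document. The threshold rule, the function names, and the definition of the score as the share of non-outlier locations are assumptions made for this example; the text does not specify how Topodex computes its Spatial Coherence Score, and the LLM-based review is not modelled here.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def find_outliers(points, factor=5.0, min_km=50.0):
    """Return names whose median distance to the rest exceeds the threshold.

    points: dict mapping place name -> (lat, lon).
    The threshold (factor x typical median distance, at least min_km)
    is a hypothetical rule chosen for illustration.
    """
    names = list(points)
    if len(names) < 3:
        return []  # too few locations to define a cluster
    med_dist = {
        n: median(haversine_km(points[n], points[m]) for m in names if m != n)
        for n in names
    }
    threshold = max(factor * median(med_dist.values()), min_km)
    return [n for n in names if med_dist[n] > threshold]


def coherence_score(points, **kwargs):
    """Assumed definition: fraction of locations not flagged as outliers."""
    if not points:
        return 1.0
    return 1.0 - len(find_outliers(points, **kwargs)) / len(points)


# Approximate coordinates, for illustration only.
resolved = {
    "Prijedor": (44.98, 16.71),
    "Kozarac": (44.97, 16.84),
    "Sanski Most": (44.77, 16.67),
    "Cambridge (UK)": (52.21, 0.12),  # a deliberate wrong-country match
}
outliers = find_outliers(resolved)
```

Using the median rather than the mean keeps a single distant point from inflating the distances of the correctly resolved cluster, which is why only the wrong-country match is flagged in this example.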

Nine Open Geocoder Adapters

Topodex integrates nine open geocoding sources requiring no proprietary API access:

Geocoder	Coverage Strength
Nominatim (OpenStreetMap)	Global, high density in Europe and North America
GeoNames	Global, 11M+ feature names, extensive cross-language coverage
Wikidata SPARQL	Global, historical place data with inception/dissolution dates
OSM Overpass	Native-language name tags; Arabic, Cyrillic, Burmese script support
OpenHistoricalMap	Historical boundaries with explicit start/end dates
NGA GEOnet Names Server	6M+ entries; strong coverage of conflict regions in Africa, MENA, Southeast Asia
Overture Maps	High-quality place layer for regional offline use
Pleiades	Ancient and classical world gazetteer
Who’s on First	Open administrative gazetteer with stable identifiers

Figure: The datasets catalogue: all nine geocoder backends, their coverage strengths, and integration type, browsable from the Topodex interface.

Failure Mode Taxonomy

Topodex formally classifies six systematic failure modes that recur across conflict and forced labour corpora. Each toponym is assessed against all six, and detected flags are surfaced in the audit output with evidence and a suggested corrective action:

Failure Mode	Description	Typical Context
Proximity Collapse	A regional city dominates results, obscuring the actual small locality	Forced labour sites, rural transit camps near major urban centres
Temporal Anachronism	Present-day geography returned for a historical document context	Darfuri localities after administrative reorganisation; colonial-era corpora
Homonym Dominance	A place name common to another country outranks the contextually correct result	Shared names across the DRC/CAR/South Sudan tri-border area
Spatial Incoherence	Resolved locations are geographically incompatible with the document’s implied scope	Detected post-batch; catches mis-geocoded records before they corrupt spatial analysis
Admin Level Mismatch	Geocoder returns a region or state when a specific village or camp was named	Conflates a governorate-level entity with an actual site of exploitation
Transliteration Drift	Inconsistent romanisation causes a failed or degraded match	Affects virtually all corpora covering Sub-Saharan Africa, MENA, and Southeast Asia
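To make Transliteration Drift concrete, the sketch below shows one way variant romanisations can be compared after normalisation. It is not Topodex's implementation: the text says name similarity uses rapidfuzz, whereas this example substitutes the standard library's `difflib.SequenceMatcher` so that it runs without third-party packages. The function names and the floor value of 0.2 are assumptions.

```python
import unicodedata
from difflib import SequenceMatcher


def normalise_name(name):
    """Strip diacritics, case, and punctuation from a romanised place name."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks (macrons, accents) left by the decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace hyphens, apostrophes, and other punctuation with spaces.
    kept = "".join(ch if ch.isalnum() else " " for ch in stripped.casefold())
    return " ".join(kept.split())


def name_similarity(query, candidate, floor=0.2):
    """Similarity in [floor, 1]; the floor keeps a poor string match from
    zeroing out a candidate that other components support."""
    ratio = SequenceMatcher(
        None, normalise_name(query), normalise_name(candidate)
    ).ratio()
    return max(ratio, floor)
```

With this normalisation, "El-Fasher" and "Al Fāshir" reduce to "el fasher" and "al fashir", which differ by only two letters and so still score as a strong match.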

Auditable by Design

For research that may contribute to litigation, truth commissions, or policy briefs, the provenance of a geographic claim matters. Every resolved toponym exposes a complete audit trail. No geocoding step is a black box:

TOPONYM: Prijedor
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓ CHOSEN: 44.98°N, 16.71°E (Prijedor, Republika Srpska, Bosnia)
  Source: Nominatim OSM + Wikidata (corroborated)

SCORE BREAKDOWN:
  admin_level_match:       0.850
  country_match:           1.000
  proximity_to_coentities: 0.923   (anchored to Kozarac, Sanski Most)
  temporal_plausibility:   1.000   (Wikidata: entity existed 1992–1995)
  name_similarity:         0.941
  language_match:          1.000
  source_importance:       0.712
  corroboration_bonus:     0.040   (2 independent sources)
  weighted_total:          0.921

RISK FLAGS:
  homonym_dominance — MITIGATED
    Evidence: Croatian Prijedor ranked #3, score 1.2/10
    Action: Context weight boosted Bosnian municipal centre

SPATIAL COHERENCE:
  Document score: 0.87/1.0
  Prijedor consistent with co-located toponyms
  No outliers detected

This transparency is not a convenience feature — it is the mechanism by which researchers retain control over computational geographic claims, and through which those claims can be challenged, corrected, and reused.

Figure: Resolving "Cambridge": ranked candidates with score breakdown, a detected Homonym Dominance flag, and the chosen result pinned on the map.

Application Domains

Conflict event verification — ACLED and similar conflict event databases assign coordinates to events extracted from source texts. Topodex verifies those assignments against the source document and flags likely errors before they propagate into spatial analysis. A native ACLED adapter is included.

Survivor and testimony analysis — Where oral histories and survivor accounts name places, Topodex slots in as the geocoding layer downstream of Named Entity Recognition, converting extracted place names to verified coordinates with full auditability.

Forced labour site mapping — NGO reports and displacement records mention labour sites, transit camps, and trafficking hubs — precisely the small, rural locations that standard geocoders rank lowest. Topodex resolves to the actual small locality rather than the nearest city, enabling spatial clustering of exploitation evidence.

Historical corpus geocoding — Topodex’s temporal context parameter handles historical geography directly, using Wikidata’s historical place data and OpenHistoricalMap alongside modern gazetteer sources. Tested on ICTY judgments (Bosnia, 1992–1995) and 19th-century colonial administrative texts.
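A temporal check of this kind can be sketched as a simple interval comparison between the document's temporal range and a candidate's recorded lifespan (for example, Wikidata inception and dissolution dates). The function name, the use of whole years, the all-or-nothing scoring, and the neutral value of 0.5 for undated candidates are all assumptions made for illustration; the actual component may well be graded rather than binary.

```python
def temporal_plausibility(doc_start, doc_end, inception=None, dissolved=None):
    """Score whether a candidate existed during the document's time range.

    All arguments are years. Returns 1.0 if the candidate's lifespan
    overlaps [doc_start, doc_end], 0.0 if it does not, and a neutral
    0.5 (an assumed value) when the gazetteer records no dates at all.
    """
    if inception is None and dissolved is None:
        return 0.5
    if inception is not None and inception > doc_end:
        return 0.0  # entity was created after the document's period
    if dissolved is not None and dissolved < doc_start:
        return 0.0  # entity had ceased to exist before the period
    return 1.0
```

For a document covering 1992 to 1995, a candidate with an inception year of 2011 would score 0.0 under this rule, while a long-established settlement with no dissolution date would score 1.0.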

Evaluation

Topodex is evaluated against an annotated benchmark of 42 toponyms across 12 conflict zones, drawn from ICTY judgments (Bosnia, 1992–1995), ReliefWeb humanitarian situation reports (Sudan, DRC, Myanmar, Syria), and ACLED event data.

The evaluation framework reports Mean Reciprocal Rank (MRR), Precision@k, Failure Mode Recall across all six categories, and the Spatial Coherence Score. Baselines include Nominatim alone, mordecai3, CLIFF-CLAVIN, Edinburgh Geoparser, and GeoTxt. Ablation experiments report the marginal contribution of each scoring component.
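For readers unfamiliar with the ranking metrics, the sketch below shows how MRR and Precision@k are commonly computed when each toponym has exactly one correct (gold) candidate. The function names and input layout are illustrative, and treating Precision@k as the share of toponyms whose gold candidate appears in the top k is an assumption about how the metric is defined in this evaluation.

```python
def reciprocal_rank(ranked_ids, gold_id):
    """1/rank of the gold candidate, or 0.0 if it was not returned."""
    for rank, candidate_id in enumerate(ranked_ids, start=1):
        if candidate_id == gold_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results):
    """results: list of (ranked_ids, gold_id) pairs, one per toponym."""
    if not results:
        return 0.0
    return sum(reciprocal_rank(r, g) for r, g in results) / len(results)


def precision_at_k(results, k):
    """Share of toponyms whose gold candidate is among the top k."""
    if not results:
        return 0.0
    hits = sum(1 for ranked, gold in results if gold in ranked[:k])
    return hits / len(results)


# Three toponyms: gold ranked first, ranked second, and missing.
example = [
    (["a", "b", "c"], "a"),
    (["a", "b", "c"], "b"),
    (["a", "b", "c"], "z"),
]
mrr = mean_reciprocal_rank(example)  # (1 + 1/2 + 0) / 3
```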

The corpus expansion target is 300–500 annotated toponyms across Bosnia, Sudan, DRC, Myanmar, Haiti, and historical English-language sources.

Adaptive Scoring

Topodex supports learned weight optimisation using a Bayesian Personalised Ranking (BPR) objective over annotated candidate pairs. Weights can be fitted to a specific conflict zone’s corpus, exposing whether hand-tuned defaults generalise across domains or require regional calibration.
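The standard BPR objective minimises the negative log-sigmoid of the score margin between a preferred and a non-preferred item. Applied here, each annotated pair would consist of the component-score vector of the correct candidate and that of an incorrect one. The sketch below fits weights by plain stochastic gradient descent; the function name, hyperparameters, uniform initialisation, and L2 regularisation are assumptions, and the text does not describe Topodex's actual optimiser.

```python
import math
import random


def bpr_fit(pairs, n_features, lr=0.1, reg=0.01, epochs=200, seed=0):
    """Fit linear scoring weights with a BPR pairwise objective.

    pairs: list of (x_pos, x_neg), the component-score vectors of the
    correct and an incorrect candidate for the same toponym.
    Minimises -log(sigmoid(w . (x_pos - x_neg))) + (reg / 2) * |w|^2.
    """
    rng = random.Random(seed)
    w = [1.0 / n_features] * n_features  # start from uniform weights
    for _ in range(epochs):
        for x_pos, x_neg in rng.sample(pairs, len(pairs)):  # shuffled pass
            diff = [p - n for p, n in zip(x_pos, x_neg)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # 1 - sigmoid(margin): large when the pair is ranked wrongly.
            g = 1.0 / (1.0 + math.exp(margin))
            w = [wi + lr * (g * di - reg * wi) for wi, di in zip(w, diff)]
    return w


def linear_score(w, x):
    """Weighted sum of component scores under the fitted weights."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Toy data: feature 0 separates correct from incorrect; feature 1 misleads.
toy_pairs = [
    ([0.9, 0.3], [0.2, 0.8]),
    ([0.8, 0.1], [0.3, 0.9]),
]
fitted = bpr_fit(toy_pairs, n_features=2)
```

On the toy data the fitted weight on the discriminating feature grows while the misleading one shrinks, which is the behaviour a per-zone calibration would rely on. Weights fitted this way are unconstrained, so a real system would need to decide whether to clip or renormalise them before use.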

Researcher corrections made through the review interface are stored and incorporated into the training pool, enabling an online feedback loop: scoring weights adapt passively as researchers use and correct the system. Preliminary simulations suggest that 20–30 corrections are sufficient to adapt weights to a new conflict zone.

Research Significance

Geocoding is often treated as a solved problem — a lookup, not a research decision. In conflict and human rights research, that assumption is wrong, and its consequences are not trivial. A misresolved village displaces a forced labour site by 50 kilometres. A wrong-country match inverts a displacement pattern. A temporal anachronism places survivors in a geography that did not exist when their testimony was recorded.

Topodex addresses these failures not by replacing geocoders but by making their outputs accountable to document context. The result is a system that is simultaneously more accurate and more transparent — one where every geographic claim can be traced, challenged, and corrected by the researchers who depend on it.

The five research contributions — a formal failure mode taxonomy, context-aware candidate reranking, spatial coherence validation, a full audit chain, and multi-source corroboration — each address a distinct gap in the existing geocoding literature, and together constitute a framework specifically designed for the evidentiary standards of conflict and human rights scholarship.
