# Machine Learning Named Entity Recognition

The `ml-ner` feature provides state-of-the-art Named Entity Recognition using transformer models (BERT, RoBERTa, DistilBERT) via ONNX Runtime.
## Features

- **High Accuracy**: Transformer-based models trained on CoNLL-2003
- **Auto-Download**: Automatically fetch models from the HuggingFace Hub
- **Caching**: Models are cached locally after the first download
- **Multiple Models**: Choose from 5 pre-trained models or bring your own
- **Multi-language**: Support for 40+ languages with the multilingual model
## Entity Types

All pre-trained models recognize four entity types:

- `LOC` - Locations (cities, countries, regions)
- `PER` - Persons (names of people)
- `ORG` - Organizations (companies, institutions)
- `MISC` - Miscellaneous (dates, events, products)
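For example, a minimal sketch that tallies extracted entities by label; it assumes a `model` already loaded as shown in Basic Usage below, and the sample sentence is purely illustrative:

```rust
use std::collections::HashMap;

// Group extracted entities by their label ("LOC", "PER", "ORG", "MISC").
let entities = model.extract("Angela Merkel met the UN delegation in Berlin.")?;
let mut by_label: HashMap<String, Vec<String>> = HashMap::new();
for entity in entities {
    by_label.entry(entity.label.clone()).or_default().push(entity.text.clone());
}
for (label, texts) in &by_label {
    println!("{label}: {texts:?}");
}
```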
## Installation

Add the `ml-ner-download` feature to enable both ML-NER and auto-download:

```toml
[dependencies]
spatial-narrative = { version = "0.1", features = ["ml-ner-download"] }
```

Or use just `ml-ner` if you want to provide your own models:

```toml
[dependencies]
spatial-narrative = { version = "0.1", features = ["ml-ner"] }
```
## ONNX Runtime Setup

ML-NER requires ONNX Runtime to be installed. You have several options:

### Option 1: Environment Variable

Set `ORT_DYLIB_PATH` to point to your ONNX Runtime library:

```bash
# macOS
export ORT_DYLIB_PATH=/path/to/libonnxruntime.dylib

# Linux
export ORT_DYLIB_PATH=/path/to/libonnxruntime.so

# Windows
set ORT_DYLIB_PATH=C:\path\to\onnxruntime.dll
```
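If you want your application to fail fast with a clear message, you can add a pre-flight check of your own; a minimal sketch using only the standard library (this helper is not part of the crate's API):

```rust
use std::{env, path::Path};

// Hypothetical pre-flight check: confirm ORT_DYLIB_PATH is set and points
// to an existing file before attempting to load a model.
fn check_onnx_runtime() -> Result<(), String> {
    let path = env::var("ORT_DYLIB_PATH")
        .map_err(|_| "ORT_DYLIB_PATH is not set".to_string())?;
    if !Path::new(&path).is_file() {
        return Err(format!("ORT_DYLIB_PATH points to a missing file: {path}"));
    }
    Ok(())
}
```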
### Option 2: Install via Package Manager

macOS (Homebrew):

```bash
brew install onnxruntime
export ORT_DYLIB_PATH=$(brew --prefix onnxruntime)/lib/libonnxruntime.dylib
```

Linux (Ubuntu/Debian):

```bash
sudo apt install libonnxruntime
export ORT_DYLIB_PATH=/usr/lib/libonnxruntime.so
```
### Option 3: Manual Download

Download from the ONNX Runtime releases:

1. Download the appropriate package for your platform
2. Extract the archive
3. Set `ORT_DYLIB_PATH` to the library file
## Available Models

| Model | Size | F1 Score | Speed | Languages |
|---|---|---|---|---|
| `DistilBertQuantized` | ~65MB | ~90% | Fast | English |
| `DistilBert` | ~250MB | ~90% | Fast | English |
| `BertBase` | ~400MB | ~91% | Medium | English |
| `BertLarge` | ~1.2GB | ~93% | Slow | English |
| `Multilingual` | ~700MB | ~90% | Medium | 40+ languages |

The `DistilBertQuantized` model is recommended for most use cases, offering the best balance of size, speed, and accuracy.
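To make the table's trade-offs explicit in code, here is a hypothetical selection helper (not part of the crate; it only uses the `NerModel` variants listed above):

```rust
use spatial_narrative::text::NerModel;

// Hypothetical helper: pick a pre-trained model from two coarse requirements.
fn choose_model(multilingual: bool, max_accuracy: bool) -> NerModel {
    match (multilingual, max_accuracy) {
        (true, _) => NerModel::Multilingual,             // the only multilingual option
        (false, true) => NerModel::BertLarge,            // ~93% F1, slowest
        (false, false) => NerModel::DistilBertQuantized, // recommended default
    }
}
```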
## Basic Usage

### Auto-Download (Recommended)

The simplest way to get started:

```rust
use spatial_narrative::text::{MlNerModel, NerModel};

// First run downloads ~65MB; subsequent runs load from cache
let model = MlNerModel::download_blocking(NerModel::DistilBertQuantized)?;

let text = "Dr. Sarah Chen presented her research in Paris on March 15, 2024.";
let entities = model.extract(text)?;

for entity in entities {
    println!("{}: \"{}\" (confidence: {:.2})",
        entity.label, entity.text, entity.score);
}

// Output:
// PER: "Dr. Sarah Chen" (confidence: 0.99)
// LOC: "Paris" (confidence: 0.98)
// MISC: "March 15, 2024" (confidence: 0.95)
```
### With Progress Reporting

Show download progress for large models:

```rust
use spatial_narrative::text::{MlNerModel, NerModel};

let model = MlNerModel::download_blocking_with_progress(
    NerModel::DistilBertQuantized,
    |downloaded, total| {
        if total > 0 {
            let pct = (downloaded as f64 / total as f64) * 100.0;
            println!("Downloading: {:.1}%", pct);
        }
    },
)?;
```
### Using Different Models

```rust
use spatial_narrative::text::{MlNerModel, NerModel};

// For best accuracy (larger download)
let model = MlNerModel::download_blocking(NerModel::BertLarge)?;

// For multilingual text
let model = MlNerModel::download_blocking(NerModel::Multilingual)?;

// For custom HuggingFace models
let model = MlNerModel::download_blocking(
    NerModel::Custom("my-org/my-ner-model".into()),
)?;
```
## Advanced Usage

### Manual Model Loading

If you have pre-downloaded ONNX models:

```rust
use spatial_narrative::text::MlNerModel;

// Directory should contain: model.onnx, tokenizer.json, config.json
let model = MlNerModel::from_directory("./my-ner-model/")?;
```
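Before loading, you could verify the directory layout yourself; a minimal sketch using only the standard library (this helper is not part of the crate, and the file list comes from the comment above):

```rust
use std::path::Path;

// Hypothetical check: confirm the three expected files exist before
// calling MlNerModel::from_directory.
fn verify_model_dir(dir: &Path) -> Result<(), String> {
    for file in ["model.onnx", "tokenizer.json", "config.json"] {
        if !dir.join(file).is_file() {
            return Err(format!("missing {} in {}", file, dir.display()));
        }
    }
    Ok(())
}
```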
### Cache Management

```rust
use spatial_narrative::text::{
    model_cache_dir,
    model_cache_path,
    is_model_cached,
    cache_size_bytes,
    clear_model_cache,
    NerModel,
};

// Check cache location
println!("Cache dir: {:?}", model_cache_dir());

// Check if a model is cached
let model = NerModel::DistilBertQuantized;
if is_model_cached(&model) {
    println!("Model already cached at: {:?}", model_cache_path(&model));
}

// Get total cache size
if let Ok(size) = cache_size_bytes() {
    println!("Cache size: {:.2} MB", size as f64 / 1024.0 / 1024.0);
}

// Clear the cache for a specific model
clear_model_cache(Some(&model))?;

// Clear all cached models
clear_model_cache(None)?;
```
### Async API

For async applications, use the async API:

```rust
use spatial_narrative::text::{MlNerModel, NerModel};

let model = MlNerModel::download(NerModel::DistilBertQuantized).await?;
let entities = model.extract("Text to analyze")?;
```
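For a complete program, the `await` has to run inside an async runtime; a minimal sketch assuming Tokio (the `tokio` dependency is an assumption of this example, not a requirement of the crate):

```rust
use spatial_narrative::text::{MlNerModel, NerModel};

// Assumes the `tokio` crate with the "macros" and "rt-multi-thread" features.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model = MlNerModel::download(NerModel::DistilBertQuantized).await?;
    for entity in model.extract("The conference moved from Oslo to Bergen.")? {
        println!("{}: {}", entity.label, entity.text);
    }
    Ok(())
}
```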
## Integration with Geoparsing

Combine ML-NER with gazetteer lookup for comprehensive geoparsing:

```rust
use spatial_narrative::text::{MlNerModel, NerModel};
use spatial_narrative::parser::{BuiltinGazetteer, Gazetteer};

// Extract entities with ML
let ml_model = MlNerModel::download_blocking(NerModel::DistilBertQuantized)?;
let text = "The summit was held in Geneva, Switzerland.";
let ml_entities = ml_model.extract(text)?;

// Resolve locations with the gazetteer
let gazetteer = BuiltinGazetteer::new();
for ml_entity in ml_entities {
    if ml_entity.label == "LOC" {
        // Convert to a standard Entity and look up coordinates
        let entity = ml_entity.to_entity();
        if let Some(location) = gazetteer.lookup(&entity.text) {
            println!("{} is at {}, {}",
                entity.text, location.lat, location.lon);
        }
    }
}
```
## Entity Structure

The `MlEntity` struct provides detailed extraction results:

```rust
pub struct MlEntity {
    /// Entity type: "LOC", "PER", "ORG", or "MISC"
    pub label: String,
    /// The extracted text
    pub text: String,
    /// Confidence score (0.0 to 1.0)
    pub score: f64,
    /// Start position in the original text
    pub start: usize,
    /// End position in the original text
    pub end: usize,
}
```

Convert to a standard `Entity` for use with other components:

```rust
let entity = ml_entity.to_entity();
// Returns an Entity with the appropriate EntityType enum variant
```
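The `start` and `end` offsets let you locate each match in the source string. A minimal sketch, assuming the offsets are byte indices into the original `&str` (byte and character positions coincide for the ASCII sample used here):

```rust
// Slice the original text using the reported offsets to confirm
// they line up with the extracted entity text.
let text = "The summit was held in Geneva, Switzerland.";
for entity in model.extract(text)? {
    let span = &text[entity.start..entity.end];
    assert_eq!(span, entity.text);
    println!("{} [{}..{}]: {}", entity.label, entity.start, entity.end, span);
}
```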
## Example Application

See the complete example:

```bash
cargo run --example ml_ner_download --features ml-ner-download
```

This example demonstrates:

- Checking cache status
- Auto-downloading models
- Extracting entities from various texts
- Integration with the geoparsing workflow
## Exporting Custom Models

To use your own fine-tuned models:

1. Train or fine-tune a token classification model on HuggingFace
2. Export to ONNX using Optimum:

   ```bash
   pip install optimum[exporters]
   optimum-cli export onnx --model your-model-name ./output-dir/
   ```

3. Load in spatial-narrative:

   ```rust
   let model = MlNerModel::from_directory("./output-dir/")?;
   ```

Or host on the HuggingFace Hub and use:

```rust
let model = MlNerModel::download_blocking(
    NerModel::Custom("your-org/your-model".into()),
)?;
```
## Performance Tips

- **Choose the right model**: Use `DistilBertQuantized` for most applications
- **Cache models**: The first download takes time, but subsequent loads are fast
- **Batch processing**: Process multiple texts in sequence after loading once (see the sketch below)
- **Model lifecycle**: Keep the model in memory for repeated extractions
- **Async for I/O**: Use the async API when downloading in web servers
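A minimal sketch of the load-once, extract-many pattern, using the API shown earlier (the texts are placeholders):

```rust
use spatial_narrative::text::{MlNerModel, NerModel};

// Load the model once; reuse it for every document.
let model = MlNerModel::download_blocking(NerModel::DistilBertQuantized)?;

let documents = [
    "Maria Santos flew from Lisbon to Toronto.",
    "The UN headquarters is in New York.",
];
for doc in &documents {
    let entities = model.extract(doc)?;
    println!("{} entities in: {doc}", entities.len());
}
```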
## Troubleshooting

### ONNX Runtime Not Found

If you see errors about ONNX Runtime:

- Install ONNX Runtime (see the setup section above)
- Set the `ORT_DYLIB_PATH` environment variable
- Verify the path points to the correct library file

### Model Download Fails

- Check your internet connection
- Verify that the HuggingFace Hub is accessible
- Try clearing the cache: `clear_model_cache(None)?`
- Check cache directory permissions
### Low Accuracy

- Try a larger model (`BertBase` or `BertLarge`)
- For non-English text, use the `Multilingual` model
- Consider fine-tuning a custom model on your domain
## Cache Locations

Models are cached in platform-specific directories:

- Linux: `~/.cache/spatial-narrative/models/`
- macOS: `~/Library/Caches/spatial-narrative/models/`
- Windows: `%LOCALAPPDATA%\spatial-narrative\models\`

Each model has its own subdirectory containing:

- `model.onnx` - The neural network model
- `tokenizer.json` - Text tokenization configuration
- `config.json` - Label mappings and metadata
## License Notes

- DistilBERT models: Apache 2.0 License
- BERT models: Apache 2.0 License
- Multilingual model: CC BY-NC-SA 4.0 License (non-commercial)
- ONNX Runtime: MIT License

Check individual model licenses on the HuggingFace Hub before commercial use.