Background
NTE segments
Existing X12 EDI de-identification approaches handle structured segments reliably — PHI in NM1, N3/N4, and DMG sits in predictable positions with known qualifiers. The NTE segment presents a different challenge: it is a free-text field embedded within an otherwise structured transaction, appearing in Loop 2300 (claim-level) and Loop 2400 (service line) of 837P claims. NTE content contains PHI in natural language — patient names, dates of birth, member IDs — without predictable element positions.
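To make the contrast concrete, the following minimal sketch pulls NTE notes out of an X12 string and tags each with the loop it belongs to. It assumes fixed `~` and `*` delimiters (real interchanges declare theirs in the ISA header), treats `CLM` as the start of Loop 2300 and `LX` as the start of Loop 2400, and uses an abbreviated sample with fictitious values rather than a complete transaction. The names `NteNote` and `iter_nte_notes` are hypothetical.

```python
from typing import Iterator, NamedTuple


class NteNote(NamedTuple):
    level: str      # "claim" (Loop 2300) or "service_line" (Loop 2400)
    qualifier: str  # NTE01, the note reference code
    text: str       # NTE02, the free-text description


def iter_nte_notes(x12: str, seg_term: str = "~", elem_sep: str = "*") -> Iterator[NteNote]:
    """Yield NTE notes, tagged with the loop opened most recently before them."""
    level = "unknown"
    for raw in x12.split(seg_term):
        elements = raw.strip().split(elem_sep)
        seg_id = elements[0]
        if seg_id == "CLM":    # CLM opens Loop 2300 (claim level)
            level = "claim"
        elif seg_id == "LX":   # LX opens Loop 2400 (service line)
            level = "service_line"
        elif seg_id == "NTE" and len(elements) >= 3:
            yield NteNote(level, elements[1], elements[2])


# Abbreviated, fictitious sample: not a complete 837P transaction.
sample = (
    "CLM*A1001*150***11:B:1*Y*A*Y*Y~"
    "NTE*ADD*PT JANE DOE DOB 01/02/1980 CALLED RE MEMBER ID XYZ123456~"
    "LX*1~"
    "SV1*HC:99213*150*UN*1***1~"
    "NTE*ADD*PER MEMBER JOHN ROE SERVICE RESCHEDULED~"
)
notes = list(iter_nte_notes(sample))
```

The structured elements around each note can be located by position; the PHI inside `text` cannot, which is why it needs a model rather than a positional rule.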
Results
Benchmark results
Four models were evaluated on 500 synthetic 837P transactions, 162 of which contained NTE segments with embedded PHI. NTE content was generated using Qwen 2.5 7B-Instruct to produce diverse clinical notes with varied phrasing and casing.
Overall performance
| Model | Relaxed F1 | Strict F1 |
|---|---|---|
| Sentedel v2.1 | 99.7% | 89.9% |
| OpenAI Privacy Filter (baseline) | 30.0% | 1.4% |
| GLiNER PII Base | 58.0% | 52.8% |
| NVIDIA GLiNER PII | 43.4% | 37.8% |
v2.1 achieves 99.7% relaxed F1 and 89.9% strict F1, improving over v1 (86.0% / 78.3%) on a harder evaluation that includes free-text PHI.
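The write-up does not define the two F1 variants, so the sketch below rests on an assumption: strict matching requires identical span boundaries and label, while relaxed matching accepts any character overlap with the same label. The helper name `span_f1` and the example spans are hypothetical, and the actual evaluation may use different matching rules.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def _overlaps(a: Span, b: Span) -> bool:
    """True when two spans share a label and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def span_f1(gold: List[Span], pred: List[Span], strict: bool) -> float:
    """Span-level F1 under the assumed strict or relaxed matching rule."""
    if strict:
        matched_pred = sum(1 for p in pred if p in gold)
        matched_gold = sum(1 for g in gold if g in pred)
    else:
        matched_pred = sum(1 for p in pred if any(_overlaps(p, g) for g in gold))
        matched_gold = sum(1 for g in gold if any(_overlaps(g, p) for p in pred))
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The first prediction covers only part of the gold name span.
gold = [(3, 11, "patient_name"), (16, 26, "date_of_birth")]
pred = [(3, 7, "patient_name"), (16, 26, "date_of_birth")]
strict_f1 = span_f1(gold, pred, strict=True)    # partial span counts as a miss
relaxed_f1 = span_f1(gold, pred, strict=False)  # partial span counts as a hit
```

Under this reading, a gap between relaxed and strict scores indicates that the model finds the PHI but does not always reproduce the exact boundaries.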
Structured vs NTE recall
Recall on structured segments versus NTE free text, by model:
| Model | Structured Recall | NTE Recall |
|---|---|---|
| Sentedel v2.1 | 99.7% | 93.0% |
| OpenAI Privacy Filter | 65.2% | 67.1% |
| GLiNER PII Base | 53.9% | 29.4% |
| NVIDIA GLiNER PII | 38.2% | 40.8% |
GLiNER PII Base achieves 29.4% NTE recall; NVIDIA GLiNER reaches 40.8%; the unfine-tuned Privacy Filter reaches 67.1%. Sentedel v2.1 achieves 93.0% NTE recall while maintaining 99.7% on structured segments.
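A breakdown like this can be computed by grouping gold PHI spans by the segment they came from. The sketch below assumes each gold span is recorded as a segment ID plus a flag saying whether the model detected it; that record shape and the name `recall_by_source` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def recall_by_source(gold: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Recall split into NTE free text versus all other (structured) segments."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for seg_id, detected in gold:
        source = "nte" if seg_id == "NTE" else "structured"
        totals[source] += 1
        hits[source] += int(detected)
    return {source: hits[source] / totals[source] for source in totals}


# Illustrative records only, not benchmark data.
records = [("NM1", True), ("DMG", True), ("N3", False), ("NTE", True), ("NTE", False)]
breakdown = recall_by_source(records)
```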
Approach
Methodology
v2 tested a dual-model routing architecture: structured segments went to the fine-tuned model and NTE text went to a clinical de-ID model (obi/deid_roberta_i2b2). The clinical model, trained on lowercase EHR narratives, underperformed on uppercase EDI-style NTE text, reaching 68.2% NTE recall versus 99.3% when the fine-tuned structural model handled NTE text on its own.
A single model fine-tuned on both structured and NTE-containing data outperforms the specialized dual-model routing architecture for this task.
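For reference, the routing design that v2 tried can be sketched as a dispatcher that picks a detector per segment and shifts the returned spans back to transaction-level offsets. This is an illustrative reconstruction, not the project's code: `route_segments`, `make_stub`, and the substring-based stub detectors are hypothetical stand-ins for the two models.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]            # (start, end_exclusive, label)
Detector = Callable[[str], List[Span]]  # returns spans relative to one segment


def route_segments(x12: str, structural: Detector, clinical: Detector,
                   seg_term: str = "~") -> List[Span]:
    """Send NTE segments to the clinical detector and all others to the structural one."""
    spans: List[Span] = []
    offset = 0
    for raw in x12.split(seg_term):
        detector = clinical if raw.startswith("NTE*") else structural
        for start, end, label in detector(raw):
            spans.append((offset + start, offset + end, label))
        offset += len(raw) + len(seg_term)  # advance past the segment and its terminator
    return spans


def make_stub(needle: str, label: str) -> Detector:
    """Toy detector that flags the first occurrence of a fixed substring."""
    def detect(text: str) -> List[Span]:
        i = text.find(needle)
        return [(i, i + len(needle), label)] if i >= 0 else []
    return detect


x12 = "NM1*IL*1*DOE*JANE~NTE*ADD*PT JANE DOE CALLED~"
spans = route_segments(
    x12,
    structural=make_stub("DOE", "patient_name"),
    clinical=make_stub("JANE DOE", "patient_name"),
)
found = [x12[s:e] for s, e, _ in spans]
```

The single-model approach removes the dispatcher entirely, which also removes the mismatch between the clinical model's training text and uppercase EDI notes.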
LLM-generated NTE templates
v2 used 20 handcrafted NTE templates, which were predictable enough to achieve 99.3% recall without genuine generalization. v2.1 replaces these with LLM-generated templates from Qwen 2.5 7B-Instruct (4-bit quantized), producing clinical notes with placeholder tokens (<<PATIENT_NAME>>, <<MEMBER_ID>>, etc.) across varied styles and clinical contexts.
During EDI generation, placeholders are replaced with Faker-generated PHI values, and offset tracking provides exact character-level labels. This yields 217 unique templates and a more representative evaluation: NTE recall dropped from 99.3% to 93.0%, which is consistent with the test set now covering more diverse clinical language than the handcrafted templates did.
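The placeholder substitution with offset tracking can be sketched as follows. This is a minimal version under stated assumptions: the PHI values are hard-coded here instead of coming from Faker so the block has no dependencies, the `<<DATE_OF_BIRTH>>` placeholder name is a guess at one of the unlisted tokens, and `fill_template` is a hypothetical helper.

```python
import re
from typing import Dict, List, Tuple

PLACEHOLDER = re.compile(r"<<([A-Z_]+)>>")


def fill_template(template: str, values: Dict[str, str]) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Replace <<NAME>> tokens and record (start, end, label) for each inserted value."""
    parts: List[str] = []
    spans: List[Tuple[int, int, str]] = []
    cursor = 0   # read position in the template
    length = 0   # characters written to the output so far
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        length += len(literal)
        value = values[match.group(1)]
        spans.append((length, length + len(value), match.group(1).lower()))
        parts.append(value)
        length += len(value)
        cursor = match.end()
    parts.append(template[cursor:])  # trailing literal text after the last token
    return "".join(parts), spans


template = "PT <<PATIENT_NAME>> DOB <<DATE_OF_BIRTH>> MEMBER <<MEMBER_ID>>"
values = {  # fictitious values standing in for Faker output
    "PATIENT_NAME": "JANE DOE",
    "DATE_OF_BIRTH": "01/02/1980",
    "MEMBER_ID": "XYZ123456",
}
text, labels = fill_template(template, values)
```

Because the offsets are computed while the string is built, the labels stay exact even when inserted values differ in length from the placeholders they replace.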
Per-category
Per-category recall
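Recall at the moderate (60%) threshold across all 13 PHI categories, comparing the baseline (the OpenAI Privacy Filter before fine-tuning) with Sentedel v2.1. The Improvement column is the difference in percentage points: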
| Category | Baseline | v2.1 | Improvement |
|---|---|---|---|
| patient_address | 39.3% | 100.0% | +60.7% |
| provider_address | 35.3% | 100.0% | +64.7% |
| entity_id | 51.7% | 100.0% | +48.3% |
| patient_name | 51.1% | 96.9% | +45.9% |
| npi | 65.8% | 100.0% | +34.2% |
| phone_number | 73.3% | 100.0% | +26.7% |
| member_id | 75.9% | 100.0% | +24.1% |
| contact_name | 76.7% | 97.9% | +21.3% |
| date_of_birth | 84.7% | 99.8% | +15.2% |
| group_number | 86.2% | 100.0% | +13.8% |
| tax_id | 92.4% | 100.0% | +7.6% |
| claim_id | 96.0% | 100.0% | +4.0% |
| email_address | 99.6% | 100.0% | +0.4% |
Technical details
- Base model: openai/privacy-filter (1.5B params, 50M active, MoE)
- Training data: 5,000 synthetic 837P transactions (35% with NTE segments)
- NTE templates: 217 unique (137 LLM-generated + 80 handcrafted fallback)
- Training split: 4,200 train / 300 val / 500 test
- Label strategy: S-tag alignment into existing PII categories
- Training: 3 epochs, lr=3e-4, bf16, dynamic padding, batch 4 (eff. 16)
- Evaluation: Multi-threshold (strict/relaxed), structured vs NTE breakdown
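As a rough reading of the training settings, the sketch below encodes them in a small config object and derives the step counts. It assumes the effective batch of 16 comes from 4 gradient-accumulation steps on a single device (it could equally come from multiple devices) and that a final partial batch still counts as a step; `TrainConfig` and its field names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    train_examples: int = 4200
    epochs: int = 3
    learning_rate: float = 3e-4
    per_device_batch: int = 4
    grad_accum_steps: int = 4  # assumption: how "eff. 16" is reached

    @property
    def effective_batch(self) -> int:
        return self.per_device_batch * self.grad_accum_steps

    @property
    def steps_per_epoch(self) -> int:
        # A trailing partial batch is counted as one optimizer step.
        return math.ceil(self.train_examples / self.effective_batch)

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.epochs


cfg = TrainConfig()
summary = (cfg.effective_batch, cfg.steps_per_epoch, cfg.total_steps)
```

Training frameworks differ in how they round partial accumulation windows, so the derived step counts are estimates rather than the exact numbers from the original run.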
Try it
Demo
The v2.1 model is available as a demo API for X12 text including NTE segments.
Research demo — do not submit real patient data.