EDI (X12) transactions contain PHI, but they don’t look like normal text. A patient name lives inside delimited segments like NM1*IL*1*JOHNSON*MICHAEL*T***MI*XKW123456789~, not inside sentences. We benchmarked modern PII detectors on synthetic but structurally valid payer-side EDI to measure what breaks and what fixes it.
Baseline PII models fragment spans across delimiters and qualifiers, collapsing strict boundary performance.
Fine-tuning on a few thousand EDI examples lifts strict F1 from 1.2% to 76.4% and strict recall from 3.0% to 91.0%.
Approach
Synthetic EDI + S-tag alignment
Real payer EDI is PHI by definition, so we generated valid 837P transactions with realistic identifiers (names, addresses, DOBs, IDs) placed in the correct segments. We fine-tuned an open-source token classifier and used S-tag alignment to reduce boundary errors caused by subword tokenization.
The model is learning structure (which segments/positions contain identifiers), not memorizing clinical content. Perfect labels are more valuable than “realistic” prose.
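To illustrate why synthetic data can carry perfect labels, here is a minimal sketch that assembles one segment and records exact character offsets for each identifier as the string is built. The `build_segment` helper, the `Span` type, and the label names are hypothetical, not the pipeline used for the benchmark.

```python
from dataclasses import dataclass

ELEMENT_SEP = "*"
SEGMENT_END = "~"


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def build_segment(elements, labels, offset=0):
    """Join elements into one segment and return (text, spans).

    `labels` maps element index -> PHI label. Offsets are exact because
    they are tracked while the string is assembled, not found by search.
    """
    spans, cursor = [], offset
    for i, value in enumerate(elements):
        if i > 0:
            cursor += len(ELEMENT_SEP)  # account for the delimiter
        if i in labels and value:
            spans.append(Span(cursor, cursor + len(value), labels[i]))
        cursor += len(value)
    return ELEMENT_SEP.join(elements) + SEGMENT_END, spans


# Subscriber NM1 segment: last name, first name, middle initial, member ID.
# Label names below are illustrative assumptions.
elements = ["NM1", "IL", "1", "JOHNSON", "MICHAEL", "T", "", "", "MI", "XKW123456789"]
labels = {3: "LAST_NAME", 4: "FIRST_NAME", 5: "MIDDLE_NAME", 9: "MEMBER_ID"}

text, spans = build_segment(elements, labels)
for s in spans:
    print(s.label, s.start, s.end, repr(text[s.start:s.end]))
```

Because every span comes from the generator rather than from an annotator, the labels cannot drift from the text, which is the property strict boundary training depends on.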
Benchmark
Strict vs relaxed matching
For compliance, overlap-based NER scores can hide leakage. We report strict boundary metrics (exact start/end) alongside relaxed overlap to show whether any characters remain exposed.
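The difference between the two metrics can be sketched in a few lines. This assumes spans are `(start, end)` character offsets with an exclusive end, ignores entity labels, and treats a gold span as a relaxed hit when a single prediction covers at least 50% of its characters. The exact relaxed definition used for the table may differ, and the function names are illustrative.

```python
def overlap(a, b):
    """Number of characters shared by two (start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def strict_recall(gold, pred):
    """Fraction of gold spans predicted with exact start and end."""
    if not gold:
        return 0.0
    pred_set = set(pred)
    return sum(g in pred_set for g in gold) / len(gold)


def relaxed_recall(gold, pred, threshold=0.5):
    """Fraction of gold spans covered at least `threshold` by one prediction."""
    if not gold:
        return 0.0
    hits = sum(
        any(overlap(g, p) / (g[1] - g[0]) >= threshold for p in pred)
        for g in gold
    )
    return hits / len(gold)


def leaked_chars(text, gold, pred):
    """For each gold span, the characters no prediction covers."""
    covered = set()
    for start, end in pred:
        covered.update(range(start, end))
    return ["".join(text[i] for i in range(s, e) if i not in covered) for s, e in gold]


text = "NM1*IL*1*JOHNSON*MICHAEL*T***MI*XKW123456789~"
gold = [(9, 16), (17, 24), (32, 44)]   # JOHNSON, MICHAEL, XKW123456789
pred = [(9, 16), (17, 22), (35, 44)]   # two predictions clip a boundary

print(strict_recall(gold, pred))       # only 1 of 3 spans matches exactly
print(relaxed_recall(gold, pred))      # all 3 spans pass the 50% threshold
print(leaked_chars(text, gold, pred))  # ['', 'EL', 'XKW']
```

In this toy case relaxed recall is perfect while part of a first name and the prefix of a member ID stay exposed, which is the gap strict matching is meant to surface.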
| Model | Strict Recall | Relaxed Recall (50%) | Strict F1 |
|---|---|---|---|
| Sentedel EDI-PHI v1 | 91.0% | 100.0% | 76.4% |
| GLiNER PII Base | 49.0% | 54.2% | 50.3% |
| NVIDIA GLiNER PII | 35.9% | 40.9% | 39.8% |
| OpenAI Privacy Filter (baseline) | 3.0% | 64.4% | 1.2% |
What we learned
Delimiters, qualifiers, and loop structure break assumptions learned from emails/docs/chat.
Thousands of examples are enough to learn consistent PHI locations across segments.
Many false positives cluster in predictable segment types (service dates, control numbers) and can be filtered out in post-processing.
Relaxed overlap can look “okay” while still leaking characters that re-identify.
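The post-hoc cleanup of false positives in predictable segments could look roughly like the sketch below. It maps each predicted span to the segment that contains it and drops predictions that fall in a blocked segment type. The `BLOCKED_SEGMENTS` set and the function names are assumptions for illustration; any real block list has to be checked against the labeling policy, since it removes predictions unconditionally.

```python
def segment_index(text, terminator="~", sep="*"):
    """Return (start, end, segment_id) for every non-empty segment."""
    segments, start = [], 0
    for raw in text.split(terminator):
        end = start + len(raw)
        if raw.strip():
            # lstrip tolerates a newline after the previous terminator
            seg_id = raw.lstrip().split(sep, 1)[0]
            segments.append((start, end, seg_id))
        start = end + len(terminator)
    return segments


def drop_spans_in_segments(text, spans, blocked):
    """Keep only spans whose start falls outside blocked segment types."""
    segments = segment_index(text)
    kept = []
    for s, e in spans:
        seg_id = next((sid for a, b, sid in segments if a <= s < b), None)
        if seg_id not in blocked:
            kept.append((s, e))
    return kept


# Hypothetical block list: envelope/control segments and date segments.
BLOCKED_SEGMENTS = {"ISA", "GS", "ST", "SE", "GE", "IEA", "DTP"}

text = "ST*837*0001~NM1*IL*1*JOHNSON*MICHAEL*T***MI*XKW123456789~DTP*472*D8*20240115~"


def find(sub):
    """Locate a substring and return its (start, end) offsets."""
    i = text.index(sub)
    return (i, i + len(sub))


# A control number, a real name, and a service date, all flagged by a model.
predicted = [find("0001"), find("JOHNSON"), find("20240115")]
kept = drop_spans_in_segments(text, predicted, BLOCKED_SEGMENTS)
print([text[s:e] for s, e in kept])  # ['JOHNSON']
```

Filtering on segment ID alone is deliberately coarse. A stricter variant could also check the qualifier element, for example blocking only DTP segments with a specific date qualifier.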
Technical details
Base model: openai/privacy-filter (1.5B params, 50M active, MoE)
Training data: 5,000 synthetic 837P transactions, 110,053 PHI spans
Training split: 4,200 train / 300 val / 500 test
Label strategy: S-tag alignment into existing PII categories
Training: 3 epochs, lr=3e-4, bf16, dynamic padding
Hardware: NVIDIA L4 (24GB VRAM)
Training time: 91 minutes
Evaluation: Multi-threshold (strict/relaxed), strict boundary focus