Sentedel
Technical Blog · June 2026

PHI Detection for EDI Transactions.

Building the first PHI detection benchmark for payer-side healthcare EDI — and why general-purpose models fail on structured transaction data.

Relaxed recall: 100.0%
Strict recall: 91.0%
Strict F1: 76.4%

EDI (X12) transactions contain protected health information (PHI), but they don’t look like normal text. A patient name lives inside delimited segments like NM1*IL*1*JOHNSON*MICHAEL*T***MI*XKW123456789~, not inside sentences. We benchmarked modern PII detectors on synthetic but structurally valid payer-side EDI to measure what breaks and what fixes it.
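To make the segment layout concrete, here is a minimal sketch that splits the NM1 example above into named elements. It assumes `*` as the element separator and `~` as the segment terminator (in real interchanges these are declared in the ISA header), covers only the first nine NM1 elements, and uses a hypothetical `PHI_ELEMENTS` set to mark which positions count as PHI.

```python
# Element names follow the X12 NM1 (Individual or Organizational Name) layout.
# Only NM101 through NM109 are listed; later elements are omitted in this sketch.
NM1_ELEMENTS = [
    "entity_identifier_code",  # NM101, e.g. IL = insured or subscriber
    "entity_type_qualifier",   # NM102, 1 = person
    "last_name",               # NM103
    "first_name",              # NM104
    "middle_name",             # NM105
    "name_prefix",             # NM106
    "name_suffix",             # NM107
    "id_code_qualifier",       # NM108, e.g. MI = member identification number
    "id_code",                 # NM109
]

# Assumption for this sketch: which NM1 elements are treated as PHI.
PHI_ELEMENTS = {"last_name", "first_name", "middle_name", "id_code"}


def parse_nm1(segment, element_sep="*", segment_term="~"):
    """Split one NM1 segment into named elements, dropping empty ones."""
    parts = segment.rstrip(segment_term).split(element_sep)
    if parts[0] != "NM1":
        raise ValueError(f"expected an NM1 segment, got {parts[0]!r}")
    named = dict(zip(NM1_ELEMENTS, parts[1:]))
    return {name: value for name, value in named.items() if value}


fields = parse_nm1("NM1*IL*1*JOHNSON*MICHAEL*T***MI*XKW123456789~")
phi = {name: value for name, value in fields.items() if name in PHI_ELEMENTS}
print(phi)
```

The PHI here is identified purely by position and qualifier, which is exactly the signal a sentence-trained detector has never seen.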

Finding
General-purpose models fail on EDI

Baseline PII models fragment spans across delimiters and qualifiers, collapsing strict boundary performance.

Fix
Domain fine-tuning works fast

With 4,200 synthetic training transactions, strict F1 rises from 1.2% for the base model to 76.4% after fine-tuning, with strict recall at 91.0%.

Approach

Synthetic EDI + S-tag alignment

Real payer EDI contains PHI by definition, which makes it hard to use for training or to share as a benchmark. Instead, we generated structurally valid 837P (professional claim) transactions with realistic identifiers (names, addresses, DOBs, IDs) placed in the correct segments. We then fine-tuned an open-source token classifier and used S-tag alignment to reduce boundary errors caused by subword tokenization.
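The generation idea can be sketched in a few lines. This is not our production generator: the name pools, the label set (`NAME`, `ID`, `ADDRESS`, `DOB`), and the helper names are all illustrative assumptions, and the output is only a subscriber fragment (NM1, N3, N4, DMG), not a complete 837P with its envelope segments. What it shows is how writing the segments and recording character offsets in the same pass yields labels that are exact by construction.

```python
import random

# Hypothetical value pools; a real generator would use far larger ones.
LAST_NAMES = ["JOHNSON", "GARCIA", "NGUYEN", "PATEL"]
FIRST_NAMES = ["MICHAEL", "ANA", "LINH", "PRIYA"]
STREETS = ["12 OAK ST", "400 MAIN AVE", "7 ELM CT"]
CITIES = [("AUSTIN", "TX", "78701"), ("DENVER", "CO", "80202")]


def build_segment(seg_id, elements, offset):
    """Join (value, label) pairs into one segment and record labeled spans.

    `offset` is where this segment starts in the full text. Spans are
    (start, end, label) tuples with an exclusive end.
    """
    text = seg_id
    spans = []
    for value, label in elements:
        text += "*"
        if label and value:
            start = offset + len(text)
            spans.append((start, start + len(value), label))
        text += value
    return text + "~", spans


def make_subscriber_block(rng):
    """Generate an NM1/N3/N4/DMG fragment together with exact span labels."""
    city, state, zip_code = rng.choice(CITIES)
    member_id = "".join(rng.choice("ABCDEFGHJKLMNPQRSTUVWXYZ") for _ in range(3))
    member_id += "".join(rng.choice("0123456789") for _ in range(9))
    # D8 dates are CCYYMMDD; days are capped at 28 to stay valid in any month.
    dob = f"{rng.randint(1940, 2005)}{rng.randint(1, 12):02d}{rng.randint(1, 28):02d}"
    layout = [
        ("NM1", [("IL", None), ("1", None), (rng.choice(LAST_NAMES), "NAME"),
                 (rng.choice(FIRST_NAMES), "NAME"), ("", None), ("", None),
                 ("", None), ("MI", None), (member_id, "ID")]),
        ("N3", [(rng.choice(STREETS), "ADDRESS")]),
        ("N4", [(city, "ADDRESS"), (state, "ADDRESS"), (zip_code, "ADDRESS")]),
        ("DMG", [("D8", None), (dob, "DOB"), (rng.choice("MF"), None)]),
    ]
    text, spans = "", []
    for seg_id, elements in layout:
        seg_text, seg_spans = build_segment(seg_id, elements, len(text))
        text += seg_text
        spans.extend(seg_spans)
    return text, spans


text, spans = make_subscriber_block(random.Random(0))
for start, end, label in spans:
    print(label, repr(text[start:end]))
```

Because every span is recorded as the value is written, no labeled span can include a delimiter or qualifier.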

Why synthetic works here

The model is learning structure (which segments and positions contain identifiers), not memorizing clinical content. Because the generator places every identifier itself, the labels are exact, and exact labels are more valuable here than “realistic” prose.

Benchmark

Strict vs relaxed matching

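For compliance, overlap-based NER scores can hide leakage: a prediction that covers only part of an identifier still counts as a hit. We report strict boundary metrics (exact start and end offsets) alongside relaxed overlap at a 50% threshold to show whether any characters remain exposed.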
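The difference between the two matching modes is easy to see in code. The sketch below is an illustration rather than our evaluation harness: it assumes spans are `(start, end, label)` tuples with an exclusive end, and it assumes the relaxed threshold means "the prediction covers at least 50% of the gold span's characters", which is one common definition. The predictions are a made-up example of a detector that fragments spans.

```python
def overlap(a, b):
    """Number of characters shared by two spans (end offsets are exclusive)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def is_match(gold, pred, min_overlap=None):
    """Strict match when min_overlap is None, otherwise relaxed overlap match."""
    if gold[2] != pred[2]:
        return False
    if min_overlap is None:
        return gold[:2] == pred[:2]
    return overlap(gold, pred) / (gold[1] - gold[0]) >= min_overlap


def score(gold_spans, pred_spans, min_overlap=None):
    """Return (precision, recall, f1) over (start, end, label) spans."""
    hit_gold = sum(any(is_match(g, p, min_overlap) for p in pred_spans) for g in gold_spans)
    hit_pred = sum(any(is_match(g, p, min_overlap) for g in gold_spans) for p in pred_spans)
    precision = hit_pred / len(pred_spans) if pred_spans else 0.0
    recall = hit_gold / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def span_of(text, value, label):
    """Locate the first occurrence of value and return it as a labeled span."""
    start = text.index(value)
    return (start, start + len(value), label)


segment = "NM1*IL*1*JOHNSON*MICHAEL*T***MI*XKW123456789~"
gold = [span_of(segment, "JOHNSON", "NAME"),
        span_of(segment, "MICHAEL", "NAME"),
        span_of(segment, "XKW123456789", "ID")]
# Hypothetical fragmented predictions: "JOHN" and the digits-only part of the ID.
pred = [span_of(segment, "JOHN", "NAME"),
        span_of(segment, "MICHAEL", "NAME"),
        span_of(segment, "123456789", "ID")]

strict = score(gold, pred)
relaxed = score(gold, pred, min_overlap=0.5)
print("strict  P/R/F1:", strict)
print("relaxed P/R/F1:", relaxed)
```

In this toy case relaxed recall is perfect while strict recall is one in three, and the characters "SON" and "XKW" would survive redaction.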


Model                            | Strict Recall | Relaxed Recall (50%) | Strict F1
Sentedel EDI-PHI v1              | 91.0%         | 100.0%               | 76.4%
GLiNER PII Base                  | 49.0%         | 54.2%                | 50.3%
NVIDIA GLiNER PII                | 35.9%         | 40.9%                | 39.8%
OpenAI Privacy Filter (baseline) | 3.0%          | 64.4%                | 1.2%
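The table lists strict recall and strict F1 but not precision. If the reported F1 is the usual harmonic mean of precision and recall computed over the same strict matches (an assumption; a macro-averaged F1 would not satisfy this), precision can be backed out as P = F1 · R / (2R − F1). A small sketch of that arithmetic:

```python
def implied_precision(recall, f1):
    """Solve F1 = 2PR / (P + R) for P, given recall R and F1."""
    return f1 * recall / (2 * recall - f1)


# (strict recall, strict F1) taken from the table above.
rows = {
    "Sentedel EDI-PHI v1": (0.910, 0.764),
    "GLiNER PII Base": (0.490, 0.503),
    "NVIDIA GLiNER PII": (0.359, 0.398),
    "OpenAI Privacy Filter (baseline)": (0.030, 0.012),
}

for model, (recall, f1) in rows.items():
    print(f"{model}: implied strict precision ~ {implied_precision(recall, f1):.3f}")
```

Under that assumption the fine-tuned model's strict precision comes out at roughly 66%, so most of the gap between 91.0% recall and 76.4% F1 would be false positives rather than missed spans.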

What we learned

1. EDI is out-of-distribution

Delimiters, qualifiers, and loop structure break assumptions learned from emails, documents, and chat.

2. Fine-tuning is high leverage

A few thousand examples are enough to learn consistent PHI locations across segments.

3. False positives are filterable

Many false positives cluster in predictable segment types (service dates, control numbers) and can be removed post hoc.

4. Strict metrics matter

Relaxed overlap can look “okay” while still leaking characters that can re-identify a patient.
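The third point lends itself to a simple rule-based pass. The sketch below is one way such a filter could look, not the filter we ship: the `DROP_SEGMENTS` list is an assumption chosen for illustration (DTP carries dates such as service dates; ISA, GS, ST, SE, GE, IEA, and BHT carry envelope control and reference numbers), and which segments are actually safe to drop depends on your labeling policy.

```python
# Assumption: segment IDs whose predictions are treated as false positives.
DROP_SEGMENTS = {"DTP", "ISA", "GS", "ST", "SE", "GE", "IEA", "BHT"}


def segment_id_at(text, position, element_sep="*", segment_term="~"):
    """Return the ID of the segment containing the given character position."""
    seg_start = text.rfind(segment_term, 0, position) + 1
    seg_end = text.find(element_sep, seg_start)
    if seg_end == -1:
        seg_end = len(text)
    # strip() tolerates a newline after the segment terminator.
    return text[seg_start:seg_end].strip()


def filter_predictions(text, predictions, drop_segments=DROP_SEGMENTS):
    """Keep only (start, end, label) predictions outside the dropped segments."""
    return [p for p in predictions if segment_id_at(text, p[0]) not in drop_segments]


edi = ("ST*837*0001~"
       "NM1*IL*1*JOHNSON*MICHAEL****MI*XKW123456789~"
       "DTP*472*D8*20260105~")


def span_of(value, label):
    start = edi.index(value)
    return (start, start + len(value), label)


# Hypothetical model output: one true positive and two false positives.
predictions = [span_of("0001", "ID"), span_of("JOHNSON", "NAME"), span_of("20260105", "DOB")]
kept = filter_predictions(edi, predictions)
print([edi[start:end] for start, end, _ in kept])
```

Because the filter keys on segment ID rather than on the predicted text, it never touches a date of birth in a DMG segment even though it looks identical to a service date.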

Technical details

Base model:       openai/privacy-filter (1.5B params, 50M active, MoE)
Training data:    5,000 synthetic 837P transactions, 110,053 PHI spans
Training split:   4,200 train / 300 val / 500 test
Label strategy:   S-tag alignment into existing PII categories
Training:         3 epochs, lr=3e-4, bf16, dynamic padding
Hardware:         NVIDIA L4 (24GB VRAM)
Training time:    91 minutes
Evaluation:       Multi-threshold (strict/relaxed), strict boundary focus
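For readers unfamiliar with the "dynamic padding" entry: instead of padding every example to a fixed maximum length, each batch is padded only to its own longest sequence, which matters when transactions vary widely in size. The sketch below is a plain-Python illustration of the idea, not our training code. The `pad_batch` helper and the `pad_token_id=0` default are hypothetical; the `-100` label value is the index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_LABEL = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_batch(batch, pad_token_id=0):
    """Pad token IDs, attention masks, and labels to the longest item in the batch."""
    max_len = max(len(example["input_ids"]) for example in batch)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for example in batch:
        length = len(example["input_ids"])
        gap = max_len - length
        padded["input_ids"].append(example["input_ids"] + [pad_token_id] * gap)
        padded["attention_mask"].append([1] * length + [0] * gap)
        # Padded positions get IGNORE_LABEL so they do not contribute to the loss.
        padded["labels"].append(example["labels"] + [IGNORE_LABEL] * gap)
    return padded


# Two toy examples of different lengths (token IDs and label IDs are made up).
batch = pad_batch([
    {"input_ids": [11, 12, 13, 14, 15], "labels": [0, 0, 3, 3, 0]},
    {"input_ids": [21, 22], "labels": [0, 5]},
])
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```

In a Hugging Face training setup, a token-classification data collator typically plays this role.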