Sentedel — v2.1: Closing the NTE Gap in EDI PHI Detection

Background

NTE segments

Existing X12 EDI de-identification approaches handle structured segments reliably: PHI in NM1, N3/N4, and DMG sits in predictable positions with known qualifiers. The NTE segment presents a different challenge. It is a free-text field embedded within an otherwise structured transaction, appearing in Loop 2300 (claim level) and Loop 2400 (service-line level) of 837P claims. NTE content can contain PHI in natural language, such as patient names, dates of birth, and member IDs, without predictable element positions.
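To make the contrast concrete, here is a minimal sketch of pulling NTE free text out of an 837P fragment. It assumes `*` as the element separator and `~` as the segment terminator (real interchanges declare their delimiters in the ISA header), and it infers the loop only from the most recent CLM (Loop 2300) or LX (Loop 2400) segment. The function name and the sample transaction are illustrative, not taken from the Sentedel pipeline.

```python
def extract_nte_notes(x12, seg_term="~", elem_sep="*"):
    """Return (loop, note_code, free_text) for every NTE segment found."""
    notes = []
    loop = None  # most recent loop marker seen (assumption: CLM -> 2300, LX -> 2400)
    for raw in x12.split(seg_term):
        seg = raw.strip()
        if not seg:
            continue
        elems = seg.split(elem_sep)
        seg_id = elems[0]
        if seg_id == "CLM":
            loop = "2300"
        elif seg_id == "LX":
            loop = "2400"
        elif seg_id == "NTE" and len(elems) >= 3:
            # NTE01 is the note reference code, NTE02 is the free-text description
            notes.append((loop, elems[1], elems[2]))
    return notes


# Synthetic fragment for illustration only
sample = (
    "CLM*A1001*150***11:B:1*Y*A*Y*Y~"
    "NTE*ADD*PT JANE DOE DOB 01/02/1980 MEMBER ID XYZ123456~"
    "LX*1~"
    "SV1*HC:99213*150*UN*1***1~"
    "NTE*ADD*CALLED PT JANE DOE RE FOLLOW UP~"
)
notes = extract_nte_notes(sample)
for loop, code, text in notes:
    print(loop, code, text)
```

Locating the NTE segment is the easy part. The PHI inside `NTE02` has no fixed position or qualifier, which is why positional rules that work for NM1 or DMG do not carry over.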

Results

Benchmark results

Four models were evaluated on 500 synthetic 837P transactions, 162 of which contained NTE segments with embedded PHI. NTE content was generated using Qwen 2.5 7B-Instruct to produce diverse clinical notes with varied phrasing and casing.

Overall performance


Model	Relaxed F1	Strict F1
Sentedel v2.1	99.7%	89.9%
OpenAI Privacy Filter (baseline)	30.0%	1.4%
GLiNER PII Base	58.0%	52.8%
NVIDIA GLiNER PII	43.4%	37.8%

v2.1 achieves 99.7% relaxed F1 and 89.9% strict F1, up from v1's 86.0% relaxed and 78.3% strict, on a harder evaluation that includes free-text PHI.
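The gap between relaxed and strict F1 comes from how closely a predicted span must line up with the labeled one. The page does not define the thresholds, so the sketch below rests on an assumption: a prediction counts as a match when it carries the same label and its character-level intersection-over-union with a gold span meets a threshold, with 1.0 standing in for "strict", 0.6 for "moderate", and any overlap for "relaxed". The function names and spans are hypothetical.

```python
def iou(a, b):
    """Character-level intersection over union of two (start, end, label) spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union else 0.0


def span_f1(gold, pred, threshold):
    """F1 where a match needs the same label and IoU > 0 and >= threshold."""
    def match(g, p):
        score = iou(g, p)
        return g[2] == p[2] and score > 0 and score >= threshold

    matched_pred = sum(any(match(g, p) for g in gold) for p in pred)
    matched_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans: one exact hit, one near miss, one slight overlap
gold = [(3, 11, "patient_name"), (16, 26, "date_of_birth"), (30, 39, "member_id")]
pred = [(3, 11, "patient_name"), (12, 26, "date_of_birth"), (36, 45, "member_id")]

strict_f1 = span_f1(gold, pred, threshold=1.0)    # exact boundaries only
moderate_f1 = span_f1(gold, pred, threshold=0.6)  # substantial overlap
relaxed_f1 = span_f1(gold, pred, threshold=0.0)   # any overlap
print(strict_f1, moderate_f1, relaxed_f1)
```

Under this reading, a model that finds the right region but clips or overshoots a boundary scores well on relaxed F1 and poorly on strict F1, which is one plausible explanation for the large relaxed/strict gaps in the table.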

Structured vs NTE recall

Recall on structured segments versus NTE segments, by model:


Model	Structured Recall	NTE Recall
Sentedel v2.1	99.7%	93.0%
OpenAI Privacy Filter	65.2%	67.1%
GLiNER PII Base	53.9%	29.4%
NVIDIA GLiNER PII	38.2%	40.8%

GLiNER PII Base achieves 29.4% NTE recall, NVIDIA GLiNER PII reaches 40.8%, and the Privacy Filter without fine-tuning reaches 67.1%. Sentedel v2.1 achieves 93.0% NTE recall while maintaining 99.7% recall on structured segments.
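A breakdown like this can be produced by tagging each labeled PHI span with the segment it came from and computing recall per group. The sketch below assumes each gold span carries an ID and a source segment ID, and that a separate matching step has already decided which gold spans were detected. All names and data are hypothetical.

```python
from collections import defaultdict


def recall_by_source(gold, detected_ids):
    """Split recall into 'structured' and 'nte' groups by source segment ID."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for span_id, seg_id in gold:
        group = "nte" if seg_id == "NTE" else "structured"
        totals[group] += 1
        if span_id in detected_ids:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical gold spans as (span_id, source_segment)
gold_spans = [(0, "NM1"), (1, "DMG"), (2, "NTE"), (3, "NTE"), (4, "N3")]
detected = {0, 1, 2, 4}  # span 3 (inside an NTE note) was missed

breakdown = recall_by_source(gold_spans, detected)
print(breakdown)
```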

Approach

Methodology

v2 Experiment

Dual-model routing

v2 tested routing structured segments to the fine-tuned model and NTE text to a clinical de-ID model (obi/deid_roberta_i2b2). The clinical model, trained on lowercase EHR narratives, underperformed on uppercase EDI-style NTE text — achieving 68.2% NTE recall vs 99.3% from the structural model alone.
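The routing idea can be sketched as a dispatcher that sends each segment to one of two detectors based on its segment ID, then shifts the returned spans back into transaction-level character offsets. This illustrates the general pattern, not the v2 code: the detectors here are toy string matchers standing in for the fine-tuned model and the clinical de-ID model, and the `*` and `~` delimiters are assumed.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]            # (start, end, label)
Detector = Callable[[str], List[Span]]


def route_segments(x12: str, structured_model: Detector, nte_model: Detector,
                   seg_term: str = "~", elem_sep: str = "*") -> List[Span]:
    """Run NTE segments through one detector and all other segments through another."""
    spans: List[Span] = []
    offset = 0  # start of the current segment within the full transaction
    for raw in x12.split(seg_term):
        seg_id = raw.split(elem_sep, 1)[0].strip()
        model = nte_model if seg_id == "NTE" else structured_model
        for start, end, label in model(raw):
            spans.append((offset + start, offset + end, label))
        offset += len(raw) + len(seg_term)
    return spans


# Toy stand-ins for the two models (hypothetical)
def toy_structured(seg: str) -> List[Span]:
    i = seg.find("DOE") if seg.startswith("NM1") else -1
    return [(i, i + 3, "patient_name")] if i >= 0 else []


def toy_nte(seg: str) -> List[Span]:
    i = seg.find("JANE DOE")
    return [(i, i + 8, "patient_name")] if i >= 0 else []


x12 = "NM1*IL*1*DOE*JANE~NTE*ADD*PT JANE DOE CALLED~"
routed = route_segments(x12, toy_structured, toy_nte)
print([x12[s:e] for s, e, _ in routed])
```

A router like this is only as good as its weaker branch. If the NTE-side model was trained on text that looks different from EDI notes (for example, different casing), its misses cap overall NTE recall regardless of how well the structured branch performs.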

v2.1 Architecture

Single unified model

A single model fine-tuned on both structured and NTE-containing data outperforms the specialized dual-model routing architecture for this task.

LLM-generated NTE templates

v2 used 20 handcrafted NTE templates, which were predictable enough to achieve 99.3% recall without genuine generalization. v2.1 replaces these with LLM-generated templates from Qwen 2.5 7B-Instruct (4-bit quantized), producing clinical notes with placeholder tokens (<<PATIENT_NAME>>, <<MEMBER_ID>>, etc.) across varied styles and clinical contexts.

During EDI generation, placeholders are replaced with Faker-generated PHI values, providing exact character-level labels through offset tracking. This yields 217 unique templates and a more representative evaluation: NTE recall dropped from 99.3% to 93.0%, which is consistent with the model now being tested against more diverse clinical language rather than a small set of predictable templates.
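Placeholder substitution with offset tracking can be sketched as follows. The example assumes the `<<NAME>>` placeholder syntax shown above and uses a plain dictionary in place of Faker so that it stays self-contained. The `<<DATE_OF_BIRTH>>` placeholder, the function name, and the sample values are illustrative assumptions.

```python
import re

PLACEHOLDER = re.compile(r"<<([A-Z_]+)>>")


def fill_template(template, values):
    """Replace <<NAME>> placeholders and return (text, spans).

    Each span is (start, end, label) in the coordinates of the filled text,
    so the labels stay exact even though values differ in length from placeholders.
    """
    parts = []
    spans = []
    cursor = 0   # read position in the template
    length = 0   # length of the output built so far
    for m in PLACEHOLDER.finditer(template):
        literal = template[cursor:m.start()]
        parts.append(literal)
        length += len(literal)
        value = values[m.group(1)]
        spans.append((length, length + len(value), m.group(1).lower()))
        parts.append(value)
        length += len(value)
        cursor = m.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


# In a pipeline like the one described, these values would come from Faker
values = {"PATIENT_NAME": "JANE DOE", "DATE_OF_BIRTH": "01/02/1980", "MEMBER_ID": "XYZ123456"}
template = "PT <<PATIENT_NAME>> DOB <<DATE_OF_BIRTH>> MEMBER ID <<MEMBER_ID>>"

text, spans = fill_template(template, values)
for start, end, label in spans:
    print(label, repr(text[start:end]))
```

Because the span boundaries are recorded at the moment of substitution, no fuzzy search is needed afterwards to find where each PHI value ended up in the note.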

Per-category

Per-category recall

Recall at the moderate (60%) threshold across all 13 PHI categories:


Category	Baseline	v2.1	Improvement
patient_address	39.3%	100.0%	+60.7%
provider_address	35.3%	100.0%	+64.7%
entity_id	51.7%	100.0%	+48.3%
patient_name	51.1%	96.9%	+45.9%
npi	65.8%	100.0%	+34.2%
phone_number	73.3%	100.0%	+26.7%
member_id	75.9%	100.0%	+24.1%
contact_name	76.7%	97.9%	+21.3%
date_of_birth	84.7%	99.8%	+15.2%
group_number	86.2%	100.0%	+13.8%
tax_id	92.4%	100.0%	+7.6%
claim_id	96.0%	100.0%	+4.0%
email_address	99.6%	100.0%	+0.4%

Technical details

Base model:       openai/privacy-filter (1.5B params, 50M active, MoE)
Training data:    5,000 synthetic 837P transactions (35% with NTE segments)
NTE templates:    217 unique (137 LLM-generated + 80 handcrafted fallback)
Training split:   4,200 train / 300 val / 500 test
Label strategy:   S-tag alignment into existing PII categories
Training:         3 epochs, lr=3e-4, bf16, dynamic padding, batch 4 (eff. 16)
Evaluation:       Multi-threshold (strict/relaxed), structured vs NTE breakdown
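As a quick sanity check on the numbers above, the effective batch size and split sizes imply a rough training length. The arithmetic below assumes the effective batch of 16 comes from gradient accumulation on a single device and that the final partial batch of each epoch is kept. Neither detail is stated in the source.

```python
import math

train_examples = 4200
val_examples = 300
test_examples = 500
per_step_batch = 4
effective_batch = 16
epochs = 3

# Assumption: single device, so accumulation alone bridges 4 -> 16
accumulation_steps = effective_batch // per_step_batch

# Assumption: the last partial batch in each epoch still triggers an update
updates_per_epoch = math.ceil(train_examples / effective_batch)
total_updates = updates_per_epoch * epochs

print("total examples:", train_examples + val_examples + test_examples)
print("accumulation steps:", accumulation_steps)
print("optimizer updates:", total_updates)
```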

Try it

Demo

The v2.1 model is available as a demo API for X12 text including NTE segments.


Research demo — do not submit real patient data.