Sentedel
Technical Report · June 2026

Closing the NTE Gap in EDI PHI Detection

The first model trained to handle PHI detection across both structured X12 segments and embedded NTE free text — and the benchmark that proves it works.

Relaxed F1
99.7%
Strict F1
89.9%
NTE Recall
93.0%

Background

NTE segments

Existing X12 EDI de-identification approaches handle structured segments reliably: PHI in NM1, N3/N4, and DMG sits in predictable positions with known qualifiers. The NTE segment presents a different challenge. It is a free-text field embedded within an otherwise structured transaction, appearing in Loop 2300 (claim level) and Loop 2400 (service line) of 837P claims. NTE content can carry PHI in natural language (patient names, dates of birth, member IDs) without predictable element positions.
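
To make the structured/free-text distinction concrete, here is a minimal sketch of how NTE text might be separated from positional segments. It assumes `~` as the segment terminator and `*` as the element separator (real interchanges declare their delimiters in the ISA header), and the function names and sample values are hypothetical and synthetic.

```python
def split_segments(x12: str, seg_term: str = "~", elem_sep: str = "*"):
    """Split raw X12 text into segments, each a list of elements.

    The default delimiters are an assumption; production code should read
    them from the ISA header instead.
    """
    segments = []
    for raw in x12.split(seg_term):
        raw = raw.strip()
        if raw:
            segments.append(raw.split(elem_sep))
    return segments


def partition_segments(segments):
    """Separate positional segments from NTE free text."""
    structured, notes = [], []
    for seg in segments:
        if seg[0] == "NTE":
            # NTE01 is the note reference code; NTE02 holds the free text.
            notes.append(seg[2] if len(seg) > 2 else "")
        else:
            structured.append(seg)
    return structured, notes


# Synthetic fragment: PHI sits at fixed positions in NM1 and DMG,
# but anywhere inside the NTE text.
SAMPLE = (
    "NM1*IL*1*DOE*JANE****MI*W123456789~"
    "DMG*D8*19800115*F~"
    "CLM*A1001*125***11:B:1*Y*A*Y*Y~"
    "NTE*ADD*PT JANE DOE DOB 01/15/1980 REQUESTS CALLBACK~"
)

structured, notes = partition_segments(split_segments(SAMPLE))
print(len(structured), notes)
```

In NM1 the last name is always the fourth element, so a positional rule can find it; the same name inside the NTE text has no fixed offset, which is why that segment needs a learned detector.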

Results

Benchmark results

Four models were evaluated on 500 synthetic 837P transactions, 162 of which contained NTE segments with embedded PHI. NTE content was generated using Qwen 2.5 7B-Instruct to produce diverse clinical notes with varied phrasing and casing.

Overall performance


Model                             Relaxed F1  Strict F1
Sentedel v2.1                     99.7%       89.9%
OpenAI Privacy Filter (baseline)  30.0%       1.4%
GLiNER PII Base                   58.0%       52.8%
NVIDIA GLiNER PII                 43.4%       37.8%

v2.1 achieves 99.7% relaxed F1 and 89.9% strict F1, improving over v1 (86.0% relaxed / 78.3% strict) on a harder evaluation that includes free-text PHI.

Structured vs NTE recall

Structured-segment and NTE recall by model:


Model                  Structured Recall  NTE Recall
Sentedel v2.1          99.7%              93.0%
OpenAI Privacy Filter  65.2%              67.1%
GLiNER PII Base        53.9%              29.4%
NVIDIA GLiNER PII      38.2%              40.8%

GLiNER PII Base achieves 29.4% NTE recall; NVIDIA GLiNER reaches 40.8%; the unfine-tuned Privacy Filter reaches 67.1%. Sentedel v2.1 achieves 93.0% NTE recall while maintaining 99.7% on structured segments.
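
A breakdown like the one above can be computed by tagging each gold span with the kind of segment it came from. The sketch below is an illustration under assumptions, not the evaluation code behind these numbers: the tuple layout, the `source` tags, and exact-match scoring are all hypothetical choices.

```python
from collections import defaultdict


def recall_by_source(gold, predicted):
    """Compute exact-match recall separately per segment source.

    gold:      iterable of (start, end, label, source), where source is
               "structured" or "nte" (a hypothetical tagging scheme).
    predicted: iterable of (start, end, label).
    """
    predicted = set(predicted)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for start, end, label, source in gold:
        totals[source] += 1
        if (start, end, label) in predicted:
            hits[source] += 1
    return {source: hits[source] / totals[source] for source in totals}


# Synthetic example: both structured spans are found, one NTE span is missed.
gold = [
    (9, 12, "patient_name", "structured"),
    (40, 48, "date_of_birth", "structured"),
    (90, 98, "patient_name", "nte"),
    (103, 113, "date_of_birth", "nte"),
]
predicted = [
    (9, 12, "patient_name"),
    (40, 48, "date_of_birth"),
    (90, 98, "patient_name"),
]

print(recall_by_source(gold, predicted))
```

Reporting the two recalls separately keeps a large volume of easy structured spans from masking misses in the free text.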

Approach

Methodology

v2 Experiment
Dual-model routing

v2 tested routing structured segments to the fine-tuned model and NTE text to a clinical de-ID model (obi/deid_roberta_i2b2). The clinical model, trained on lowercase EHR narratives, underperformed on uppercase EDI-style NTE text — achieving 68.2% NTE recall vs 99.3% from the structural model alone.
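
The routing idea can be sketched as a dispatcher that picks a detector per segment and shifts the returned spans back into document offsets. This is an illustration only: the detectors are passed in as plain callables, the `~` and `*` delimiters are assumed, and the stub detectors stand in for the real models.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]          # (start, end, label)
Detector = Callable[[str], List[Span]]


def route_segments(x12: str, structured_model: Detector,
                   nte_model: Detector, seg_term: str = "~") -> List[Span]:
    """Send NTE segments to nte_model and all others to structured_model.

    Each detector returns spans relative to its segment text; they are
    shifted here so the result indexes into the full document.
    """
    spans = []
    offset = 0
    for raw in x12.split(seg_term):
        # Assumes "*" is the element separator.
        model = nte_model if raw.startswith("NTE*") else structured_model
        for start, end, label in model(raw):
            spans.append((start + offset, end + offset, label))
        offset += len(raw) + len(seg_term)
    return spans


# Stub detectors (hypothetical) that locate one hard-coded value each.
def stub_structured(text: str) -> List[Span]:
    i = text.find("DOE")
    return [(i, i + 3, "patient_name")] if i >= 0 else []


def stub_nte(text: str) -> List[Span]:
    i = text.find("JANE DOE")
    return [(i, i + 8, "patient_name")] if i >= 0 else []


doc = "NM1*IL*1*DOE*JANE~NTE*ADD*PT JANE DOE CALLED~"
spans = route_segments(doc, stub_structured, stub_nte)
print([doc[s:e] for s, e, _ in spans])
```

With routing, each model only ever sees its own segment type, so a domain mismatch in the NTE model (such as casing) is not compensated by the other model.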

v2.1 Architecture
Single unified model

A single model fine-tuned on both structured and NTE-containing data outperforms the specialized dual-model routing architecture for this task.

LLM-generated NTE templates

v2 used 20 handcrafted NTE templates, which were predictable enough to achieve 99.3% recall without genuine generalization. v2.1 replaces these with LLM-generated templates from Qwen 2.5 7B-Instruct (4-bit quantized), producing clinical notes with placeholder tokens (<<PATIENT_NAME>>, <<MEMBER_ID>>, etc.) across varied styles and clinical contexts.

During EDI generation, placeholders are replaced with Faker-generated PHI values, providing exact character-level labels through offset tracking. Combined with handcrafted fallbacks, this yields 217 unique templates (137 LLM-generated, 80 handcrafted) and a more representative evaluation. NTE recall dropped from 99.3% to 93.0%, which is consistent with the model now being tested against more diverse clinical language.
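
The placeholder substitution with offset tracking might look like the sketch below. It is a minimal illustration, not the report's generator: the function name, the hard-coded values (standing in for Faker output), and the lowercase label convention are assumptions.

```python
import re

# Matches tokens such as <<PATIENT_NAME>> or <<MEMBER_ID>>.
PLACEHOLDER = re.compile(r"<<([A-Z_]+)>>")


def fill_template(template: str, values: dict):
    """Replace placeholders and record where each value lands.

    Returns (text, spans), where each span is (start, end, label) in
    character offsets of the filled text. Labels are the lowercased
    placeholder names (an assumed convention).
    """
    parts = []
    spans = []
    cursor = 0    # read position in the template
    length = 0    # length of the output written so far
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        length += len(literal)

        value = values[match.group(1)]
        spans.append((length, length + len(value), match.group(1).lower()))
        parts.append(value)
        length += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


# Stand-in values; the described pipeline draws these from Faker.
values = {"PATIENT_NAME": "JANE DOE", "MEMBER_ID": "W123456789"}
template = "PT <<PATIENT_NAME>> MEMBER <<MEMBER_ID>> REQUESTS CALLBACK"

text, spans = fill_template(template, values)
print(text)
print(spans)
```

Because the offsets are recorded at substitution time, the labels stay exact even when inserted values differ in length from their placeholders.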

Per-category

Per-category recall

Recall at the moderate (60%) threshold across all 13 PHI categories:


Category          Baseline  v2.1    Improvement
patient_address   39.3%     100.0%  +60.7%
provider_address  35.3%     100.0%  +64.7%
entity_id         51.7%     100.0%  +48.3%
patient_name      51.1%     96.9%   +45.9%
npi               65.8%     100.0%  +34.2%
phone_number      73.3%     100.0%  +26.7%
member_id         75.9%     100.0%  +24.1%
contact_name      76.7%     97.9%   +21.3%
date_of_birth     84.7%     99.8%   +15.2%
group_number      86.2%     100.0%  +13.8%
tax_id            92.4%     100.0%  +7.6%
claim_id          96.0%     100.0%  +4.0%
email_address     99.6%     100.0%  +0.4%
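
The report does not spell out how the threshold is applied. One plausible reading, used here purely as an assumption, is that a gold span counts as recalled when a same-label prediction covers at least that fraction of its characters. The sketch below implements that reading; the function names are hypothetical.

```python
def coverage(gold, pred):
    """Fraction of the gold span's characters covered by the predicted span."""
    g_start, g_end = gold
    p_start, p_end = pred
    overlap = max(0, min(g_end, p_end) - max(g_start, p_start))
    return overlap / (g_end - g_start)


def recall_at_threshold(gold, predicted, threshold):
    """Recall where a gold span is hit if a same-label prediction covers
    at least `threshold` of it (assumed interpretation of the threshold).

    gold, predicted: lists of (start, end, label).
    """
    if not gold:
        return 0.0
    hits = 0
    for g_start, g_end, g_label in gold:
        if any(
            p_label == g_label
            and coverage((g_start, g_end), (p_start, p_end)) >= threshold
            for p_start, p_end, p_label in predicted
        ):
            hits += 1
    return hits / len(gold)


# Synthetic example: the name prediction covers 7 of 10 gold characters.
gold = [(0, 10, "patient_name"), (20, 30, "member_id")]
predicted = [(0, 7, "patient_name"), (20, 30, "member_id")]

print(recall_at_threshold(gold, predicted, 0.6))  # partial match accepted
print(recall_at_threshold(gold, predicted, 1.0))  # full coverage required
```

Under this reading, lowering the threshold credits partially detected spans, which would also explain a gap between relaxed and strict scores for the same model.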

Technical details

Base model:       openai/privacy-filter (1.5B params, 50M active, MoE)
Training data:    5,000 synthetic 837P transactions (35% with NTE segments)
NTE templates:    217 unique (137 LLM-generated + 80 handcrafted fallback)
Training split:   4,200 train / 300 val / 500 test
Label strategy:   S-tag alignment into existing PII categories
Training:         3 epochs, lr=3e-4, bf16, dynamic padding, batch 4 (eff. 16)
Evaluation:       Multi-threshold (strict/relaxed), structured vs NTE breakdown

Try it

Demo

The v2.1 model is available as a demo API for X12 text including NTE segments.

Research demo — do not submit real patient data.