De-identification - Pidgeon Health

De-identification Methods

Safe Harbor

Removes all 18 identifier categories listed under the HIPAA Safe Harbor method. This is the simplest and most conservative approach.

Safe Harbor Plus

Removes identifiers and replaces them with realistic synthetic values. The result looks like a real message, which is useful for testing downstream systems that reject empty fields.
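The difference between blanking a field and substituting a synthetic value can be sketched on a single PID segment. This is an illustration only, not Pidgeon's implementation: the helper names `blank_fields` and `substitute_fields`, the sample segment, and the synthetic name are all hypothetical. It relies only on the HL7 v2 convention that fields in a non-MSH segment are separated by `|` and numbered from 1 after the segment name.

```python
# Illustrative sketch: helper names and sample values are hypothetical,
# not part of Pidgeon. For non-MSH segments, index N after splitting on "|"
# corresponds to field N (for example, index 5 is PID.5).

def blank_fields(segment: str, positions: list[int]) -> str:
    """Empty the given fields, as plain removal would."""
    fields = segment.split("|")
    for pos in positions:
        if pos < len(fields):
            fields[pos] = ""
    return "|".join(fields)


def substitute_fields(segment: str, replacements: dict[int, str]) -> str:
    """Overwrite the given fields with synthetic values."""
    fields = segment.split("|")
    for pos, value in replacements.items():
        if pos < len(fields):
            fields[pos] = value
    return "|".join(fields)


pid = "PID|1||12345^^^HOSP^MR||DOE^JANE||19800214|F"

removed = blank_fields(pid, [5])                        # PID.5 left empty
replaced = substitute_fields(pid, {5: "RIVERA^ALEX"})   # PID.5 still populated

print(removed)
print(replaced)
```

A receiver that validates PID.5 as required would reject the first result but accept the second, which is the case the synthetic-value approach is meant to cover.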

Expert Determination

A statistical approach that uses k-anonymity and l-diversity analysis. Configurable risk thresholds let you balance data utility against re-identification risk. It produces equivalence class analyses and risk scoring reports.
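To make the k-anonymity idea concrete, here is a minimal sketch of how an equivalence class check against a threshold could work. It is not Pidgeon's code: the function name, the choice of quasi-identifiers (`birth_year`, `zip3`, `sex`), the sample records, and the threshold value are all assumptions chosen for illustration.

```python
from collections import Counter

# Illustrative sketch: names, quasi-identifiers, and threshold are assumptions.

def k_anonymity(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Return k, the size of the smallest equivalence class.

    An equivalence class is the set of records that share the same values
    for every quasi-identifier.
    """
    if not records:
        return 0
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())


records = [
    {"birth_year": 1980, "zip3": "021", "sex": "F"},
    {"birth_year": 1980, "zip3": "021", "sex": "F"},
    {"birth_year": 1975, "zip3": "100", "sex": "M"},
    {"birth_year": 1975, "zip3": "100", "sex": "M"},
    {"birth_year": 1990, "zip3": "941", "sex": "F"},  # unique combination
]

K_THRESHOLD = 2  # hypothetical risk threshold
k = k_anonymity(records, ["birth_year", "zip3", "sex"])
passes = k >= K_THRESHOLD

print(f"k = {k}, passes threshold: {passes}")
```

Raising the threshold lowers re-identification risk but usually forces coarser quasi-identifiers (for example, wider age bands), which is the utility trade-off described above.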

Full Synthetic

Replaces the entire message with a synthetic equivalent that preserves clinical structure but shares no values with the original.

Usage

# De-identify a directory
pidgeon deident --in ./real-messages --out ./safe-messages --date-shift 30d

# Deterministic output for team sharing
pidgeon deident --in ./real --out ./safe --date-shift 90d --salt "project-x"

# Preserve message control IDs for correlation
pidgeon deident --in ./real --out ./safe --date-shift 30d --keep-ids

What Gets Replaced

80+ PHI fields across 10 HL7 segment types are mapped and handled:

Identifier Type	HL7 Fields	Action
Patient name	PID.5	Replaced with synthetic name
MRN / Patient ID	PID.3	Replaced (or kept with `--keep-ids`)
SSN	PID.19	Removed entirely
Date of birth	PID.7	Date-shifted
Address	PID.11, NK1.4, GT1.5	Replaced with synthetic address
Phone / email	PID.13, PID.14, NK1.5	Replaced with synthetic values
Provider name / NPI	OBR.16, PV1.7, PV1.8, PV1.9	Replaced
Account number	PID.18	Replaced maintaining format
Insurance ID	IN1.36, IN2 fields	Replaced
Device / biometric IDs	Various	Removed or replaced
All date/datetime fields	Across all segments	Shifted by consistent offset
Free text fields	OBX.5, NTE.3	Scanned for embedded PHI patterns

Segments covered include MSH, PID, NK1, PV1, PV2, OBR, OBX, GT1, IN1, and IN2. Custom field mappings can be added for organization-specific PHI locations.
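Scanning free text fields such as OBX.5 and NTE.3 for embedded PHI is commonly done with pattern matching. The sketch below shows the general technique with three simple regular expressions. The patterns, the placeholder tokens, and the `redact_free_text` name are assumptions for illustration; they are not the patterns Pidgeon uses, and real scanners typically cover many more formats.

```python
import re

# Illustrative sketch: patterns and placeholder tokens are assumptions,
# not Pidgeon's actual rules. US-style formats only.
PHI_PATTERNS = [
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")),
    ("[EMAIL]", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
]


def redact_free_text(text: str) -> str:
    """Replace each match of a known PHI pattern with a placeholder token."""
    for token, pattern in PHI_PATTERNS:
        text = pattern.sub(token, text)
    return text


note = "Pt SSN 123-45-6789, call 555-123-4567 or jane.doe@example.com."
print(redact_free_text(note))
```

Pattern matching of this kind catches well-formatted identifiers but can miss names and unusual formats, so free text generally deserves manual review in addition to automated scanning.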

Risk Assessment

Post can assess re-identification risk for your de-identified output:

k-anonymity scoring: Measures whether individuals can be singled out, based on how many records share the same combination of quasi-identifier values

l-diversity analysis: Checks sensitive attribute diversity within equivalence classes

Compliance reporting: HTML and JSON reports suitable for audit documentation
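As a rough illustration of the l-diversity check, the sketch below computes distinct l-diversity: the smallest number of distinct sensitive values found in any equivalence class. The function name, field names, and sample data are hypothetical, and distinct l-diversity is only one of several variants; the source does not say which one the tool applies.

```python
from collections import defaultdict

# Illustrative sketch of distinct l-diversity; all names and data are
# assumptions, and the actual tool may use a different variant.

def l_diversity(records: list[dict], quasi_identifiers: list[str], sensitive: str) -> int:
    """Return l, the minimum count of distinct sensitive values per class."""
    groups: dict[tuple, set] = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return min((len(values) for values in groups.values()), default=0)


records = [
    {"birth_year": 1980, "zip3": "021", "diagnosis": "influenza"},
    {"birth_year": 1980, "zip3": "021", "diagnosis": "asthma"},
    {"birth_year": 1975, "zip3": "100", "diagnosis": "diabetes"},
    {"birth_year": 1975, "zip3": "100", "diagnosis": "diabetes"},
]

l_value = l_diversity(records, ["birth_year", "zip3"], "diagnosis")
print(f"l = {l_value}")
```

In this sample every class has two records, so the data is 2-anonymous, yet the second class has a single diagnosis. Anyone who can place a person in that class learns the diagnosis, which is the gap l-diversity is meant to expose.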

Consistency Across Batches

When de-identifying multiple messages from the same patient, relationships are preserved:

Same input MRN always produces the same synthetic MRN (within a salt context)

ID mappings persist across runs when using `--salt`

Temporal relationships between messages are maintained through consistent date shifting
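One common way to obtain these properties is a keyed hash for identifiers plus a single fixed offset for dates. The sketch below is an assumption about how such behavior can be achieved, not a description of Pidgeon's internals: the function names, the use of HMAC-SHA256, the 8-digit output format, and the forward direction of the shift are all illustrative choices.

```python
import hashlib
import hmac
from datetime import datetime, timedelta

# Illustrative sketch: HMAC-based mapping and a forward date shift are
# assumptions, not Pidgeon's documented mechanism.

def pseudonymize_mrn(mrn: str, salt: str, digits: int = 8) -> str:
    """Map an MRN to a numeric pseudonym that is stable for a given salt."""
    digest = hmac.new(salt.encode(), mrn.encode(), hashlib.sha256).hexdigest()
    return str(int(digest, 16) % 10**digits).zfill(digits)


def shift_hl7_date(value: str, days: int) -> str:
    """Shift an HL7 date (YYYYMMDD) or timestamp (YYYYMMDDHHMMSS) by whole days."""
    fmt = "%Y%m%d%H%M%S" if len(value) == 14 else "%Y%m%d"
    return (datetime.strptime(value, fmt) + timedelta(days=days)).strftime(fmt)


# Same MRN and salt give the same pseudonym on every run.
first_run = pseudonymize_mrn("12345", "project-x")
second_run = pseudonymize_mrn("12345", "project-x")

# Two events four days apart stay four days apart after shifting.
admit = shift_hl7_date("20240101", 30)
discharge = shift_hl7_date("20240105", 30)

print(first_run == second_run, admit, discharge)
```

Because the mapping depends only on the input and the salt, no lookup table has to be shared between team members; anyone holding the same salt reproduces the same pseudonyms. That also means the salt should be treated as a secret.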
