Four methods, ranging from simple Safe Harbor removal to full synthetic replacement, cover 80+ PHI fields across 10 HL7 segment types. Everything runs locally.
This tool assists with de-identification but does not guarantee HIPAA Safe Harbor compliance on its own. Always review output and consult your compliance team.

De-identification Methods

Removes all 18 HIPAA Safe Harbor identifiers. The simplest and most conservative approach.
Removes identifiers and replaces them with realistic synthetic values. The result looks like a real message — useful for testing downstream systems that reject empty fields.
Statistical approach using k-anonymity and l-diversity analysis. Configurable risk thresholds let you balance data utility against re-identification risk. Produces equivalence class analysis and risk scoring reports.
Replaces the entire message with a synthetic equivalent that preserves clinical structure but shares no values with the original.

Usage

# De-identify a directory
pidgeon deident --in ./real-messages --out ./safe-messages --date-shift 30d

# Deterministic output for team sharing
pidgeon deident --in ./real --out ./safe --date-shift 90d --salt "project-x"

# Preserve message control IDs for correlation
pidgeon deident --in ./real --out ./safe --date-shift 30d --keep-ids

What Gets Replaced

80+ PHI fields across 10 HL7 segment types are mapped and handled:
Identifier Type           Examples                      Action
Patient name              PID.5                         Replaced with synthetic name
MRN / Patient ID          PID.3                         Replaced (or kept with --keep-ids)
SSN                       PID.19                        Removed entirely
Date of birth             PID.7                         Date-shifted
Address                   PID.11, NK1.4, GT1.5          Replaced with synthetic address
Phone / email             PID.13, PID.14, NK1.5         Replaced with synthetic values
Provider name / NPI       OBR.16, PV1.7, PV1.8, PV1.9   Replaced
Account number            PID.18                        Replaced maintaining format
Insurance ID              IN1.36, IN2 fields            Replaced
Device / biometric IDs    Various                       Removed or replaced
All date/datetime fields  Across all segments           Shifted by consistent offset
Free text fields          OBX.5, NTE.3                  Scanned for embedded PHI patterns
Segments covered include MSH, PID, NK1, PV1, PV2, OBR, OBX, GT1, IN1, and IN2. Custom field mappings can be added for organization-specific PHI locations.
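
To make the field mapping idea concrete, here is a minimal Python sketch of how a per-field action table might be applied to one pipe-delimited HL7 v2 segment, plus a pattern scan for free text. It is an illustration only, not the tool's implementation: FIELD_ACTIONS, deidentify_segment, scrub_free_text, and the "REDACTED" placeholder are hypothetical names, and the sketch ignores components, repetitions, and non-default delimiters.

```python
import re

# Hypothetical action table keyed by (segment, field number).
# The real tool's mapping covers far more fields than this.
FIELD_ACTIONS = {
    ("PID", 5): "replace",   # patient name
    ("PID", 19): "remove",   # SSN
    ("NK1", 4): "replace",   # next-of-kin address
}

# One example of an embedded PHI pattern: a US SSN written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def deidentify_segment(segment: str, placeholder: str = "REDACTED") -> str:
    """Apply FIELD_ACTIONS to a single segment that uses '|' as field separator."""
    fields = segment.split("|")
    name = fields[0]
    # In MSH the separator itself counts as MSH.1, so list indexes lag by one.
    offset = 1 if name == "MSH" else 0
    for index in range(1, len(fields)):
        action = FIELD_ACTIONS.get((name, index + offset))
        if action == "remove":
            fields[index] = ""
        elif action == "replace" and fields[index]:
            fields[index] = placeholder
    return "|".join(fields)


def scrub_free_text(text: str) -> str:
    """Mask SSN-shaped substrings in a free text value such as OBX.5 or NTE.3."""
    return SSN_PATTERN.sub("[SSN]", text)


if __name__ == "__main__":
    pid = ["PID"] + [""] * 19
    pid[5] = "DOE^JANE"
    pid[19] = "123-45-6789"
    print(deidentify_segment("|".join(pid)))
    print(scrub_free_text("Patient reports SSN 123-45-6789 on intake form"))
```

A table-driven design like this keeps organization-specific additions small: a custom PHI location becomes one more entry in the mapping rather than new code.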

Risk Assessment

Post can assess re-identification risk for your de-identified output:
  • k-anonymity scoring — Measures whether individuals can be singled out
  • l-diversity analysis — Checks sensitive attribute diversity within equivalence classes
  • Compliance reporting — HTML and JSON reports suitable for audit documentation
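
As a rough illustration of what the two scores measure, the following Python sketch computes k-anonymity and distinct l-diversity over a small set of records. It is a textbook-style sketch, not the tool's scoring code: the function names, the quasi-identifier columns (zip3, birth_year, sex), and the sensitive column (diagnosis) are assumptions chosen for the example.

```python
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same values for every quasi-identifier."""
    classes = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        classes[key].append(record)
    return classes


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest equivalence class (0 for an empty dataset)."""
    classes = equivalence_classes(records, quasi_identifiers)
    return min((len(members) for members in classes.values()), default=0)


def l_diversity(records, quasi_identifiers, sensitive):
    """Fewest distinct sensitive values found in any equivalence class."""
    classes = equivalence_classes(records, quasi_identifiers)
    return min(
        (len({m[sensitive] for m in members}) for members in classes.values()),
        default=0,
    )


if __name__ == "__main__":
    # Hypothetical records; the column names are illustrative only.
    records = [
        {"zip3": "021", "birth_year": 1980, "sex": "F", "diagnosis": "A"},
        {"zip3": "021", "birth_year": 1980, "sex": "F", "diagnosis": "B"},
        {"zip3": "946", "birth_year": 1975, "sex": "M", "diagnosis": "A"},
        {"zip3": "946", "birth_year": 1975, "sex": "M", "diagnosis": "A"},
    ]
    qi = ["zip3", "birth_year", "sex"]
    print("k =", k_anonymity(records, qi))
    print("l =", l_diversity(records, qi, "diagnosis"))
```

In this example every class has two members, so no individual is unique on the quasi-identifiers, but the second class holds only one diagnosis value, which is exactly the weakness that l-diversity analysis is meant to surface.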

Consistency Across Batches

When de-identifying multiple messages from the same patient, relationships are preserved:
  • Same input MRN always produces the same synthetic MRN (within a salt context)
  • ID mappings persist across runs when using --salt
  • Temporal relationships between messages are maintained through consistent date shifting
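
One common way to get these properties is a keyed hash for identifiers and a fixed day offset for dates. The Python sketch below shows that general technique under stated assumptions; it is not a description of how the tool derives its values. The function names, the 10-digit output format, and the forward direction of the shift are all hypothetical.

```python
import hashlib
import hmac
from datetime import datetime, timedelta


def pseudonymize_id(original_id: str, salt: str, length: int = 10) -> str:
    """Derive a stable numeric pseudonym from an ID using HMAC-SHA256.

    The same (ID, salt) pair always yields the same output; a different
    salt yields an unrelated mapping.
    """
    digest = hmac.new(
        salt.encode("utf-8"), original_id.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return str(int(digest, 16) % 10**length).zfill(length)


# HL7 v2 timestamps at three common precisions, keyed by string length.
_FORMATS = {8: "%Y%m%d", 12: "%Y%m%d%H%M", 14: "%Y%m%d%H%M%S"}


def shift_hl7_timestamp(value: str, days: int) -> str:
    """Shift a timestamp by a fixed number of days, keeping its precision."""
    fmt = _FORMATS.get(len(value))
    if fmt is None:
        return value  # unrecognized precision is left untouched in this sketch
    shifted = datetime.strptime(value, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)


if __name__ == "__main__":
    print(pseudonymize_id("12345", "project-x"))
    print(pseudonymize_id("12345", "project-x"))  # same value as the line above
    print(shift_hl7_timestamp("20240115", 30))
    print(shift_hl7_timestamp("202401151030", 30))
```

Because every date moves by the same offset, the interval between an admission and a later result is unchanged, and because the pseudonym depends only on the input and the salt, no mapping file has to be shared for teammates to reproduce the same output.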