Record ingestion: raw provider data to longitudinal history
How raw provider records are deduplicated, FHIR-normalized, and merged into a patient's longitudinal clinical history.
Raw clinical records arrive from dozens of heterogeneous sources — EHR exports, patient-uploaded PDFs, lab direct connections, and pharmacy feeds. The ingestion system transforms that noise into a single, deduplicated, FHIR-normalized longitudinal history per patient. This document describes the topology, the lifecycle, and the key implementation decisions.
System topology
The ingestion stack has four logical layers: connector adapters at the edge, a normalization service in the middle, a deduplication engine, and the longitudinal store that downstream features read from.
Data flow: from connector event to stored record
Each connector emits an event onto a Cloudflare Queue when a new document or data chunk arrives. The normalizer worker consumes that event, validates it against the FHIR R4 schema, and enriches it with patient context before handing it off to the dedup engine.
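The normalizer step can be pictured as a pure function from a connector event to either a normalized record or a structured error. The following is a minimal sketch, not the production worker: the ConnectorEvent shape, the subject enrichment, and the use of the connector ID as meta.source are assumptions made for illustration, and the validation shown is a stand-in for full FHIR R4 schema validation.

```typescript
// Hypothetical event shape; the real connector payload is not specified here.
interface ConnectorEvent {
  eventId: string;
  connectorId: string;
  patientId: string;
  resource: Record<string, unknown>;
}

interface NormalizedRecord {
  patientId: string;
  resource: Record<string, unknown>;
}

type NormalizeResult =
  | { ok: true; record: NormalizedRecord }
  | { ok: false; error: string };

// Stand-in for FHIR R4 schema validation: only checks that resourceType
// is a non-empty string. A real validator would check the whole resource.
function normalize(event: ConnectorEvent): NormalizeResult {
  const resourceType = event.resource["resourceType"];
  if (typeof resourceType !== "string" || resourceType.length === 0) {
    return { ok: false, error: "missing resourceType" };
  }
  const existingMeta = event.resource["meta"];
  const meta =
    typeof existingMeta === "object" && existingMeta !== null ? existingMeta : {};
  return {
    ok: true,
    record: {
      patientId: event.patientId,
      resource: {
        ...event.resource,
        // Assumed form of "patient context": a subject reference.
        subject: { reference: `Patient/${event.patientId}` },
        // Assumed attribution: the connector ID is written to meta.source.
        meta: { ...meta, source: event.connectorId },
      },
    },
  };
}
```

In a Worker, a function like this would presumably be called once per message from the queue consumer handler, with the error branch routed to the dead-letter queue.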
Record lifecycle
Every record moves through five discrete stages. The pipeline cursor below shows the steady-state path — failure at any stage routes to the dead-letter queue rather than propagating a partial write.
If the normalizer succeeds but the dedup engine times out, the event lands in the dead-letter queue. Ops must reprocess DLQ entries manually after diagnosing the root cause. Do not bypass the DLQ by re-ingesting the original source event — you will create duplicates.
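The "dead-letter instead of partial write" rule can be sketched as a runner that commits only after every stage has succeeded. This is an illustrative sketch under stated assumptions: the Stage and DlqEntry types, the stage names, and the commit and deadLetter callbacks are all hypothetical, and the stages are synchronous here for brevity where real ones would be asynchronous.

```typescript
// Hypothetical types; the stage names and DLQ entry layout are illustrative.
type Stage<T> = { name: string; run: (input: T) => T };

interface DlqEntry<T> {
  failedStage: string;
  error: string;
  rawPayload: T;          // the original payload, not a partially processed one
  queueMessageId: string; // kept for correlation
}

function runPipeline<T>(
  payload: T,
  queueMessageId: string,
  stages: Stage<T>[],
  commit: (result: T) => void,
  deadLetter: (entry: DlqEntry<T>) => void,
): "stored" | "dead-lettered" {
  let current = payload;
  for (const stage of stages) {
    try {
      current = stage.run(current);
    } catch (err) {
      // Any stage failure dead-letters the event; nothing has been written yet.
      deadLetter({
        failedStage: stage.name,
        error: err instanceof Error ? err.message : String(err),
        rawPayload: payload,
        queueMessageId,
      });
      return "dead-lettered";
    }
  }
  // The single write happens only after all stages succeed.
  commit(current);
  return "stored";
}
```

Keeping the write as the last step is what makes a dedup timeout safe in this sketch: the event is parked with its original payload and message ID, and the store is untouched.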
Deduplication strategy
The dedup engine uses a Durable Object per patient to serialize all matching decisions for that patient. It computes a composite fingerprint from (resourceType, code.system, code.code, effectiveDateTime ± 24h) for clinical observations, and (resourceType, identifier.system, identifier.value) for documents and encounters.
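As a rough illustration of the two fingerprint shapes, the sketch below builds an exact-match key from the non-temporal fields and compares effectiveDateTime separately against the window. The type names, the "|" key separator, the use of the first coding and first identifier, and the inclusive window boundary are assumptions, not details confirmed by the engine.

```typescript
// Minimal subset of FHIR R4 fields needed for fingerprinting.
interface Coding { system: string; code: string }
interface Identifier { system: string; value: string }

interface ObservationLike {
  resourceType: "Observation";
  code: { coding: Coding[] };
  effectiveDateTime: string; // ISO 8601
}

interface IdentifiedLike {
  resourceType: "DocumentReference" | "Encounter";
  identifier: Identifier[];
}

type Fingerprint =
  | { kind: "observation"; key: string; effectiveMs: number }
  | { kind: "identified"; key: string };

function fingerprint(r: ObservationLike | IdentifiedLike): Fingerprint {
  if (r.resourceType === "Observation") {
    const c = r.code.coding[0]; // assumption: the first coding is authoritative
    if (!c) throw new Error("Observation has no coding");
    return {
      kind: "observation",
      key: [r.resourceType, c.system, c.code].join("|"),
      effectiveMs: Date.parse(r.effectiveDateTime),
    };
  }
  const id = r.identifier[0]; // assumption: the first identifier is authoritative
  if (!id) throw new Error("resource has no identifier");
  return { kind: "identified", key: [r.resourceType, id.system, id.value].join("|") };
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Keys must match exactly; observations must also fall within the window.
function fingerprintsMatch(a: Fingerprint, b: Fingerprint, windowMs: number = DAY_MS): boolean {
  if (a.kind !== b.kind || a.key !== b.key) return false;
  if (a.kind === "observation" && b.kind === "observation") {
    return Math.abs(a.effectiveMs - b.effectiveMs) <= windowMs;
  }
  return true;
}
```

Splitting the fingerprint into an exact key plus a timestamp avoids hashing a time range, which would otherwise miss near-duplicates that straddle a bucket boundary.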
A new record is considered a duplicate if its fingerprint matches an existing stored record within the same patient scope. When a match is found, the engine compares lastUpdated timestamps and retains the more recent version, updating the stored record in place rather than appending a new row.
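The retain-the-newer rule might look like the sketch below. It assumes a per-patient map keyed by fingerprint as a stand-in for the Durable Object's storage, and it treats equal timestamps as a duplicate; both are illustrative choices rather than confirmed behavior. The StoredRecord type and the applyIncoming name are hypothetical.

```typescript
// Hypothetical stored-record shape; lastUpdated is an ISO 8601 timestamp.
interface StoredRecord {
  id: string;
  lastUpdated: string;
  body: unknown;
}

type Decision = "new" | "updated" | "duplicate";

// `store` stands in for one patient's records, keyed by fingerprint.
function applyIncoming(
  store: Map<string, StoredRecord>,
  fingerprintKey: string,
  incoming: StoredRecord,
): Decision {
  const existing = store.get(fingerprintKey);
  if (existing === undefined) {
    store.set(fingerprintKey, incoming);
    return "new";
  }
  if (Date.parse(incoming.lastUpdated) > Date.parse(existing.lastUpdated)) {
    // Update in place: the stored row keeps its id, so no new row is appended.
    existing.lastUpdated = incoming.lastUpdated;
    existing.body = incoming.body;
    return "updated";
  }
  // Older or same-age copy: keep the stored version untouched.
  return "duplicate";
}
```

Because a single Durable Object handles each patient, a read-then-write sequence like this one would not race with another decision for the same patient.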
The 24-hour window on effectiveDateTime absorbs common EHR clock skew. For lab results with millisecond-precision timestamps, tighten the window to zero by setting the dedupWindow config key on that connector.
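A per-connector override could be resolved as in the sketch below. Only the dedupWindow key name comes from the text above; the connector IDs, the config layout, the millisecond unit, and the fallback to 24 hours for unlisted connectors are assumptions.

```typescript
// Hypothetical config layout; dedupWindow is assumed to be in milliseconds.
interface ConnectorConfig {
  dedupWindow: number;
}

const DEFAULT_DEDUP_WINDOW_MS = 24 * 60 * 60 * 1000;

// Illustrative connector IDs, not real ones.
const connectorConfigs = new Map<string, ConnectorConfig>([
  ["lab-direct", { dedupWindow: 0 }], // millisecond-precision lab results
  ["ehr-export", { dedupWindow: DEFAULT_DEDUP_WINDOW_MS }],
]);

// Falls back to the 24-hour default when a connector has no override.
function dedupWindowFor(connectorId: string): number {
  return connectorConfigs.get(connectorId)?.dedupWindow ?? DEFAULT_DEDUP_WINDOW_MS;
}
```

Note the use of ?? rather than ||, so that an explicit window of zero is respected instead of being replaced by the default.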
FHIR normalization contract
Every record leaving the normalizer must conform to the following shape. Connectors that cannot produce a conformant record send the raw payload to the dead-letter queue with a structured error attached.
The meta.source field is the primary attribution key — it tells downstream consumers which connector produced the record and enables per-source replay without touching records from other connectors.
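Per-source replay then reduces to selecting on meta.source. The sketch below assumes a simplified stored-resource type and a hypothetical selectForReplay helper; how replay is actually triggered is not described here.

```typescript
// Simplified view of a stored FHIR resource; only the fields used here.
interface StoredResource {
  id: string;
  resourceType: string;
  meta: { source: string; lastUpdated: string };
}

// Returns only the records attributed to the given connector, leaving
// records from every other source out of the replay set.
function selectForReplay(records: StoredResource[], source: string): StoredResource[] {
  return records.filter((r) => r.meta.source === source);
}
```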
Provenance and audit
- Every write to the FHIR store appends a row to record_provenance with (record_id, connector_id, event_id, ingested_at).
- The dedup engine logs its decision (new | updated | duplicate) alongside the winning fingerprint to dedup_log.
- DLQ entries include the full raw payload, the normalizer error, and the original queue message ID for correlation.
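The audit rows described in the list above can be sketched as plain objects. The record_provenance column names come from the text; the dedup_log column names, the auditRows helper, and the assumption that a duplicate decision causes no store write (and therefore no provenance row) are illustrative.

```typescript
type DedupDecision = "new" | "updated" | "duplicate";

interface ProvenanceRow {
  record_id: string;
  connector_id: string;
  event_id: string;
  ingested_at: string; // ISO 8601
}

// Column names here are hypothetical; the text only names the logged values.
interface DedupLogRow {
  decision: DedupDecision;
  fingerprint: string;
}

function auditRows(
  decision: DedupDecision,
  fingerprintKey: string,
  recordId: string,
  connectorId: string,
  eventId: string,
  now: Date,
): { provenance: ProvenanceRow | null; dedupLog: DedupLogRow } {
  // Assumption: only "new" and "updated" result in a write to the store.
  const wroteToStore = decision !== "duplicate";
  return {
    provenance: wroteToStore
      ? {
          record_id: recordId,
          connector_id: connectorId,
          event_id: eventId,
          ingested_at: now.toISOString(),
        }
      : null,
    dedupLog: { decision, fingerprint: fingerprintKey },
  };
}
```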
See the longitudinal record PRD for the product goals this architecture serves, and the 2026 records roadmap for the delivery timeline.