Published on 15/11/2025
Engineering Data Quality and Provenance for Regulatory-Grade Real-World Evidence
Purpose, Principles, and the Global Frame for Trusted RWD
Real-world data (RWD) becomes decision-grade real-world evidence (RWE) when its quality can be explained and proven in minutes. Quality is not a single score; it is a set of properties—fitness for purpose, conformance to standards, completeness, timeliness, accuracy, and consistency—tied together by a readable chain of provenance from the analytic table back to the originating record. Provenance answers four questions for every value used in analysis: who created or changed it, what the change was, when it happened, and why.
Harmonized anchors. A proportionate, quality-by-design posture reflects principles shared by the International Council for Harmonisation. U.S. expectations around participant protection and trustworthy electronic records are summarized in educational material provided by the Food and Drug Administration. European evaluation perspectives and terminology are presented by the European Medicines Agency, while ethical touchstones—respect, fairness, intelligibility—are emphasized by the World Health Organization. Programs spanning Japan and Australia should keep terminology and packaging coherent with information shared by PMDA and the Therapeutic Goods Administration so that a single evidence story travels across jurisdictions.
ALCOA++ as the spine. Records must be attributable, legible, contemporaneous, original, accurate, complete, consistent, enduring, and available. Translate this into operations: identity-bound signatures, human-readable audit trails, immutable timestamps (local and UTC), version-locked algorithms and code lists, and “five-minute retrieval drills” that click from any table cell to the raw artifact and its audit trail. If an analyst needs an afternoon to reconstruct a number, the control has failed—no matter how polished the dashboard looks.
System-of-record clarity. Avoid “two truths.” Declare which platform is authoritative for each object: EHR/EMR systems for clinical artifacts; claims platforms for adjudicated encounters, dispenses, and costs; registries for natural history and device performance; PRO platforms for signed instruments; and your analytical lakehouse for harmonized copies with lineage. Do not let spreadsheets or ad-hoc exports become unofficial sources of record; store links and hashes, not silent duplicates.
Fitness for purpose, not perfection. Data quality is contextual. A claims dataset can be superb for utilization chronology but poor for clinical severity; an EHR network can supply granular labs with occasional measurement idiosyncrasies; PROs provide patient-centric outcomes but demand psychometric discipline. Define quality requirements from the estimand: if the endpoint is hospitalization-free survival, timeliness and discharge coding specificity dominate; if the endpoint is a lab threshold, unit normalization and device metadata matter most. Write these requirements before accessing data to prevent retrofitting.
Standards and semantics. Harmonize to controlled terminologies—SNOMED CT for conditions, LOINC for labs, RxNorm/ATC for drugs, UCUM for units, ICD-10-CM/PCS and CPT/HCPCS for administrative coding. Preserve the mapping tables as first-class, version-controlled artifacts with short, human-readable notes explaining what changed and why. For EHR exchanges, capture HL7 FHIR Provenance alongside content so that identity, location, and device context are never guesses.
Provenance by Design: Lineage, Manifests, and Reproducibility That Explain Themselves
Ingestion manifests. Every intake into the analytical platform should carry a manifest: source identifier, legal basis/consent reference, schema version, terminology versions, file names and byte sizes, cryptographic hashes, record counts by domain, and a timestamp for when the data left the source. Manifests make “what exactly did we analyze?” a button click rather than an archaeological dig.
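A manifest builder can be sketched in a few lines. This is a minimal illustration, not a prescribed schema: the field names (`source_id`, `legal_basis`, and so on) are assumptions chosen to mirror the list above, and the SHA-256 fingerprint is streamed so large extracts never load into memory.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large extracts don't load into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(source_id: str, legal_basis: str, schema_version: str,
                   terminology_versions: dict, files: list[Path],
                   record_counts: dict) -> dict:
    """Assemble the intake manifest described above; field names are illustrative."""
    return {
        "source_id": source_id,
        "legal_basis": legal_basis,
        "schema_version": schema_version,
        "terminology_versions": terminology_versions,
        "files": [
            {"name": f.name, "bytes": f.stat().st_size, "sha256": sha256_of(f)}
            for f in files
        ],
        "record_counts_by_domain": record_counts,
        "extracted_at_utc": datetime.now(timezone.utc).isoformat(),
    }
```

Stored as version-controlled JSON next to the raw zone, a record like this is what turns "what exactly did we analyze?" into a button click.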
Stable identifiers and joins. Establish deterministic keys for patients, encounters, labs, and exposures that survive system upgrades and vendor swaps. For linkage, prefer privacy-preserving tokens or deterministic keys under access control. Store linkage quality metrics (match rates, duplicates, conflicts) and keep the crosswalk as a controlled artifact—never inline IDs into filenames or logs.
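A deterministic privacy-preserving token can be derived with a keyed hash: the same identifiers always produce the same join key, but raw IDs never leave the source. The field choice and normalization rules below are illustrative assumptions; real tokenization schemes add agreed-upon cleansing rules and key management under access control.

```python
import hashlib
import hmac

def linkage_token(secret_key: bytes, patient_id: str, dob: str, sex: str) -> str:
    """Derive a deterministic, privacy-preserving join key via HMAC-SHA256.
    Normalization (trim/upper-case) is an illustrative assumption; agree on
    the exact rules with every linkage partner, or matches will silently fail."""
    message = "|".join([patient_id.strip().upper(), dob, sex.strip().upper()])
    return hmac.new(secret_key, message.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because the key is secret and held under access control, the token cannot be reversed into the raw identifier by downstream consumers, yet two sources holding the same key and the same normalization rules will independently derive the same join key.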
Unit and vocabulary normalization. Normalize labs and measurements to UCUM; bind each to its LOINC code and specimen metadata. Record the device model/firmware and method (where available) to interpret shifts. For medications, keep NDC↔RxNorm mappings current; for diagnoses and procedures, track ICD/CPT versions and transitions. A single, version-locked “standards registry” reduces drift across studies and time.
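The standards registry idea can be sketched as a version-locked lookup keyed by LOINC code and unit pair. The entries below are illustrative (the glucose mg/dL→mmol/L factor is an assumption of this sketch, not a shipped table); the important behaviors are that unmapped units fail loudly and route to quarantine rather than passing through.

```python
# Illustrative version-locked registry; a real one is a controlled artifact
# with change notes. Target units and factors here are sketch assumptions.
TARGET_UNITS = {"2345-7": "mmol/L"}                      # serum glucose (example)
CONVERSIONS = {("2345-7", "mg/dL", "mmol/L"): 0.0555}    # illustrative factor

def normalize(loinc: str, value: float, unit: str) -> tuple[float, str]:
    """Normalize one lab value to its registered target unit.
    Unknown codes or units raise, so ambiguous rows go to quarantine
    instead of silently contaminating the harmonized layer."""
    target = TARGET_UNITS.get(loinc)
    if target is None:
        raise ValueError(f"no target unit registered for LOINC {loinc}")
    if unit == target:
        return value, unit
    factor = CONVERSIONS.get((loinc, unit, target))
    if factor is None:
        raise ValueError(f"unmapped unit {unit!r} for LOINC {loinc}; quarantine")
    return value * factor, target
```

The hard failure on unmapped units is the point: a conversion table that guesses is worse than one that stops.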
Derivations that travel. Derived variables (e.g., “on-treatment exposure,” “comorbidity score,” “visit window status”) must store code and parameter hashes, inputs, and a short description in plain language. Parameterized notebooks or SQL should render a one-page “recipe” per derivation that can be read by clinicians and auditors alike. If a reviewer cannot understand the steps without reading source code, the derivation is too opaque.
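One way to bind a derived variable to its exact code and parameters is to hash both at derivation time. This sketch hashes the function's compiled bytecode as a drift proxy (an assumption of this example; in production you would hash the version-controlled source file) and canonicalized JSON of the parameters:

```python
import hashlib
import json

def derivation_recipe(func, params: dict, inputs: list[str], description: str) -> dict:
    """Bind a derived variable to the code and parameters that produced it.
    Hashing the compiled bytecode is a sketch-level proxy for hashing the
    version-controlled source; parameter JSON is sorted so the hash is stable."""
    code_hash = hashlib.sha256(func.__code__.co_code).hexdigest()
    param_hash = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return {
        "derivation": func.__name__,
        "description": description,   # plain language for clinicians and auditors
        "inputs": inputs,             # upstream tables or columns
        "parameters": params,
        "code_sha256": code_hash,
        "params_sha256": param_hash,
    }
```

Render this dictionary as the one-page "recipe" per derivation; the hashes let a reviewer confirm nothing drifted between runs without reading the source.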
Sealed data cuts. Freeze time-stamped, write-protected snapshots of all tables and files used for an analysis, plus the exact code and environment. Tables and figures reference the cut ID and code hash so they can be regenerated byte-for-byte months later. Sealed cuts end arguments about “which refresh” produced a result and are indispensable when multiple agencies or journals ask for reproduction.
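A sealed cut reduces, mechanically, to a manifest of fingerprints plus a verifier. The layout and field names below are a sketch under assumed conventions (JSON manifest per cut, tables as flat files), not a prescribed format:

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def seal_cut(cut_id: str, table_files: list[Path], code_hash: str,
             out_dir: Path) -> Path:
    """Write a sealed-cut manifest: every table file is fingerprinted so a
    later regeneration can be checked byte-for-byte against this record."""
    entries = [
        {"file": f.name, "sha256": hashlib.sha256(f.read_bytes()).hexdigest()}
        for f in sorted(table_files)
    ]
    manifest = {
        "cut_id": cut_id,
        "code_sha256": code_hash,
        "sealed_at_utc": datetime.now(timezone.utc).isoformat(),
        "tables": entries,
    }
    path = out_dir / f"{cut_id}.manifest.json"
    path.write_text(json.dumps(manifest, sort_keys=True, indent=2))
    return path

def verify_cut(manifest_path: Path, data_dir: Path) -> bool:
    """Re-hash every table and compare against the sealed manifest."""
    manifest = json.loads(manifest_path.read_text())
    return all(
        hashlib.sha256((data_dir / t["file"]).read_bytes()).hexdigest() == t["sha256"]
        for t in manifest["tables"]
    )
```

`verify_cut` is what a nightly regeneration test calls; a `False` anywhere is a reproducibility incident, not a curiosity.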
Audit trail readability. Keep human-readable views of imports, transforms, and exports with filtering by date, user, table, and study. Include summaries (“rows changed,” “columns added,” “units normalized”) and links to the manifests. Cryptic logs are not compliance; they are stress.
Time and clocks. Persist both local time and UTC for clinical events, ingestions, and transforms. Record time-zone and DST transitions so event order and exposure windows are defensible across regions. For telehealth or home capture, store visit modality and identity assurance context to support data integrity assertions.
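Persisting both clocks is simple when done at write time and painful to reconstruct later. A minimal sketch, assuming events arrive as naive local ISO timestamps plus a named IANA zone:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def event_timestamps(local_iso: str, tz_name: str) -> dict:
    """Persist local time, UTC, the zone name, and the offset in force,
    so event ordering survives DST transitions and cross-region joins.
    Ambiguous fall-back times need an explicit fold policy on top of this."""
    local = datetime.fromisoformat(local_iso).replace(tzinfo=ZoneInfo(tz_name))
    utc = local.astimezone(ZoneInfo("UTC"))
    return {
        "local": local.isoformat(),
        "utc": utc.isoformat(),
        "tz": tz_name,
        "utc_offset": local.strftime("%z"),
    }
```

Storing the offset actually in force (not just the zone name) is what makes exposure windows defensible across a DST boundary.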
Interfaces and APIs. Where feeds are API-based, record rate limits, retries, and failure queues. Enforce idempotency and attach correlation IDs so a failed batch can be replayed without duplication. Designate quarantine zones for payloads that fail conformance checks, and require explicit release after remediation.
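The idempotency-plus-quarantine pattern can be sketched as a small intake object. The in-memory stores and the conformance rule are assumptions of this sketch; production would persist the processed set and run full schema checks.

```python
class IdempotentIntake:
    """Replay-safe batch intake: each payload carries a correlation ID, and a
    processed-set guarantees a failed batch can be resubmitted without
    duplication. In-memory state is a sketch; persist it in production."""

    def __init__(self):
        self._processed: set[str] = set()
        self.quarantine: list[dict] = []
        self.accepted: list[dict] = []

    def submit(self, payload: dict, correlation_id: str) -> str:
        if correlation_id in self._processed:
            return "duplicate_ignored"
        if not self._conforms(payload):
            # Quarantined payloads require explicit release after remediation;
            # they are NOT marked processed, so a fixed resubmission succeeds.
            self.quarantine.append({"correlation_id": correlation_id,
                                    "payload": payload})
            return "quarantined"
        self.accepted.append(payload)
        self._processed.add(correlation_id)
        return "accepted"

    @staticmethod
    def _conforms(payload: dict) -> bool:
        # Minimal conformance check (illustrative): required keys present.
        return {"patient_id", "domain", "rows"} <= payload.keys()
```

Note the asymmetry: accepted IDs are remembered so replays are no-ops, while quarantined IDs are not, so a remediated batch flows through cleanly.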
Files beyond tables. For imaging, waveforms, and PDFs, store raw objects in durable storage with checksums; keep human-readable renders nearby and link them from tables with deterministic paths. Analysts and reviewers should be able to click from a Kaplan–Meier point to the exact report or image that justified the event.
Retention and restoration. Back up raw zones, manifests, lineage graphs, and sealed cuts. Quarterly restore drills should demonstrate that records, audit trails, and signatures return intact within RTO/RPO. Restoration is part of provenance—if you cannot get proof back after an incident, you never had it.
Measuring Quality: Metrics, Dashboards, KRIs/QTLs, and Fed-Network Realities
Define metrics tied to the estimand. Quality metrics should mirror the decision the study must support:
- Completeness: proportion of required fields populated for the target cohort and window (e.g., labs within ±7 days of index).
- Timeliness: ingestion and refresh lag vs. SLA (e.g., 95% of feeds within 14 days); claims adjudication lag modeled explicitly.
- Accuracy: PPV/NPV from chart validation subsamples for key outcomes; unit checks against biologic plausibility.
- Conformance: adherence to schema, code sets, and units; percentage of values mapping to recognized terminologies.
- Consistency: longitudinal stability (e.g., sudden code mix shifts after policy change), cross-table coherence (order→result).
- Uniqueness: duplicate person or encounter rates; de-duplication success for multi-source linkages.
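The completeness metric above is concrete enough to compute directly. A minimal sketch, assuming index dates and lab dates arrive as simple per-patient mappings (`{patient_id: date}` and `{patient_id: [dates]}`):

```python
from datetime import date, timedelta

def lab_completeness(index_dates: dict, lab_dates: dict,
                     window_days: int = 7) -> float:
    """Completeness as defined above: share of cohort patients with at least
    one required lab within +/- window_days of their index date."""
    if not index_dates:
        return 0.0
    window = timedelta(days=window_days)
    hits = sum(
        1 for pid, idx in index_dates.items()
        if any(abs(d - idx) <= window for d in lab_dates.get(pid, []))
    )
    return hits / len(index_dates)
```

Timeliness, conformance, and uniqueness follow the same shape: a denominator defined by the estimand, a numerator defined by a checkable predicate, and a drill-through from the rate to the failing rows.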
Dashboards that click to proof. Display metrics by source, site, and study with trend lines and drill-through to the underlying records, manifests, and change notes. At a minimum: mapping error rate, unit normalization failures, completeness by domain/wave, ingestion lag, negative-control outcome rates, and five-minute retrieval pass rate. Numbers without provenance links are not inspection-ready.
Key Risk Indicators (KRIs) and Quality Tolerance Limits (QTLs). Examples of KRIs: spikes in “unknown/other” codes; abrupt shifts in diagnosis or procedure mix; rising linkage conflicts; recurrent unit anomalies; sealed-cut reproducibility failures. Promote consequential KRIs to QTLs, such as: “post-mapping missingness >10% in any critical field,” “ingestion lag >30 days for >10% of feeds,” “≥5% of lab rows failing UCUM normalization,” “PPV <80% in validation subsample for primary endpoint,” or “retrieval pass rate <95%.” Crossing a limit triggers containment (freeze analyses, isolate sources), dated corrective plans, and owner assignment.
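QTL enforcement is mechanical once the limits are written down. The registry below restates the example thresholds from the text; the thresholds themselves are program decisions, not fixed rules, and the metric names are assumptions of this sketch.

```python
# Illustrative QTL registry drawn from the examples above.
QTLS = {
    "missingness_critical_field": 0.10,   # post-mapping missingness > 10%
    "ucum_failure_rate": 0.05,            # >= 5% lab rows failing normalization
    "retrieval_pass_rate_min": 0.95,      # retrieval drill pass rate < 95%
}

def evaluate_qtls(observed: dict) -> list[str]:
    """Return the list of breached limits. Any breach should trigger
    containment, a dated corrective plan, and a named owner."""
    breaches = []
    if observed.get("missingness_critical_field", 0.0) > QTLS["missingness_critical_field"]:
        breaches.append("missingness_critical_field")
    if observed.get("ucum_failure_rate", 0.0) >= QTLS["ucum_failure_rate"]:
        breaches.append("ucum_failure_rate")
    if observed.get("retrieval_pass_rate", 1.0) < QTLS["retrieval_pass_rate_min"]:
        breaches.append("retrieval_pass_rate")
    return breaches
```

Wiring this into the dashboard means a crossed limit is a machine decision with a human response, not a judgment call made under deadline pressure.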
Negative controls and coherence checks. Use negative-control outcomes (not plausibly affected by exposure) and exposures (not plausibly affecting outcome) to probe residual biases and data idiosyncrasies. Add coherence checks such as “procedure without eligible diagnosis,” “death after subsequent encounters,” or “dispense without coverage,” routed to data stewards for remediation and documentation.
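Two of the coherence checks above can be sketched against a flattened per-patient view (the record layout is an illustrative assumption):

```python
from datetime import date

def coherence_flags(patient: dict) -> list[str]:
    """Route records failing basic cross-source coherence to data stewards.
    Checks two of the examples above: encounters after death and dispenses
    outside any coverage period."""
    flags = []
    death = patient.get("death_date")
    if death is not None and any(
        e > death for e in patient.get("encounter_dates", [])
    ):
        flags.append("encounter_after_death")
    covered = patient.get("coverage_periods", [])   # list of (start, end)
    for d in patient.get("dispense_dates", []):
        if not any(start <= d <= end for start, end in covered):
            flags.append("dispense_without_coverage")
            break
    return flags
```

Each flag should carry the record keys and manifest links needed for the steward to remediate and document, not just a count on a dashboard.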
Validation subsamples and sampling frames. For EHR-derived outcomes, perform chart review subsamples sized to bound PPV/NPV with useful precision. Select records stratified by site and time to reveal heterogeneity, and file abstraction tools and decision aids as controlled documents. For device readings, include spot checks of raw files and device logs.
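"Bound PPV with useful precision" is a concrete calculation. A standard choice is the Wilson score interval on the proportion of reviewed charts confirmed as true positives; the sketch below implements that interval so a subsample size can be checked against the precision it actually delivers.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a validation-subsample PPV:
    `successes` confirmed true positives out of `n` reviewed charts.
    z = 1.96 gives an approximate 95% interval."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - half, centre + half)
```

If the lower bound of the interval falls below the PPV QTL (say, 80%), the subsample is either revealing a real problem or is simply too small to tell; both outcomes are worth knowing before the endpoint is used.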
Federated networks. When data cannot leave institutions, ship algorithms to sites with a common data model. Record each site’s execution environment (terminology versions, software versions, algorithm hashes). Return only de-identified aggregates or subject-level outputs under governance. Meta-analyze site-level results with random effects when practice patterns differ materially. Provenance includes the site’s “who/what/when/why,” not just pooled results.
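The random-effects pooling step can be sketched with the DerSimonian–Laird estimator, taking each site's returned effect estimate (for example, a log hazard ratio) and within-site variance. This is one standard choice, not the only one; the implementation below is a plain-Python sketch.

```python
def dersimonian_laird(estimates: list[float],
                      variances: list[float]) -> tuple[float, float]:
    """DerSimonian-Laird random-effects pooling of site-level estimates.
    Returns (pooled estimate, between-site variance tau^2)."""
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    # Cochran's Q heterogeneity statistic against the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    k = len(estimates)
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Re-weight with tau^2 added to each site's variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    return pooled, tau2
```

Provenance for this step means filing, per site, the algorithm hash, terminology versions, and execution timestamp alongside the estimate and variance that entered the pool.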
Operational monitors and alerts. Automate notifications for schema drift, vocabulary updates, API failures, and rising ingestion lag. Tag incidents with severity and business impact; keep a public (within the program) changelog with “what changed and why” in plain language so analysis teams are not surprised mid-workstream.
People and training. Quality lives or dies with human behavior. Train analysts to use standard mappings and recipes; train clinicians and abstraction teams on definition nuance; train data stewards to triage anomalies efficiently. Capture “I applied this” attestations tied to records for key steps—especially when manual review determines outcome assignments.
Governance, Contracts, 30–60–90 Plan, Pitfalls, and a Ready-to-Use Checklist
Ownership and the meaning of approval. Keep decision rights small and named: a Data Steward (standards and lineage), Clinical/Epidemiology Lead (definitions and plausibility), Biostatistician (estimands and quality metrics), Security/Privacy Lead (identity, linkage, access), and Quality (ALCOA++ checks and retrieval drills). Each sign-off states its meaning—“mappings verified,” “endpoint definitions validated,” “privacy controls tested,” “sealed-cut reproducibility confirmed.” Ambiguous approvals become inspection liabilities.
SOPs and documentation. Publish concise SOPs for ingestion, mapping, derivation, sealed cuts, validation subsamples, and restoration. Pair each with role-based work instructions and embedded checklists. Store deviations with a short “what changed and why” note and residual risk rationale. Documentation should be short, human-readable, and obviously tied to outcomes that matter.
Contracts and supplier governance. Treat data partners and technology vendors as part of your evidence system. Contracts must guarantee export rights (data, metadata, audit trails, manifests) in open formats; define uptime/SLA and change-notice windows; and require immutable logs and time-boxed access for service accounts. For clinical sources, specify coding/standards commitments, chart validation support, and obligations to notify of coding practice changes (e.g., new order sets) that could shift apparent incidence.
30–60–90-day implementation plan. Days 1–30: define estimand-aligned quality requirements; declare authoritative systems; inventory sources and standards; draft ingestion/lineage SOPs; create the standards registry; and run a five-minute retrieval drill on a pilot feed. Days 31–60: stand up manifests, unit/vocabulary normalization, and sealed cuts; configure dashboards with completeness/timeliness/conformance metrics; launch validation subsample workflows; and publish KRIs/QTLs with thresholds. Days 61–90: expand to all sources; automate schema-drift and lag alerts; institutionalize monthly negative-control and reproducibility checks; enforce QTLs with containment playbooks; and convert recurrent issues into design fixes (mapping rules, data contracts), not reminders.
Common pitfalls—and durable fixes.
- “Quality theater.” Beautiful dashboards with no lineage. Fix with manifests, code hashes, and sealed cuts wired into every tile.
- Two sources of truth. Shadow extracts drive analysis. Fix with system-of-record declarations and deep links; retire uncontrolled copies.
- Unit chaos. Labs compared across inconsistent units. Fix with UCUM normalization and hard blocks on ambiguous values.
- Schema drift surprises. A minor EHR upgrade breaks definitions. Fix with drift monitors, quarantine zones, and change-notice obligations.
- Unreproducible figures. Re-runs don’t match. Fix with sealed cuts, code pinning, and nightly regeneration tests of key tables.
- Opaque transformations. Derivations buried in code. Fix with one-page recipes and parameter hashes visible to clinicians.
- Linkage overconfidence. False matches distort effects. Fix with cross-source coherence checks, conflict logs, and stratified validation.
Ready-to-use data quality & provenance checklist (paste into your SOP or build plan).
- Authoritative systems declared; deep links replace shadow copies.
- Standards registry published (SNOMED/LOINC/RxNorm/ATC/UCUM; ICD/CPT) with versions and change notes.
- Ingestion manifests capture hashes, schema/terminology versions, counts, timestamps, and legal basis.
- Unit and vocabulary normalization active; device/method metadata retained where available.
- Derivation recipes stored with inputs, parameters, hashes, and plain-language descriptions.
- Sealed data cuts implemented; table/figure footers cite cut IDs and code hashes.
- Dashboards show completeness, timeliness, conformance, consistency, uniqueness, and negative-control results with drill-through to artifacts.
- KRIs/QTLs defined and enforced; containment playbooks documented with owners and dates.
- Validation subsamples executed and filed; PPV/NPV documented for key outcomes.
- Restore drills passed; records, audit trails, and signatures return intact within RTO/RPO.
Bottom line. Trusted RWE is not an accident—it is engineered. Build a small, disciplined system where standards and mappings are version-locked, transformations are readable, sealed cuts anchor every number, dashboards click to proof, and retrieval drills are routine. Do that once and your teams will protect participants, move faster, and face regulators, HTA bodies, and journals with confidence.