Published on 16/11/2025
Submitting Real-World Evidence with Confidence: Design, Dossier, and Governance
Purpose, Fit-for-Purpose Criteria, and the Global Compliance Frame
Real-world evidence (RWE) becomes submission-grade when three elements align: a precise decision question, a defensible design that answers that question, and a traceable story from the originating records to the number printed in a table. Reviewers do not require perfection; they require proportionate controls, transparency, and reproducibility that protect participants and serve public health. This article offers a compliance-first playbook for moving RWE into regulatory dossiers—how to define the decision, engineer the design, and package the evidence so reviewers can trace every number back to its source.
Harmonized anchors. A risk-proportionate, quality-by-design posture is consistent with principles shared by the International Council for Harmonisation. U.S. perspectives on participant protection and trustworthy electronic records that frame observational research appear in public materials from the U.S. Food and Drug Administration. European terminology and evaluation concepts are described by the European Medicines Agency, while ethical and methodological touchstones are echoed by the World Health Organization. For multiregional programs, align artifacts and wording with information shared by Japan’s PMDA and Australia’s Therapeutic Goods Administration so the same methods travel cleanly across jurisdictions.
Define the regulatory decision first. Every submission starts with a one-sentence “why now.” Are you seeking a label expansion in a defined population, fulfilling a post-authorization safety commitment, bridging effectiveness to a new formulation or route, or providing supportive evidence for a single-arm trial? Express the estimand up front—population, treatment strategies, endpoint, handling of intercurrent events, summary measure, and time horizon. All subsequent choices (design, data sources, confounding plan, and statistical estimators) must serve that estimand.
Fit-for-purpose criteria. Demonstrate why the design and data are suitable for the decision: completeness and timeliness of exposure and outcome capture; ability to pin time zero; algorithm validity; measurement frequency relative to the endpoint; and prespecified controls for confounding, missing data, and bias. When a criterion is only partially met, mitigate with design restrictions, conservative definitions, external adjudication, negative controls, or quantitative bias analysis, and document residual risk in plain language.
Target-trial emulation. Translate the estimand into the randomized trial you would have run—eligibility, treatment strategies, assignment, time zero, follow-up rules, endpoints, and analysis plan—and then emulate that trial using observational data. A short target-trial table prevents immortal time and time-lag bias, keeps teams aligned on exposure and outcome definitions before code is written, and gives reviewers a quick way to compare your approach to the interventional gold standard.
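For teams that keep the target-trial table as a living, version-controlled artifact, a minimal sketch is shown below; the dataclass fields and every example value are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class TargetTrialSpec:
    """One field per protocol element: the randomized trial we would have run."""
    eligibility: str
    treatment_strategies: str
    assignment: str
    time_zero: str
    follow_up: str
    endpoint: str
    analysis: str

# Hypothetical example; values below are illustrative, not prescriptive.
spec = TargetTrialSpec(
    eligibility="Adults initiating drug A or B, no use of either in prior 365 days",
    treatment_strategies="Initiate A vs. initiate B at time zero; continue per label",
    assignment="Emulated by active-comparator new-user design with PS weighting",
    time_zero="Date of first dispensing of A or B",
    follow_up="From time zero until outcome, death, disenrollment, or 24 months",
    endpoint="Hospitalization for event X (validated claims algorithm, versioned)",
    analysis="IPW-adjusted cumulative incidence; intention-to-treat analogue",
)

# Filed alongside the protocol so the emulation is reviewable before code is written.
print(json.dumps(asdict(spec), indent=2))
```

Keeping this specification next to the protocol lets reviewers compare the emulation to the interventional gold standard at a glance.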
System-of-record clarity and ALCOA++. Observational dossiers persuade only when records are attributable, legible, contemporaneous, original, accurate, complete, consistent, enduring, available, and traceable. Declare authoritative systems for source data (EHR/EMR, registries, claims) and keep harmonized copies with lineage in your analytics platform. Practice five-minute retrieval drills that click from any figure to the table snapshot, the query or job, the raw payload, and the originating record.
Ethics, privacy, and consent. State the legal basis and consent scope for each source; minimize identifiers; tokenize for linkage; and enforce row-level security. For patient-reported outcomes and decentralized capture, document identity assurance, on-device storage policies, and watermarking of exports. Where consent or jurisdiction limits secondary use, restrict analyses or reconsent; acknowledge the constraint in the protocol and specify contingency paths.
Design & Analysis: Confounding Control, Bias Diagnostics, and Reproducibility
Active-comparator, new-user design. The most powerful bias control happens before modeling. Compare initiators of treatment A to initiators of treatment B for the same indication. Align line of therapy, care setting, and calendar time. Declare washouts that exclude prevalent users, and lock windows for exposure, outcomes, and censoring. For devices and diagnostics, anchor to procedure timestamps, acquisition parameters, or analytical validity thresholds rather than “orders.” A minimal cohort-construction sketch follows.
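The sketch below assumes hypothetical dispensing and enrollment tables (`patient_id`, `drug`, `dispense_date`, `obs_start`) and a 365-day washout; real protocols would also handle grace periods, switching, and enrollment gaps.

```python
import pandas as pd

WASHOUT_DAYS = 365  # assumed washout length; set per protocol

def new_user_cohort(rx: pd.DataFrame, enroll: pd.DataFrame,
                    drug_a: str, drug_b: str) -> pd.DataFrame:
    """Active-comparator, new-user cohort sketch.

    Assumed columns: rx[patient_id, drug, dispense_date],
    enroll[patient_id, obs_start]. Time zero is the first dispensing of
    drug A or B; patients must have a fully observed, drug-free washout.
    """
    ab = rx[rx["drug"].isin([drug_a, drug_b])].sort_values(["patient_id", "dispense_date"])
    first = (ab.groupby("patient_id", as_index=False).first()
               .rename(columns={"dispense_date": "time_zero", "drug": "exposure"}))

    # A washout is only interpretable if the patient is observable for the full
    # window before time zero; otherwise "no prior use" may just be missing data.
    cohort = first.merge(enroll, on="patient_id")
    observable = (cohort["time_zero"] - cohort["obs_start"]).dt.days >= WASHOUT_DAYS
    return cohort[observable].reset_index(drop=True)
```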
Confounding strategy. Prespecify covariates that capture disease severity, healthcare utilization, and risk factors. Use propensity score (PS) methods—matching, stratification, or inverse probability weighting—or flexible outcome models; pair them in a doubly robust framework to protect against misspecification. Diagnose balance with standardized mean differences (practical target <0.1) and visualize overlap to confirm positivity. When tails threaten identifiability, prefer overlap or matching weights; trimming without consequence analysis can mask fragility.
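A short sketch of the two core diagnostics named above, computed with an assumed scikit-learn propensity model; the column layout and logistic specification are illustrative.

```python
from typing import Optional
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def standardized_mean_difference(x: pd.Series, treated: pd.Series,
                                 weights: Optional[pd.Series] = None) -> float:
    """SMD for one covariate; |SMD| < 0.1 is the usual practical balance target."""
    w = pd.Series(1.0, index=x.index) if weights is None else weights
    def wmean(v, m): return np.average(v[m], weights=w[m])
    def wvar(v, m):  return np.average((v[m] - wmean(v, m)) ** 2, weights=w[m])
    t = treated.astype(bool); c = ~t
    pooled_sd = np.sqrt((wvar(x, t) + wvar(x, c)) / 2)
    return (wmean(x, t) - wmean(x, c)) / pooled_sd

def ipw_weights(X: pd.DataFrame, treated: pd.Series) -> pd.Series:
    """Inverse probability of treatment weights from a logistic propensity model."""
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    return pd.Series(np.where(treated, 1 / ps, 1 / (1 - ps)), index=X.index)
```

Report pre- and post-adjustment SMDs side by side and inspect propensity overlap before interpreting any effect estimate.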
Time-varying decisions. When treatment switching, adherence, or disease status both predict outcomes and influence future treatment, standard regression will bias effects. Use marginal structural models with stabilized weights or the parametric g-formula to target per-protocol or dynamic strategies. Predefine truncation rules for extreme weights, show weight distributions, and verify that cumulative hazards behave sensibly under the weighted analysis.
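A simplified person-period sketch of stabilized weights, assuming columns `patient_id`, `period`, `treated`, and `prior_treated`, pooled logistic models, and percentile truncation; production code would fit period-specific models and add censoring weights.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def stabilized_weights(panel: pd.DataFrame, covariates: list[str],
                       truncate: tuple[float, float] = (0.01, 0.99)) -> pd.Series:
    """Stabilized inverse-probability-of-treatment weights (sketch).

    `panel` is assumed to be person-period data sorted by patient_id and period.
    Numerator: P(treated | prior treatment); denominator adds time-varying covariates.
    """
    num = LogisticRegression(max_iter=1000).fit(panel[["prior_treated"]], panel["treated"])
    den = LogisticRegression(max_iter=1000).fit(panel[["prior_treated"] + covariates],
                                                panel["treated"])
    p_num = np.where(panel["treated"] == 1,
                     num.predict_proba(panel[["prior_treated"]])[:, 1],
                     num.predict_proba(panel[["prior_treated"]])[:, 0])
    p_den = np.where(panel["treated"] == 1,
                     den.predict_proba(panel[["prior_treated"] + covariates])[:, 1],
                     den.predict_proba(panel[["prior_treated"] + covariates])[:, 0])
    ratio = pd.Series(p_num / p_den, index=panel.index)

    # Cumulative product within patient (rows already in period order),
    # then the prespecified percentile truncation.
    sw = ratio.groupby(panel["patient_id"]).cumprod()
    lo, hi = sw.quantile(truncate[0]), sw.quantile(truncate[1])
    return sw.clip(lower=lo, upper=hi)
```

Plot the weight distribution before and after truncation and record the truncation rule in the SAP.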
Missing data and measurement error. Distinguish missing covariates (multiple imputation with auxiliary variables) from outcome misclassification (validated algorithms, chart-review subsamples, or probabilistic bias analysis). For EHR labs and vitals, normalize units and enforce biologic range checks; for claims outcomes, increase specificity with site-of-service and procedure corroboration. Store code lists and algorithm versions and maintain short change-control notes that explain what changed and why.
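Where a chart-review subsample yields sensitivity and specificity for an outcome algorithm, a simple back-calculation illustrates the direction and size of the misclassification correction; the numbers below are illustrative only.

```python
def corrected_risk(observed_cases: int, n: int, sensitivity: float, specificity: float) -> float:
    """Correct an observed outcome proportion for nondifferential misclassification.

    Standard back-calculation:
        true_prev = (observed_prev + specificity - 1) / (sensitivity + specificity - 1)
    Sensitivity and specificity would come from a validation subsample.
    """
    observed_prev = observed_cases / n
    true_prev = (observed_prev + specificity - 1) / (sensitivity + specificity - 1)
    return min(max(true_prev, 0.0), 1.0)  # clamp to the unit interval

# Illustrative numbers only: 120 flagged cases in 10,000 patients,
# algorithm sensitivity 0.85 and specificity 0.995 from chart review.
print(corrected_risk(120, 10_000, 0.85, 0.995))
```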
Negative controls and quantitative bias analysis. Choose outcomes not plausibly affected by treatment and exposures not plausibly affecting the outcome. Discordant findings flag residual biases. Quantify vulnerability using E-values or tipping-point analyses that specify how strong an unmeasured confounder would need to be to erase the observed effect. In confirmatory settings, treat these as required—not optional—and explain results in plain language alongside the math.
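The E-value itself is a one-line calculation; the helper below follows the published formula and uses illustrative estimates.

```python
import math

def e_value(rr: float) -> float:
    """E-value for a risk ratio (VanderWeele & Ding): the minimum strength of
    association an unmeasured confounder would need with both treatment and
    outcome, on the risk-ratio scale, to explain away the observed estimate."""
    rr = 1 / rr if rr < 1 else rr          # work on the side away from the null
    return rr + math.sqrt(rr * (rr - 1))

# Illustrative only: point estimate and the confidence limit closest to the null.
print(round(e_value(1.8), 2))   # point estimate -> 3.0
print(round(e_value(1.2), 2))   # lower confidence limit -> 1.69
```

Reporting the E-value for both the point estimate and the limit closest to the null keeps the plain-language explanation honest about fragility.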
Heterogeneity and estimands. Prespecify effect modifiers (age bands, renal function, baseline risk) and present absolute risks and risk differences alongside ratios. For competing risks, declare whether the estimand targets cause-specific effects or subdistribution cumulative incidence and align methods accordingly. Label subgroup work as primary or supportive to avoid “spin,” and align payer-relevant cuts with coverage rules.
Reproducibility and sealed cuts. Freeze sealed data cuts and archive manifests capturing inputs, transformations, code hashes, and outputs. Every table footer should reference the cut ID and code hash so reviewers can regenerate results byte-for-byte months later. In distributed networks, include software versions and execution environments in the manifest to preserve cross-site reproducibility.
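A minimal manifest-writer sketch, assuming SHA-256 hashing of input, code, and output files; the file layout and field names are illustrative.

```python
import datetime
import hashlib
import json
import pathlib

def sha256_of(path: pathlib.Path) -> str:
    """Stream-hash a file so large inputs do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(cut_id: str, inputs: list[str], code: list[str],
                   outputs: list[str], out_path: str = "manifest.json") -> dict:
    """Record the hashes a reviewer needs to regenerate a sealed cut byte-for-byte."""
    manifest = {
        "cut_id": cut_id,
        "created_utc": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "inputs":  {p: sha256_of(pathlib.Path(p)) for p in inputs},
        "code":    {p: sha256_of(pathlib.Path(p)) for p in code},
        "outputs": {p: sha256_of(pathlib.Path(p)) for p in outputs},
        # In practice, also capture package versions and the execution environment.
    }
    pathlib.Path(out_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Citing the cut ID and code hash in every table footer then points directly at this file.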
External controls. When randomized controls are infeasible, build external comparators from registries, EHR networks, or literature using weighting/matching, or use matching-adjusted indirect comparison (MAIC) or simulated treatment comparison (STC) when only summary data exist. Diagnose exchangeability with balance metrics and common-support plots. If overlap is weak, avoid over-borrowing; present contextual analyses or cap borrowing with prespecified conflict rules and demonstrate operating characteristics via simulation.
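A method-of-moments sketch in the spirit of MAIC, with the effective sample size as the companion overlap diagnostic; the function names and matrix layout are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def maic_weights(X_ipd: np.ndarray, target_means: np.ndarray) -> np.ndarray:
    """Method-of-moments weights that rebalance individual-patient data (IPD)
    so weighted covariate means match the aggregate comparator means.

    X_ipd: n x p matrix of effect modifiers; target_means: length-p vector.
    """
    X_c = X_ipd - target_means          # center on the aggregate targets
    # Minimizing sum(exp(X_c @ a)) makes the weighted means equal the targets
    # (first-order condition of the convex objective).
    obj = lambda a: np.sum(np.exp(X_c @ a))
    grad = lambda a: X_c.T @ np.exp(X_c @ a)
    res = minimize(obj, np.zeros(X_ipd.shape[1]), jac=grad, method="BFGS")
    w = np.exp(X_c @ res.x)
    return w * len(w) / w.sum()         # rescale so weights sum to n

def effective_sample_size(w: np.ndarray) -> float:
    """Kish effective sample size: a standard diagnostic for weak overlap."""
    return w.sum() ** 2 / np.sum(w ** 2)
```

A sharply reduced effective sample size after weighting is exactly the weak-overlap warning that should trigger the prespecified conflict rules.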
Biostatistical quality gates. Enforce pre-run checks (schema conformity, unit and terminology normalization), run checks (row-count reconciliations, null thresholds on key fields), and post-run checks (reproducibility of primary tables, hash stability). Fail gates loudly with owner assignment and dated follow-ups; silent anomalies are inspection traps. File all gates and outcomes in the eTMF as part of the evidence chain.
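A compact illustration of failing loudly, assuming an example schema and thresholds; real gates would be driven by the data management plan rather than hard-coded constants.

```python
import pandas as pd

REQUIRED_COLUMNS = {"patient_id", "exposure", "time_zero", "outcome"}  # example schema
MAX_NULL_FRACTION = 0.02                                               # example threshold

def run_quality_gates(df: pd.DataFrame, expected_rows: int, owner: str) -> None:
    """Fail loudly: any breach raises with a named owner so it cannot pass silently."""
    failures = []

    missing_cols = REQUIRED_COLUMNS - set(df.columns)
    if missing_cols:
        failures.append(f"schema gate: missing columns {sorted(missing_cols)}")

    if len(df) != expected_rows:
        failures.append(f"row-count gate: got {len(df)}, expected {expected_rows}")

    for col in REQUIRED_COLUMNS & set(df.columns):
        frac = df[col].isna().mean()
        if frac > MAX_NULL_FRACTION:
            failures.append(f"null gate: {col} is {frac:.1%} null (limit {MAX_NULL_FRACTION:.0%})")

    if failures:
        raise RuntimeError(f"Quality gates failed (owner: {owner}): " + "; ".join(failures))
```

The gate report, pass or fail, is the artifact that belongs in the eTMF.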
Dossier Construction: Protocols, SAPs, Tables, and a Readable Evidence Chain
Write observational protocols like interventional protocols. State objectives, the estimand, a design diagram, eligibility, exposure construction, endpoint definitions, follow-up rules, covariate sets, and a directed acyclic graph. Include data-source descriptions (capture processes, coding systems, refresh cadence), linkage rationale, privacy controls, and feasibility counts. Register substantial studies where appropriate and file amendments with numbered “what changed and why” notes and dated approvals.
Statistical analysis plan (SAP). Lock model classes, variable selection, PS specifications, weight truncation thresholds, diagnostics, missing-data methods, and sensitivity analyses before viewing results. For time-to-event outcomes, prespecify cause-specific vs. subdistribution approaches. For repeated measures and PROs, define mixed-model or GEE structures and psychometric scoring. Keep a short “analysis manifest” that lists code hashes, package versions, and environment details to anchor each output.
Tabulation and visualization standards. Provide absolute risks and risk differences in addition to ratios; include numbers-needed-to-treat or harm with interval estimates where meaningful. Use standard shells: population flow; baseline balance (pre/post-adjustment SMDs); exposure persistence; endpoint definitions; main effects with sensitivities side-by-side; and negative-control results. Annotate table footers with data-cut IDs, code hashes, and algorithm versions. For survival outputs, pair hazard ratios with restricted mean survival differences to aid interpretation.
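A small helper for the absolute measures, using unadjusted counts and a Wald interval purely for illustration; dossier tables would report the adjusted analogues from the sealed cut.

```python
import math

def risk_difference_and_nnt(events_t: int, n_t: int, events_c: int, n_c: int,
                            z: float = 1.96) -> dict:
    """Absolute risk difference with a Wald interval and the implied NNT/NNH."""
    p_t, p_c = events_t / n_t, events_c / n_c
    rd = p_t - p_c
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    lo, hi = rd - z * se, rd + z * se
    nnt = float("inf") if rd == 0 else 1 / abs(rd)
    return {"risk_difference": rd, "ci": (lo, hi), "nnt_or_nnh": nnt}

# Illustrative only: 40/1000 events on treatment vs. 60/1000 on comparator.
print(risk_difference_and_nnt(40, 1000, 60, 1000))
```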
Traceability in the TMF. Treat the evidence chain as a first-class artifact. In the eTMF, file: protocol and amendments; SAP and manifests; code lists and algorithms with versions; sealed-cut manifests; balance diagnostics; primary, supportive, and sensitivity tables; negative-control outcomes; and a short retrieval script or screenshots showing five-minute click-through from a result to the underlying record. Store privacy/consent documentation, supplier assessments, and data-sharing agreements alongside.
Global packaging nuances. Terminology varies by region but the core story is the same: fit-for-purpose data and design, transparent confounding control, traceable results, and proportionate risk management. Describe scientific advice sought, explain how local coding practices and care patterns were handled, and clarify transportability when case-mix differs. Keep hyperlinks to public agency resources to one per agency to avoid clutter while signaling alignment.
Data standards and sharing. Harmonize to common terminologies (SNOMED CT, LOINC, RxNorm/ATC, UCUM; ICD-10-CM/PCS, CPT/HCPCS) and keep mapping tables under version control. Where permissible, provide de-identified, analysis-ready extracts or share code to enable external reproduction; if sharing is restricted, publish algorithms and shells so methods can be recreated independently. Document any limits on sharing and their legal basis.
Devices and diagnostics. For devices, emphasize unique device identifiers, model/firmware lineage, procedure context, and image/waveform provenance. For diagnostics, document analytical validity, thresholds, and recalibration plans. In both, ensure outcome ascertainment is anchored to the device or assay being evaluated and that unit semantics survive each transformation.
Engagement, Responses, Inspections, and Governance That Travel Across Regions
Early engagement and scientific advice. Seek dialogue before locking major choices—design, data sources, external comparators, and endpoints. Provide a concise briefing package with the estimand, target-trial table, data-source fitness criteria, confounding plan, bias diagnostics, and proposed sensitivity analyses. Ask explicit questions about decision thresholds, how real-world and trial evidence will be weighed together, and what additional analyses would change minds.
Responding to information requests. Build reusable, short modules that answer common questions: time-zero definition and windows; algorithm definitions with versions; PS diagnostics and overlap plots; negative-control results; sealed-cut manifests; and retrieval-drill evidence. Each response should include a one-sentence conclusion, a pointer to the exact table or figure, and the manifest ID that proves reproducibility. If new analyses are run, label them clearly as supportive and file an amendment with rationale.
Inspection readiness. Train a small “evidence chain” squad that can reproduce a table live within five minutes. Maintain saved views for role changes, exports, and admin actions in each source system; treat audit trails and manifests as tier-1 data. Rehearse adversarial scenarios: a negative-control signal appears; a confounder shows residual imbalance; an exposure algorithm is updated. The team should demonstrate impact assessments and amended conclusions within days with dated approvals.
Risk management and KRIs/QTLs. Monitor early warnings and promote the consequential to limits: mapping error spikes, missingness surges, weak overlap, unstable weights, retrieval failures, or privacy incidents. Example Quality Tolerance Limits: “post-adjustment SMD >0.1 for any prespecified confounder,” “effective sample size <50% of treated cohort after weighting,” “two sealed-cut reproducibility failures in a month,” or “retrieval pass rate <95%.” Crossing a limit triggers containment, a dated corrective plan, and owner assignment.
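Expressed as code, the example limits above become simple, auditable rules; metric names and thresholds are illustrative and belong in the quality plan.

```python
# Example Quality Tolerance Limits from the text, expressed as checkable rules.
QTLS = {
    "max_post_adjustment_smd":       0.10,   # any prespecified confounder
    "min_ess_fraction_of_treated":   0.50,   # effective sample size after weighting
    "max_sealed_cut_repro_failures": 1,      # per month
    "min_retrieval_pass_rate":       0.95,
}

def evaluate_qtls(metrics: dict) -> list[str]:
    """Return the list of breached limits; each breach needs containment,
    a dated corrective plan, and a named owner."""
    breaches = []
    if metrics["post_adjustment_smd"] > QTLS["max_post_adjustment_smd"]:
        breaches.append("post-adjustment SMD above 0.1")
    if metrics["ess_fraction_of_treated"] < QTLS["min_ess_fraction_of_treated"]:
        breaches.append("effective sample size below 50% of treated cohort")
    if metrics["sealed_cut_repro_failures"] > QTLS["max_sealed_cut_repro_failures"]:
        breaches.append("two or more sealed-cut reproducibility failures this month")
    if metrics["retrieval_pass_rate"] < QTLS["min_retrieval_pass_rate"]:
        breaches.append("retrieval pass rate below 95%")
    return breaches
```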
Payers and HTA alignment. Present absolute risks, risk differences, and numbers needed to treat or harm; provide subgroup scenarios that mirror coverage rules (prior-line therapy, comorbidity thresholds). Link budget-impact and cost-effectiveness models to sealed cuts so recalculations reproduce; document price year, perspective, and assumptions about rebates and patient support. Be explicit about generalizability when payer populations differ from the data-generating population.
Vendor and network governance. External data partners and technology vendors become part of your evidence system. Assess suppliers for identity controls, logging, export rights, and change discipline; require time-boxed accounts, immutable audit logs, and restoration drills that include logs and metadata. Map every external identity to an internal owner; stale access is an owned risk with due dates. Rehearse exit paths so data and audit trails remain intact if services change.
Transparency and publication. Register substantial RWE studies where appropriate, publish algorithms (code lists and logic) when legally possible, and report deviations from the SAP with clear rationales. Null and negative findings deserve the same transparency as positive ones. The most convincing dossiers make it trivial to understand how answers were derived and how stable they are under reasonable perturbations; that clarity saves time in scientific advice and inspection.
Bottom line. Submission-grade RWE is a small, disciplined system: a precise decision question, fit-for-purpose data and design, transparent confounding control with diagnostics, sealed cuts and provenance, and packaging that lets reviewers click from any number to the underlying record. Build it once—target-trial tables, algorithms, manifests, diagnostics, retrieval drills—and the same backbone will carry label changes, safety actions, and payer negotiations across regions with confidence.