Published on 15/11/2025
Designing Defensible Sample Sizes and Power for Regulatory-Grade Clinical Trials
Why Sample Size Justification Matters: Regulatory Expectations, Estimands, and Risk
Sample size and power are not just arithmetic—they are a formal commitment to detect clinically relevant effects while safeguarding participants and resources. A credible justification shows that the trial can answer its estimand with pre-specified error rates, that Type I error is controlled across all planned analyses, and that uncertainty in assumptions has been explored transparently. Major authorities—including the U.S. FDA, the EMA, Japan's PMDA, and Australia's TGA—expect this reasoning to be explicit and consistent with ICH E9 and the estimand framework of the ICH E9(R1) addendum.
Anchor to the estimand. Power calculations must reflect the treatment effect you intend to estimate, the intercurrent events you will handle (treatment discontinuation, rescue medication, death), the population, and the endpoint summary measure. If the primary estimand is the treatment policy strategy with a continuous endpoint at Week 24, then assumptions (means, variance, correlation to baseline, missingness pattern) must match that framing; if a composite strategy censors after rescue, your event rates and hazard assumptions should reflect that composite definition.
Error rates are policy choices. Most confirmatory trials target two-sided α=0.05 (or one-sided 0.025) and power of 80–90%. These are not moral absolutes; they are negotiated risk tolerances. For life-threatening indications with limited feasibility, 80% power may be defensible; for large, chronic indications, 90% power is common. Whatever you choose, justify it with clinical context and feasibility.
Clinically meaningful effect vs statistical detectability. The design should be powered for a clinically important effect (or a prespecified noninferiority margin) informed by prior evidence and patient/clinician value. Effects set by convenience invite regulatory challenge; margins set too tight inflate sample size and expose more participants than necessary.
Transparency about uncertainty. No single set of inputs is “true.” Show how power varies across plausible ranges for the effect, variance, event rates, dropout, and adherence. Provide tornado charts or tables for decision-makers and include binding contingency plans (e.g., blinded sample size re-estimation) if assumptions drift.
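As a minimal illustration, a few lines of code can generate such a table; the sketch below assumes a two-arm comparison analyzed with a simple two-sided test, and every numeric input is a placeholder to be replaced with study-specific values.

```python
# Illustrative power-sensitivity grid for a two-arm, two-sided comparison.
# All numeric inputs (delta, SD range, n per arm) are hypothetical placeholders.
from scipy.stats import norm

def power_two_sample(delta, sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test, 1:1 allocation."""
    se = sd * (2 / n_per_arm) ** 0.5
    return norm.cdf(abs(delta) / se - norm.ppf(1 - alpha / 2))

deltas = (3, 4, 5)
print("SD   " + "  ".join(f"Δ={d}" for d in deltas))
for sd in (8, 9, 10, 11, 12):
    row = "  ".join(f"{power_two_sample(d, sd, 100):.2f}" for d in deltas)
    print(f"{sd:>2}   " + row)
```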
Operational feasibility. Recruitment rates, site activation curves, screen-fail proportions, and adherence to visit schedules interact with statistical power. A plan that requires unrealistic accrual or retention is de facto underpowered even if the arithmetic is correct.
Documentation set. Place the sample size in the Protocol (objectives, endpoints, effect and variance assumptions, margins, allocation ratio, multiplicity strategy, timing of interim looks), with computational detail and sensitivity analyses in the SAP or an Appendix. Capture all configuration and program versions under a computerized system assurance framework acceptable to FDA/EMA/PMDA/TGA.
From Assumptions to N: Models, Inputs, and Endpoint-Specific Considerations
Define the model first. Power depends on the statistical test or model that will analyze the primary endpoint. Choose the analysis consistent with the estimand and then select or derive the sample size formula or simulation approach that matches that analysis.
- Continuous endpoints (e.g., change from baseline): two-sample t-test or ANCOVA. Key inputs: mean difference (Δ), standard deviation (pooled or arm-specific), correlation with baseline (for ANCOVA), allocation ratio, and sidedness. ANCOVA often reduces variance; document the assumed correlation and its source (see the computational sketch after this list).
- Binary endpoints (proportion responders): two-proportion test, logistic model. Inputs: control response rate (pC), treatment rate (pT) or risk difference/ratio/odds ratio, continuity corrections (if any), allocation ratio, and possible stratification.
- Time-to-event endpoints (overall survival, PFS): log-rank/Cox model. Power is a function of the number of events, not only enrolled participants. Inputs: hazard ratio target, baseline hazard (or median), accrual duration, additional follow-up, and censoring pattern. State whether proportional hazards is assumed; if dubious, consider weighted log-rank or alternative summaries and simulate.
- Count endpoints (exacerbations): negative binomial or Poisson with over-dispersion. Inputs: baseline rate, dispersion parameter, exposure time variability.
- Ordinal/Responder scales: proportional odds or cumulative logit. Inputs: category probabilities and odds ratio; verify the proportional odds assumption or plan for partial proportional models and simulate.
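A minimal sketch of the corresponding normal-approximation sample-size formulas for the first three endpoint types follows; the numeric inputs in the example calls are hypothetical, and for a confirmatory trial the calculation should be reproduced with validated software or simulation.

```python
# Normal-approximation sample-size sketches matching the endpoint types above.
# Inputs in the example calls are assumptions, not recommendations.
import math
from scipy.stats import norm

def n_continuous(delta, sd, power=0.9, alpha=0.05, rho=0.0, ratio=1.0):
    """n for arms 1 and 2 (n2 = ratio * n1), two-sided test of a mean
    difference; rho is the baseline correlation for an ANCOVA adjustment."""
    sd_eff = sd * math.sqrt(1 - rho ** 2)          # ANCOVA variance reduction
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    n1 = (1 + 1 / ratio) * (z * sd_eff / delta) ** 2
    return math.ceil(n1), math.ceil(n1 * ratio)

def n_binary(p_c, p_t, power=0.9, alpha=0.05):
    """Per-group n for a two-proportion z-test, 1:1 allocation."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    var = p_c * (1 - p_c) + p_t * (1 - p_t)
    return math.ceil(z ** 2 * var / (p_t - p_c) ** 2)

def events_logrank(hr, power=0.9, alpha=0.05):
    """Total events (Schoenfeld), 1:1 allocation, proportional hazards."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(4 * z ** 2 / math.log(hr) ** 2)

print(n_continuous(delta=4, sd=10, rho=0.5))  # matches the worked template below
print(n_binary(p_c=0.30, p_t=0.45))           # hypothetical response rates
print(events_logrank(hr=0.75))                # hypothetical target hazard ratio
```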
Superiority vs noninferiority vs equivalence. For noninferiority (NI), specify the margin on the effect scale used for decision making (risk difference, log-HR, mean difference). Margins should be justified clinically and statistically (constancy and assay sensitivity rationale). Equivalence requires two one-sided tests (TOST) and typically larger N than NI.
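For intuition, a noninferiority calculation on a continuous endpoint (assuming the true difference is zero) looks like the sketch below; the margin and SD are illustrative assumptions.

```python
# Sketch: per-arm n for continuous noninferiority at one-sided alpha,
# assuming the true difference is zero; margin and SD are illustrative.
import math
from scipy.stats import norm

def n_noninferiority(margin, sd, power=0.9, alpha_one_sided=0.025):
    z = norm.ppf(1 - alpha_one_sided) + norm.ppf(power)
    return math.ceil(2 * (z * sd / margin) ** 2)

print(n_noninferiority(margin=2.0, sd=10))  # hypothetical 2-unit margin
```

For equivalence, TOST must win both one-sided tests; at a true difference of zero this replaces the power quantile z(1−β) with z(1−β/2), which is why equivalence typically needs a larger N than NI, as noted above.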
Variance and nuisance parameters. Variance (continuous), control event rate (binary), baseline hazard (time-to-event), and dispersion (count) are often the largest sources of error. Use triangulation: pilot data, literature, meta-analysis, or historical arms. If credible bounds are wide, propose a blinded variance review or blinded sample size re-estimation (BSSR) to adjust N without inflating Type I error.
Allocation ratio and stratification. Unequal randomization (e.g., 2:1) can improve recruitment and safety information at modest power cost, but inflates total N for fixed power. Stratification (e.g., region, baseline severity) may improve balance and precision; ensure the sample size method matches the planned stratified analysis or is robust to moderate imbalance.
Dropout, nonadherence, and missing data. Inflate N for anticipated missing data mechanisms (MCAR/MAR/MNAR) only when missingness reduces information for the estimand. Distinguish between intercurrent events (handled by the estimand strategy) and simple missingness. For time-to-event, incorporate dropout as additional censoring in event-count and follow-up assumptions.
Multiplicity and families of hypotheses. Co-primary endpoints, key secondary endpoints with strong claims, and subgroup-driven labeling require Type I error control. Hierarchical gatekeeping (fixed-sequence), Holm/Hochberg, or graphical alpha-recycling approaches affect required N; so do co-primary definitions (both must be positive vs either). Reflect the chosen strategy in your power calculations—often via simulation.
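Closed-form power for such strategies rarely exists, so simulation is the workhorse. The sketch below, with assumed standardized effects and correlation, estimates FWER under the global null and power under the alternative for a two-endpoint Holm procedure.

```python
# Simulation sketch: FWER and power for two correlated endpoints under a
# Holm procedure. Effects, n, and correlation are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(20240101)   # record the seed for reproducibility

def holm_two(p, alpha=0.05):
    """Holm step-down for exactly two hypotheses; p has shape (reps, 2)."""
    pmin, pmax = p.min(axis=1), p.max(axis=1)
    rej_first = pmin < alpha / 2                 # smaller p tested at alpha/2
    rej_both = rej_first & (pmax < alpha)        # then larger p at alpha
    return rej_first, rej_both

def run(delta, n=100, rho=0.5, reps=100_000):
    mean = np.asarray(delta) / np.sqrt(2 / n)    # z-statistic means
    z = rng.multivariate_normal(mean, [[1, rho], [rho, 1]], size=reps)
    rej_any, rej_both = holm_two(2 * norm.sf(np.abs(z)))
    return rej_any.mean(), rej_both.mean()

fwer, _ = run(delta=(0.0, 0.0))              # global null -> FWER (≤ 0.05)
p_any, p_both = run(delta=(0.4, 0.3))        # hypothetical standardized effects
print(f"FWER≈{fwer:.3f}  P(≥1 claim)≈{p_any:.3f}  P(both)≈{p_both:.3f}")
```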
Special settings.
- Cluster-randomized trials: adjust for intraclass correlation (ICC). Effective N is reduced by the design effect DE = 1 + (m − 1) × ICC, where m is the average cluster size. Account for cluster size variability (see the numeric sketch after this list).
- Crossover designs: within-subject correlation reduces variance; consider period and carryover effects; include washout adequacy assumptions.
- Enrichment designs: if eligibility selects biomarker-positive patients, power for the overall and subset estimands may require different Ns and multiplicity plans.
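A numeric sketch of the cluster-size adjustment follows, including a common correction for variable cluster sizes; the coefficient-of-variation term and all inputs are assumptions to verify against your design references.

```python
# Numeric sketch of the cluster design effect; m, ICC, cv, and the
# individually randomized n are illustrative assumptions.
import math

def cluster_n(n_individual, m, icc, cv=0.0):
    """Inflate an individually randomized n by the design effect; cv is the
    coefficient of variation of cluster sizes (0 for equal clusters)."""
    de = 1 + ((1 + cv ** 2) * m - 1) * icc
    return math.ceil(n_individual * de)

print(cluster_n(200, m=20, icc=0.05))          # DE = 1.95 -> 390
print(cluster_n(200, m=20, icc=0.05, cv=0.4))  # variable cluster sizes cost more
```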
Worked template (continuous, superiority, ANCOVA). Inputs: expected mean difference = 4 units; SD = 10; baseline-to-post correlation = 0.50; allocation 1:1; two-sided α=0.05; power 90%. Effective SD after ANCOVA is SD×√(1−ρ²) ≈ 8.66. The standard formula yields ≈ 99 per group; inflate by 15% for attrition → ≈ 114 per group. Document your source for SD and ρ, and provide a sensitivity table (e.g., SD 8–12; ρ 0.3–0.7).
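If statsmodels is available, the template can be cross-checked against its t-test power solver, which returns a value near the normal-approximation answer (slightly larger, since it accounts for the t distribution):

```python
# Cross-check of the ANCOVA worked template (assumes statsmodels is installed).
from statsmodels.stats.power import TTestIndPower

effect_size = 4 / (10 * (1 - 0.5 ** 2) ** 0.5)   # Δ / effective SD ≈ 0.462
n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.90, alternative="two-sided"
)
print(n_per_group)   # ≈ 99–100 per group before attrition inflation
```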
Design Nuances That Change N: Interims, Adaptive Options, and Real-World Complexities
Group-sequential designs (GSD). Interim analyses can stop early for efficacy or futility while preserving the overall Type I error. Spending functions (e.g., O’Brien-Fleming, Pocock) or alpha-spending with information-based timing define boundaries. GSDs typically increase maximum N modestly (5–15%) but can reduce expected sample size if early signals are strong. Your sample size section should declare the number and timing of looks, the spending function, and the independent data monitoring process.
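Boundary calculations belong in validated software, but a short simulation can verify the operating characteristics you declare. The sketch below checks the overall Type I error of a two-look design using the classical O'Brien-Fleming boundaries (approximately 2.797 and 1.977 for equally spaced looks at two-sided α=0.05).

```python
# Simulation sketch: overall Type I error of a two-look group-sequential test
# with classical O'Brien-Fleming boundaries. Uses the canonical joint
# distribution of sequential z-statistics at 50% and 100% information.
import numpy as np

rng = np.random.default_rng(7)         # fixed seed for reproducibility
reps = 500_000

z1 = rng.standard_normal(reps)                                     # 50% info
z2 = np.sqrt(0.5) * z1 + np.sqrt(0.5) * rng.standard_normal(reps)  # final

stop_early = np.abs(z1) > 2.797                      # OBF boundary, look 1
reject_final = ~stop_early & (np.abs(z2) > 1.977)    # OBF boundary, look 2
print((stop_early | reject_final).mean())            # ≈ 0.05 overall
```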
Blinded and unblinded sample size re-estimation (SSR).
- Blinded SSR updates nuisance parameters (e.g., variance) without peeking at treatment effects; Type I error is maintained. Pre-specify the review window and the recalculation rule (e.g., cap at +20% N); a recalculation sketch follows this list.
- Unblinded SSR (promising zone, conditional power rules) can be valid with combination tests or alpha-spending adjustments; requires independent oversight and strict access segregation. Document the algorithm and operating characteristics via simulation.
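A minimal sketch of a blinded (lumped-variance) recalculation rule with a +20% cap, using illustrative inputs:

```python
# Sketch of a blinded SSR rule: re-estimate the lumped SD without unblinding,
# recompute n, and cap the increase at +20%. All inputs are illustrative.
import math
import numpy as np
from scipy.stats import norm

def bssr(blinded_values, delta, n_planned, power=0.9, alpha=0.05, cap=1.2):
    # The lumped SD slightly overstates the within-group SD when a true
    # effect exists, which makes the rule mildly conservative.
    sd_blinded = np.std(blinded_values, ddof=1)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    n_new = math.ceil(2 * (z * sd_blinded / delta) ** 2)
    return min(max(n_new, n_planned), math.ceil(cap * n_planned))

rng = np.random.default_rng(11)
interim = rng.normal(0, 11.5, size=120)       # blinded pooled interim data
print(bssr(interim, delta=4, n_planned=100))  # capped at 120 per group
```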
Adaptive enrichment and population selection. If biology suggests a subgroup may benefit more, pre-specify a selection rule (e.g., biomarker threshold) and a multiplicity-controlled testing strategy (e.g., closed testing with alpha recycling). Sample size must cover both the overall and selected population paths; simulations are usually required to demonstrate power and FWER control.
Noninferiority margin determination. Margins reflect preserved effect relative to historical evidence and clinical judgment. Show the translation between effect metrics (e.g., risk difference ↔ risk ratio ↔ log-HR) and justify with prior trials. Larger margins reduce N but weaken inference; be explicit about trade-offs and include a sensitivity analysis if constancy is questionable.
Assumption drift and operational variability. Event rates can be lower than expected; recruitment may be slower; adherence may fall. Use predictive power or conditional power to inform governance decisions at interims. For time-to-event trials, update projections using blinded event-accrual monitoring, not the total number randomized, to avoid false reassurance.
Unequal follow-up, staggered entry, and calendar-time effects. For survival endpoints, power depends on total events, not just N. Accrual patterns (front-loaded vs back-loaded), loss to follow-up, and competing risks change event yield. Model accrual with piecewise constant rates and consider administrative censoring. If non-PH is plausible (e.g., immuno-oncology), simulate with alternative hazard patterns (delayed effect, crossing hazards).
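As a sketch of event projection, assume exponential event and dropout hazards with uniform accrual; integrating over entry times gives the expected event count at a calendar read-out time. All rates and durations below are illustrative assumptions.

```python
# Sketch: expected events at calendar time T under uniform accrual on [0, a],
# exponential event hazard lam, and exponential dropout hazard mu.
# All rates and durations are illustrative assumptions.
import numpy as np

def expected_events(n, lam, mu, accrual, T, grid=10_000):
    entry = np.linspace(0, accrual, grid)     # uniform entry times
    follow = np.clip(T - entry, 0, None)      # administrative censoring
    # P(event occurs before dropout and before the data cut):
    p_event = (lam / (lam + mu)) * (1 - np.exp(-(lam + mu) * follow))
    return n * p_event.mean()

# Hypothetical projection: 500 patients, 24-month median time to event,
# 1%/month dropout, 18-month accrual, read-out at month 30.
print(round(expected_events(500, np.log(2) / 24, 0.01, 18, 30)))
```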
Multiplicity across families of endpoints. When co-primaries or key secondaries drive labeling, select a strategy (fixed-sequence, gatekeeping, graphical alpha recycling) and integrate it into both analysis and power justification. Simulation can demonstrate that the probability of at least one false claim across families remains ≤α while achieving ≥1−β power for the clinically prioritized claims.
Pediatric and rare disease contexts. Feasible sample sizes may be small. Consider borrowing (Bayesian hierarchical models), response-adaptive randomization with caution, or external controls when ethical and scientifically justified. Even where Bayesian primary analyses are proposed, regulators still expect operating characteristics (frequentist Type I error and power) over the prior space; simulate accordingly.
Decentralized and pragmatic trials. Variance inflation from heterogeneous settings, device differences, or adherence variability should be anticipated. Cluster or stepped-wedge designs require ICC estimates and design-effect adjustments. Pragmatic endpoints drawn from EHR or claims data may demand misclassification sensitivity analyses that influence effective power.
Evidence Package & Quality Controls: Simulations, Sensitivities, and Inspection-Ready Traceability
Make it reproducible. Provide all parameters, formulas, and software details (package/version, random seeds) so calculations can be replicated. Preserve program code and outputs under change control. Capture point-in-time configuration snapshots at protocol finalization, SAP sign-off, and any amendment that touches endpoints, margins, or alpha spending.
Show the operating characteristics. Tables and figures should include: power curves across effect sizes; sensitivity to variance/event rates; expected vs maximum sample size for GSD/SSR; false-positive control in multiplicity settings; and, for survival designs, event-accrual projections with uncertainty bands. Where analytic formulas are weak (non-PH, complex multiplicity, adaptive enrichment), supply simulation-based operating characteristics.
Assumptions register. Maintain a one-page list of key inputs with provenance: control rate, standard deviation, hazard rate, dispersion, ICC, dropout, adherence, noninferiority margin, multiplicity plan, accrual curve. For each, cite data sources (pilot, meta-analysis, historical registry) and the bounds explored in sensitivity analyses.
Blinding and segregation of roles. If SSR or adaptive features are used, describe who is unblinded (statistician independent of study team, DSMB) and how Type I error protection is enforced (alpha spending, combination tests). Access controls and audit trails should permit an inspector to reconstruct who saw what, when, and why.
Common pitfalls—and durable fixes.
- Underestimated variance or control rates → plan blinded nuisance-parameter checks; cap SSR increases; publish sensitivity ranges.
- Ambiguous estimands → align endpoint definition, intercurrent event handling, and analysis method; avoid powering for one while analyzing another.
- Overlooked multiplicity → integrate gatekeeping or alpha-recycling into power justification; simulate where closed forms are inadequate.
- Non-PH ignored → examine alternative hazard shapes; consider weighted log-rank or milestone analyses; simulate to verify power.
- Optimistic accrual → stress-test with slower site activation and higher screen-fail; tie governance triggers to blinded event accrual.
- Dropout/adherence hand-waving → quantify realistic rates; model their impact on information; distinguish missingness from intercurrent events.
- Unverifiable calculations → lock code, versions, and seeds; archive scripts and outputs; file printed summaries in the TMF.
What to place where.
- Protocol: objectives, estimand(s), primary analysis model, target effect or margin, alpha/power, allocation, interims, and high-level multiplicity strategy.
- SAP: detailed formulas or simulation design; parameter sources; sensitivity ranges; interim/SSR algorithms; spending functions; decision rules.
- Programming/validation: annotated code; double-programming or independent verification; logs; seeds; version snapshots.
- TMF: configuration manifests, audit trails, DSMB charter (if applicable), and evidence of approvals and amendments.
Quality metrics you can track.
- Gap between planned and realized variance/event rate (blinded estimates), with pre-defined action thresholds.
- Recruitment and event-accrual forecast error vs observed, updated monthly.
- Probability of early stop at each interim under null and alternative (from simulations), compared with DSMB outcomes.
- Coverage of sensitivity ranges in governance discussions (how often did we revisit assumptions?).
- Reproducibility check pass rate: can a second statistician regenerate the sample size and OC tables from the archive?
Checklist (study-ready sample size justification).
- Estimand aligned to endpoint and analysis model; intercurrent event strategies explicit.
- Assumptions documented with sources; sensitivity tables/plots provided for key nuisance parameters.
- Multiplicity plan integrated into the power assessment (co-primary, key secondary, subgroups).
- Interim/SSR/adaptive elements pre-specified with Type I error protection and role segregation.
- Dropout/adherence and accrual/event patterns incorporated; inflation or event targets justified.
- Programs and versions archived; simulation seeds recorded; configuration snapshots filed.
- Governance thresholds defined (e.g., blinded variance too high, event accrual too low) with pre-planned responses.
Bottom line. A sample size is a risk contract. When you tie it to a well-defined estimand, choose analysis-consistent methods, examine uncertainty with sensitivity and simulation, and preserve a clean evidence trail, your justification will read as familiar and trustworthy to assessors at the FDA, EMA, PMDA, and TGA, consistent with ICH guidance and the WHO public-health mission.