Noninferiority and equivalence trials present particular difficulties in design, conduct, analysis, and interpretation.19
Hypotheses
In a superiority trial the null hypothesis is that treatments are equally effective and the alternative hypothesis is that they differ. A type I error is falsely finding a treatment effect when there is none, and a type II error is failing to detect a treatment effect when truly one exists. In noninferiority trials, the null and alternative hypotheses are reversed; a type I error is the erroneous acceptance of an inferior new treatment, whereas a type II error is the erroneous rejection of a truly noninferior treatment.
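The reversal of hypotheses can be written out explicitly. The display below is a sketch under assumed notation that does not appear in the text: δ is the true treatment difference (new minus reference) for an adverse outcome, so that larger values are worse for the new treatment, and Δ > 0 is the prestated margin.

```latex
\begin{align*}
\text{Superiority:}    \quad & H_0\colon \delta = 0            & \text{vs} \quad & H_1\colon \delta \neq 0 \\
\text{Noninferiority:} \quad & H_0\colon \delta \geq \Delta    & \text{vs} \quad & H_1\colon \delta < \Delta \\
\text{Equivalence:}    \quad & H_0\colon |\delta| \geq \Delta  & \text{vs} \quad & H_1\colon |\delta| < \Delta
\end{align*}
```

Under this convention, rejecting the noninferiority null hypothesis when δ is in fact at least Δ is the type I error described above, ie, accepting an inferior new treatment.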
Design
A noninferiority or equivalence trial requires that the reference treatment's efficacy is established20-21 or that the treatment is in such widespread use that a placebo or untreated control group would be deemed unethical.
Both participants and outcome measures in a noninferiority or equivalence trial should be similar to those in the trial(s) that established the efficacy of the reference treatment.
The required sample size is calculated using the confidence interval (CI) approach, considering where the CI for the treatment effect lies with respect to both the margin of noninferiority Δ and a null effect. Sample size depends on the level of confidence chosen, the risk of type II error (or desired power), and Δ.22-23 A prestated margin of noninferiority Δ can be specified as a difference in means or proportions or the logarithm of an odds ratio, risk ratio, or hazard ratio. A prestated margin of noninferiority is often chosen as the smallest value that would be a clinically important effect.24 If relevant, Δ should be smaller than the "clinically relevant" effect chosen to investigate superiority of reference treatment against placebo.25-26 For example, if mortality with treatment A is better than placebo by 10% (absolute difference), a new treatment B might need to be at least 5% better than placebo (and thus no more than 5% worse than A). The required size of noninferiority trials is therefore usually larger than that for superiority trials.25 Unfortunately, sample sizes for noninferiority and equivalence trials are often too small.19, 24 Given several previous trials, the effect of the reference treatment can be estimated from a meta-analysis. There are several techniques to determine Δ,27 its magnitude being influenced by several factors, eg, efficacy, safety, cost, acceptability, and adherence.28-29
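To illustrate why a smaller margin implies a larger trial, the following is a minimal sketch using the usual normal approximation for a difference in proportions. The mortality rates (30% with placebo, 20% with treatment A) are hypothetical values chosen to match the 10% absolute difference in the example above, and the function names are illustrative assumptions, not taken from any cited method.

```python
from math import ceil
from statistics import NormalDist


def z(p):
    """Standard normal quantile."""
    return NormalDist().inv_cdf(p)


def n_superiority(p_control, p_treated, alpha_two_sided=0.05, power=0.90):
    """Patients per group to detect p_control - p_treated (normal approximation)."""
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    effect = p_control - p_treated
    return ceil((z(1 - alpha_two_sided / 2) + z(power)) ** 2 * variance / effect ** 2)


def n_noninferiority(p_event, margin, alpha_one_sided=0.025, power=0.90):
    """Patients per group to show noninferiority within `margin`,
    assuming both treatments truly share the event rate p_event."""
    variance = 2 * p_event * (1 - p_event)
    return ceil((z(1 - alpha_one_sided) + z(power)) ** 2 * variance / margin ** 2)


# Hypothetical rates: placebo 30%, reference treatment A 20%, margin 5%.
print(n_superiority(0.30, 0.20))     # A vs placebo, 10% absolute difference
print(n_noninferiority(0.20, 0.05))  # B vs A, margin of 5%
```

Under these assumptions, halving the difference of interest from 10% to a 5% margin roughly quadruples the variance-adjusted requirement, which is the sense in which noninferiority trials usually need to be larger.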
Conduct
Trial conduct should closely match any trial that demonstrated efficacy of the reference treatment, provided they were of high quality.20 One should avoid features that might dilute true differences between treatments, thereby enhancing the risk of erroneously concluding noninferiority,30-31 eg, poor adherence, dropouts, recruitment of patients unlikely to respond, and treatment crossovers.
Analysis
Although a modified hypothesis testing framework exists,32-33 a more informative CI approach is preferred in the design, analysis, and reporting of noninferiority and equivalence trials.34
For superiority trials, intention-to-treat (ITT) analysis (analyzing all patients within their randomized groups, regardless of whether they completed allocated treatment) is recommended.3 Intention-to-treat analysis often leads to smaller observed treatment effects than if all patients had adhered to treatment. In noninferiority trials, ITT analysis will often increase the risk of falsely claiming noninferiority (type I error),25 although not always.35 In practice, a strict ITT analysis is often not possible, and the term "full analysis set" is used for the set of patients whose follow-up is "as complete . . . and . . . close as possible" to ITT.36
Alternative analyses that exclude patients not taking allocated treatment or otherwise not protocol-adherent could bias the trial in either direction.37 The terms on-treatment or per-protocol analysis are often used but may be inadequately defined. Potentially biased non-ITT analysis is less desirable than ITT in superiority trials but may still provide some insight. In noninferiority and equivalence trials, non-ITT analyses might be desirable as a protection from ITT's increase of type I error risk (falsely concluding noninferiority).36 There is greater confidence in results when the conclusions of the ITT and non-ITT analyses are consistent.
Subgroup analysis requires the same caveats in noninferiority trials as it requires in superiority trials. Interim analyses in noninferiority trials have some differences in rationale from superiority trials. If noninferiority is established before the trial is completed, there may be no ethical requirement to stop early because of lack of efficacy.19 However, other advantages (adverse effects, cost) could justify stopping the trial, to expedite availability of the new treatment. If a treatment is clearly inferior, then stopping the trial (or a particular trial arm) is ethically justified.38-40 Stopping rules might be asymmetric, a trial being allowed to continue longer if the new treatment appears superior,41 although this result is unlikely.19
Interpretation
Interpreting a noninferiority trial's results depends on where the CI for the treatment effect lies relative to both the margin of noninferiority Δ and a null effect. The observed treatment effect is not by itself sufficiently informative. With 2-sided equivalence the interpretation is analogous, but both margins −Δ and +Δ need considering, and claiming equivalence requires the CI to lie wholly between −Δ and +Δ.
Many noninferiority trials based their interpretation on the upper limit of a 1-sided 97.5% CI, which is the same as the upper limit of a 2-sided 95% CI. Although both 1-sided and 2-sided CIs allow for inferences about noninferiority, we suggest that 2-sided CIs are appropriate in most noninferiority trials.29 If a 1-sided 5% significance level is deemed acceptable for the noninferiority hypothesis test42 (a decision open to question), a 90% 2-sided CI could then be used. The Figure interprets several possible scenarios with 2-sided CIs for a noninferiority trial.
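The correspondence between 1-sided and 2-sided limits can be checked numerically. The sketch below assumes a normally distributed estimate; the point estimate and standard error are hypothetical values, and `upper_limit` is an illustrative helper rather than an established routine.

```python
from statistics import NormalDist


def upper_limit(estimate, se, level, sides):
    """Upper confidence limit of a normal-theory CI.

    `level` is the coverage (eg, 0.975) and `sides` is 1 or 2.
    """
    tail = (1 - level) if sides == 1 else (1 - level) / 2
    return estimate + NormalDist().inv_cdf(1 - tail) * se


# Hypothetical difference in event rates (new minus reference) and its SE.
estimate, se = 0.010, 0.012

one_sided_975 = upper_limit(estimate, se, 0.975, sides=1)
two_sided_95 = upper_limit(estimate, se, 0.95, sides=2)
one_sided_95 = upper_limit(estimate, se, 0.95, sides=1)
two_sided_90 = upper_limit(estimate, se, 0.90, sides=2)

print(one_sided_975, two_sided_95)  # same upper limit
print(one_sided_95, two_sided_90)   # same, and lower than the pair above
```

The second pair gives a lower upper limit, which is why accepting a 1-sided 5% significance level makes it easier to conclude noninferiority for the same data.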
Figure. Possible Scenarios of Observed Treatment Differences for Adverse Outcomes (Harms) in Noninferiority Trials
Error bars indicate 2-sided 95% confidence intervals (CIs). Tinted area indicates zone of inferiority. A, If the CI lies wholly to the left of zero, the new treatment is superior. B and C, If the CI lies to the left of the noninferiority margin Δ and includes zero, the new treatment is noninferior but not shown to be superior. D, If the CI lies wholly to the left of Δ and wholly to the right of zero, the new treatment is noninferior in the sense already defined, but it is also inferior in the sense that a null treatment difference is excluded. This puzzling case is rare, since it requires a very large sample size. It can also result from having too wide a noninferiority margin. E and F, If the CI includes Δ and zero, the difference is nonsignificant but the result regarding noninferiority is inconclusive. G, If the CI includes Δ and is wholly to the right of zero, the difference is statistically significant but the result is inconclusive regarding possible inferiority of magnitude Δ or worse. H, If the CI is wholly above Δ, the new treatment is inferior.22, 43
*This CI indicates noninferiority in the sense that it does not include Δ, but the new treatment is significantly worse than the standard. Such a result is unlikely because it would require a very large sample size.
This CI is inconclusive in that it is still plausible that the true treatment difference is less than Δ, but the new treatment is significantly worse than the standard.
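The scenarios in the Figure can be expressed as a simple decision rule. The following sketch assumes the treatment difference is new minus reference for an adverse outcome, so that values to the right of zero favor the reference treatment; the function name `classify_ci`, the returned labels, and the handling of CI limits that fall exactly on zero or on the margin are illustrative assumptions.

```python
def classify_ci(lower, upper, margin):
    """Map a 2-sided CI for (new - reference) onto the Figure's scenarios.

    `margin` is the prestated noninferiority margin (must be positive).
    """
    if margin <= 0 or lower > upper:
        raise ValueError("need margin > 0 and lower <= upper")
    if upper < 0:
        return "A: superior"
    if upper < margin:
        if lower <= 0:
            return "B/C: noninferior, superiority not shown"
        return "D: noninferior, but null difference excluded"
    if lower <= 0:
        return "E/F: nonsignificant difference, noninferiority inconclusive"
    if lower <= margin:
        return "G: significantly worse, inferiority of margin or more inconclusive"
    return "H: inferior"


# Hypothetical CIs on an absolute scale with a margin of 0.03.
for ci in [(-0.04, -0.01), (-0.01, 0.02), (0.005, 0.025),
           (-0.01, 0.04), (0.01, 0.05), (0.04, 0.07)]:
    print(ci, classify_ci(*ci, margin=0.03))
```

The rule uses only the CI limits, the margin, and zero; the point estimate never enters, which reflects the statement that the observed treatment effect is not by itself sufficiently informative.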
Once noninferiority is evident, it is acceptable to then assess whether the new treatment appears superior to the reference treatment, using an appropriate test or CI (ie, not just the point estimate), preferably defined a priori and with an ITT analysis.22, 28, 43
It is inappropriate to claim noninferiority post hoc from a superiority trial unless clearly related to a predefined margin of equivalence. That is, both superiority and noninferiority hypotheses need explicit specification in the trial protocol.44 It is, however, always reasonable to interpret a CI as excluding an effect of a particular prestated size.45
Having demonstrated noninferiority against reference treatment, some authors then make claims for efficacy of a new treatment relative to placebo by also using evidence from earlier trials of reference treatment vs placebo.46 Such inferences assume assay constancy, ie, current and earlier trials are identical in all relevant aspects,20 eg, participants, outcomes definition, and use of standard therapy. Regarding patient populations, for example, this implies no differences in the effect of treatment across subgroups or similar distribution of relevant subgroups. In the absence of assay constancy, an adjustment method has been proposed.27 Since assay constancy is inevitably questionable, any claims regarding efficacy of new treatment relative to placebo require cautious interpretation.
How Common Are Noninferiority and Equivalence Trials?
We build on the work of McAlister and Sackett37 in modifying the CONSORT checklist2-3 (Table), especially items 1 to 7, 12, 16, 17, and 20. New text is shown in italics. For each modification, we include 1 or more examples of good reporting (and further elaboration where appropriate). In some examples, we have added text in brackets to explain the context. We mainly concentrate on noninferiority trials but make some reference to equivalence trials, which are much less common.
Table. Checklist of Items for Reporting Noninferiority or Equivalence Trials (Additions or Modifications to the CONSORT Checklist are Shown in Italics)
Title and Abstract
Title and Abstract: Item 1. How participants were allocated to interventions (eg, random allocation, randomized, or randomly assigned), specifying that the trial is a noninferiority or equivalence trial.
Title. "Oral Pristinamycin versus Standard Penicillin Regimen to Treat Erysipelas in Adults: Randomised, Noninferiority, Open Trial"54
Abstract. "Design: Multicentre, parallel group, open labelled, randomised noninferiority trial."54
Introduction
Background: Item 2. Scientific background and explanation of rationale, including the rationale for using a noninferiority or equivalence design.
Example. "Up to 40 million children worldwide are estimated to suffer from vitamin A deficiency. . . . A dose of 200,000 IU retinyl palmitate to children over 1 year old is most widely used and has generally been regarded as safe and potentially effective. . . . In developing countries, animal products that provide retinyl esters are too expensive. . . . Vegetables and fruit . . . are cheap and good sources of vitamin A in the form of beta carotene. . . . Beta carotene is also considered to be virtually non-toxic. . . . In a preliminary study, . . . after 20 days there was a reversion of the clinical and subclinical signs of vitamin A deficiency in the study group. . . . Since beta carotene is the principal source of vitamin A in developing countries and is non-toxic, we compared retinyl palmitate and beta carotene for treatment of vitamin A deficiency."55
Elaboration. The rationale should cite evidence for the efficacy of the reference treatment. If previous trials, or their systematic review, demonstrate the superiority of the reference treatment relative to placebo, they should be cited with effect sizes and CIs. If no such trials exist, other evidence for efficacy of the reference treatment should be given. Evidence for other advantages of the new treatment over the reference treatment, if present, should be given, to justify use of the new treatment, if not inferior. One aim of the current trial might be to provide or support such evidence. In the case of "me-too" drugs, it should be clear whether there are other advantages.
Methods
Participants: Item 3. Eligibility criteria for participants (detailing whether participants in the noninferiority or equivalence trial are similar to those in any trial[s] that established efficacy of the reference treatment) and the settings and locations where the data were collected.
Example. "From Sept 1, 1992, to Dec 30, 1994, we enrolled 6628 men and women in 312 health centres in Sweden . . . who had hypertension (blood pressure ≥180 mm Hg systolic, ≥105 mm Hg diastolic, or both), aged 70-84 years. The only difference in inclusion criteria between this trial and the STOP-Hypertension trial was that patients with isolated systolic hypertension could be included in STOP-Hypertension-2, based on previous positive findings in patients with isolated systolic hypertension treated with diuretics and calcium antagonists."56
Elaboration. Relevant changes in participants' characteristics compared with previous trial(s) should be reported and explained. Clinical trial participants differ, particularly if time has elapsed between trials; therefore, such a description should concentrate on relevant departures (those that might affect response to treatments).
Interventions: Item 4. Precise details of the interventions intended for each group, detailing whether the reference treatment in the noninferiority or equivalence trial is identical (or very similar) to that in any trial(s) that established efficacy, and how and when they were actually administered.
Example. "[W]e randomly assigned women about to deliver vaginally to receive 600 µg misoprostol orally or 10 IU oxytocin intravenously or intramuscularly, according to practice. . . . The use of uterotonic agents [oxytocin, a type of uterotonic, is the reference treatment] in the management of the third stage of labour reduces the amount of bleeding and the need for blood transfusion . . . "57
(The authors reference a Cochrane systematic review, showing that uterotonic agents reduced bleeding and blood transfusions compared with placebo.)
Elaboration. Any differences between the control intervention in the trial and the reference treatment in the previous trial(s) in which efficacy was established should be reported and explained. For example, differences may exist because background treatment and patient management change with time and concomitant therapies may differ.27 Dose changes may occur: if the dose of the reference treatment is reduced, it might result in reduced efficacy; if it is increased, possibly leading to tolerability problems, the new treatment's advantages could be overestimated.
Objectives: Item 5. Specific objectives and hypotheses, including the hypothesis concerning noninferiority or equivalence.
Example. "[A] bodyweight-adjusted single bolus of 0.50-0.55 mg/kg tenecteplase would be equivalent to a 90 min regimen of alteplase for efficacy and safety [the primary endpoint for efficacy was all-cause 30-day mortality from acute myocardial infarction]. In this double-blind, randomised, controlled study, we formally tested this hypothesis."13
Elaboration. The authors should specify for which outcomes noninferiority or equivalence hypotheses apply and for which superiority hypotheses apply. Usually the noninferiority or equivalence hypothesis refers to the primary end point, whereas the new treatment is expected to offer other advantages, eg, fewer adverse effects, cost.
Outcomes: Item 6. Clearly defined primary and secondary outcome measures, detailing whether the outcomes in the noninferiority or equivalence trial are identical (or very similar) to those in any trial(s) that established efficacy of the reference treatment and, when applicable, any methods used to enhance the quality of measurements (eg, multiple observations, training of assessors).
Example. "Over the past decade seven large, randomised, placebo-controlled trials involving a total of 16,770 patients who underwent percutaneous interventions have established that the overall reduction in the risk of death or nonfatal myocardial infarction 30 days after adjunctive inhibition of platelet glycoprotein IIb/IIIa receptors is 38 percent. Three glycoprotein IIb/IIIa inhibitors were assessed in these trials. The primary end point [in the present trial] was a composite of death, nonfatal myocardial infarction, or urgent target-vessel revascularization within 30 days after the index procedure."58
Elaboration. Any differences in outcome measures in the new trial compared with trial(s) that established efficacy of the reference treatment should be noted and justified. In particular, note any changes in timing of evaluation. Ideally, outcomes should remain unchanged, but often insights do lead to change as the understanding, management, and prognosis of a disease improve. For example, early AIDS trials used death outcomes, then deaths became uncommon, so they shifted to AIDS clinical events, then clinical events became uncommon, so they shifted to surrogate markers.
Sample Size: Item 7a. How sample size was determined, detailing whether it was calculated using a noninferiority or equivalence criterion, and specifying the margin of equivalence with the rationale for its choice.
Examples. "Considering previous studies, a primary event rate of 3.1% per year was estimated for patients in both treatment groups. To obtain 90% statistical power with a 1-sided α equal to 0.025, approximately 1600 patient-years of exposure per treatment groups are necessary to establish the noninferiority of ximelagatran compared with dose-adjusted warfarin within 2% per year. . . . Assuming an average follow up of 16 months, approximately 2400 patients are required."59
"Sample size was based on . . . [an] 8.0% primary quadruple end point event rate in the control (heparin plus Gp IIb/IIIa blockade) group [reference treatment] and a 12.5% relative reduction in the bivalirudin arm. Using a 2-sided α level of .05 and 3000 patients per group, the trial had a 99% power to detect superiority over the imputed heparin control [historical control] and a 92% power to satisfy noninferiority criteria relative to heparin plus Gp IIb/IIIa."46
Elaboration. The margin of noninferiority or equivalence should be specified, and preferably justified on clinical grounds. Its relation to the effect of the reference treatment relative to placebo in any previous trials should be noted (see second example).
Sample size calculations are usually based on the assumption that the point estimate of the difference between treatments will be 0 (as in the first example above). Examples F and G in the Figure would have met the noninferiority criterion had the observed point estimates been 0. That is, the precision of the estimates would have been adequate, had the 2 treatments been equally effective. With a large enough sample, it is possible to demonstrate noninferiority even when the point estimate is between 0 and Δ. If the true effect is assumed to be greater than 0, the sample size will need to be increased, perhaps substantially.
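The effect of the assumed true difference on sample size can be sketched with the figures quoted in the first example (3.1% per year event rate, margin of 2% per year, 1-sided α of 0.025, 90% power, 16 months of average follow-up). The calculation below assumes event counts are approximately Poisson, so that the variance of an estimated rate is the rate divided by patient-years; this is an illustrative approximation and not necessarily the method the trial authors used. The alternative rate of 3.6% per year is hypothetical.

```python
from statistics import NormalDist


def patient_years_per_group(rate_ref, rate_new, margin,
                            alpha_one_sided=0.025, power=0.90):
    """Patient-years per group needed to show noninferiority within `margin`,
    given assumed true event rates per patient-year (Poisson approximation)."""
    true_diff = rate_new - rate_ref
    if true_diff >= margin:
        raise ValueError("assumed true difference must be smaller than the margin")
    z_total = (NormalDist().inv_cdf(1 - alpha_one_sided)
               + NormalDist().inv_cdf(power))
    return z_total ** 2 * (rate_ref + rate_new) / (margin - true_diff) ** 2


# Equal true rates, as assumed in the quoted calculation.
equal = patient_years_per_group(0.031, 0.031, margin=0.02)
# Hypothetical: new treatment truly worse by 0.5% per year.
worse = patient_years_per_group(0.031, 0.036, margin=0.02)

# Convert patient-years to patients with 16 months of average follow-up.
patients_total = 2 * equal / (16 / 12)

print(round(equal), round(worse), round(patients_total))
```

With equal true rates the sketch gives roughly 1600 patient-years per group and about 2400 patients in total, in line with the quoted figures; assuming a true difference of 0.5% per year nearly doubles the requirement.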
Stopping Rules: Item 7b. When applicable, explanation of any interim analyses and stopping rules (and whether related to a noninferiority or equivalence hypothesis).
Example. "Interim safety analyses were planned when 40 and 70 percent of the total number of women had been enrolled. An increased rate of HIV transmission associated with the shorter regimens, as compared with the long-long regimen, would be considered significant if any of the nominal P values for the differences were less than 0.007 in the first interim analysis and less than 0.012 in the second. . . . "38
Elaboration. It is customary to base interim stopping criteria on P values, and these adjusted P values are analogous to widened CIs.
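The analogy between adjusted P values and widened CIs can be made concrete. The sketch below assumes the nominal P values in the example are 2-sided and that a normal approximation applies; both are assumptions, since the quoted text does not specify them.

```python
from statistics import NormalDist


def ci_level(nominal_p_two_sided):
    """Coverage of the CI that corresponds to a 2-sided nominal P threshold."""
    return 1 - nominal_p_two_sided


def z_multiplier(nominal_p_two_sided):
    """Number of standard errors spanned on each side of the estimate."""
    return NormalDist().inv_cdf(1 - nominal_p_two_sided / 2)


for p in (0.05, 0.012, 0.007):
    print(f"P < {p}: {ci_level(p):.1%} CI, estimate +/- {z_multiplier(p):.2f} SE")
```

A threshold of 0.007 therefore corresponds to a 99.3% CI, about 2.7 standard errors on each side rather than the 1.96 of a conventional 95% CI.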
Statistical Methods: Item 12. Statistical methods used to compare groups for primary outcome(s), specifying whether a 1- or 2-sided confidence interval approach was used. Methods for additional analyses, such as subgroup analyses and adjusted analyses.
Examples. Binary outcome. "The proportion of the intention-to-treat population experiencing primary events per year for both treatment groups, and the associated 1-sided 97.5% CI for the difference, will be estimated using the time to first event . . . The noninferiority margin (Δ) defined in the primary analysis is based on absolute event rate differences . . . Noninferiority of ximelagatran over warfarin will be accepted [in a 0.025 level test] if the upper bound of the 97.5% CI around the estimated difference in primary event rates lies below Δ. For these studies, an absolute Δ of 2% was adopted. . . . "59
Continuous outcome. "Regimens were regarded as equivalent if the difference between treatments in change in FEV1 (using 95% CI) was less than 4% of predicted FEV1 . . . Since we were undertaking an equivalence study, the primary analysis was per protocol but an intention-to-treat analysis was also undertaken. The mean difference between treatments and 95% CI for the true difference was obtained from analysis of variance, with adjustment for centre and type of clinic. . . . "15
Elaboration. The upper bound of the 1-sided (1 − α) × 100% CI (or correspondingly, the upper bound of the 2-sided (1 − 2α) × 100% CI) for the treatment effect has to be below the margin Δ to declare that noninferiority has been shown, with a significance level α. Both α and Δ should be prespecified in the noninferiority hypothesis.
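As a minimal sketch of this decision rule for a binary adverse outcome, the function below computes a Wald (normal approximation) upper confidence bound for the difference in event proportions and compares it with the margin. The event counts are hypothetical, and the Wald interval is an assumption made for simplicity; a trial's prespecified method may differ.

```python
from math import sqrt
from statistics import NormalDist


def noninferiority_by_ci(events_new, n_new, events_ref, n_ref,
                         margin, alpha=0.025):
    """Return (difference, upper bound, noninferiority shown?).

    The difference is new minus reference; the upper bound is that of the
    1-sided (1 - alpha) x 100% Wald CI.
    """
    p_new = events_new / n_new
    p_ref = events_ref / n_ref
    diff = p_new - p_ref
    se = sqrt(p_new * (1 - p_new) / n_new + p_ref * (1 - p_ref) / n_ref)
    upper = diff + NormalDist().inv_cdf(1 - alpha) * se
    return diff, upper, upper < margin


# Hypothetical trial: 60/1000 events with new treatment, 55/1000 with reference.
print(noninferiority_by_ci(60, 1000, 55, 1000, margin=0.03))
print(noninferiority_by_ci(60, 1000, 55, 1000, margin=0.02))
```

The same data support noninferiority with a margin of 3% but not with a margin of 2%, which is why both the significance level and the margin need to be fixed in advance.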
Results
Numbers Analyzed: Item 16. Number of participants (denominator) in each group included in each analysis and whether "intention-to-treat" and/or alternative analyses were conducted. State the results in absolute numbers when feasible (eg, 10/20, not just 50%).
Example. "Efficacy variables were analyzed on an intent-to-treat basis . . . and on an as-treated basis. In the intent-to-treat analysis, patients were considered treatment failures if they made any treatment changes, prematurely discontinued randomized treatment for any reason, or had missing data for 2 consecutive evaluations. In the as-treated analysis, only data from patients continuing randomized treatment were considered for analysis."60
Outcomes and Estimation: Item 17. For each primary and secondary outcome, a summary of results for each group and the estimated effect size and its precision (eg, 95% confidence interval). For the outcome[s] for which noninferiority or equivalence is hypothesized, a figure showing confidence intervals and margins of equivalence may be useful.
Examples. Inferiority of new treatment, figure legend. "Relative risk of blood loss of 1000 mL or more with misoprostol compared with oxytocin [1.39, 95% CI 1.19 to 1.63]. Vertical dotted lines represent margins of clinical equivalence determined a priori [0.74 and 1.35 on the relative scale]. Solid line represents null effect."57
(A figure similar to case G in the Figure was presented on the relative scale.)
Noninferiority of new treatment. "The primary quadruple composite end point of death, myocardial infarction, urgent repeat revascularization, or in-hospital major bleeding by 30 days occurred in 299 (10.0%) of 2991 patients in the heparin plus Gp IIb/IIIa inhibitor group vs 275 (9.2%) of 2975 patients in the bivalirudin group (OR, 0.92; 95% CI, 0.77-1.09; P=0.32). Relative to heparin alone, the imputed OR was 0.62 (95% CI, 0.47-0.82), satisfying statistical criteria for noninferiority to heparin plus Gp IIb/IIIa blockade and superiority to heparin alone."46
(A figure similar to case B in the Figure was presented on the relative scale but without the margin of noninferiority.)
Elaboration. In the first example the new treatment was inferior, but it was uncertain whether the treatment effect was smaller or larger than the margin of equivalence 1.35. The second example demonstrated noninferiority.
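Reporting absolute numbers allows readers to check such results. As an illustration, the sketch below recomputes an unadjusted odds ratio and a 95% CI from the counts quoted in the second example, using the standard log odds ratio (Woolf) approximation. This is an assumption about method made for illustration only; the trial's own analysis may have differed.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(events_new, n_new, events_ref, n_ref, level=0.95):
    """Unadjusted odds ratio (new vs reference) with a 2-sided Woolf CI."""
    a, b = events_new, n_new - events_new
    c, d = events_ref, n_ref - events_ref
    log_or = log((a / b) / (c / d))
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return exp(log_or), exp(log_or - z * se), exp(log_or + z * se)


# Counts quoted above: bivalirudin 275/2975, heparin plus Gp IIb/IIIa 299/2991.
odds_ratio, lower, upper = odds_ratio_ci(275, 2975, 299, 2991)
print(f"OR {odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The result agrees with the reported OR of 0.92 (95% CI, 0.77-1.09) to 2 decimal places.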
Comment
Interpretation: Item 20. Interpretation of the results, taking into account the noninferiority or equivalence hypothesis and any other trial hypotheses, sources of potential bias or imprecision and the dangers associated with multiplicity of analyses and outcomes.
Examples. Concluding noninferiority. "According to our definition of equivalence, the efficacy of the . . . long-short regimen (was) statistically equivalent to the efficacy of the long-long regimen . . . The upper limit of the 95 percent confidence interval for the difference between the rates in the two groups was 5.3 percent (close to the boundary of 6.0 percent)."38
Concluding inferiority of new drug (or conventional superiority of reference drug). "Although the trial was intended to assess the noninferiority of tirofiban as compared with abciximab, the findings demonstrated that tirofiban offered less protection from major ischemic events than did abciximab. . . . In order to meet the present definition of equivalence, the upper bound of the 95% confidence interval of the hazard ratio for the comparison of tirofiban with abciximab had to be less than 1.47. . . . The primary endpoint occurred more frequently among the 2398 patients in the tirofiban group than among the 2411 patients in the abciximab group (7.6 percent vs. 6.0 percent; hazard ratio, 1.26; . . . two-sided 95 percent confidence interval of 1.01 to 1.57, demonstrating the superiority of abciximab over tirofiban; P=0.038)."58
Concluding noninferiority of new drug from a trial designed to assess superiority. "The SYNERGY protocol prespecified that if enoxaparin was not demonstrated to be superior to unfractionated heparin, a noninferiority analysis was to be performed. . . . Enoxaparin was not superior to unfractionated heparin but was noninferior for the treatment of high-risk patients with non-ST-segment elevation ACS."44
Comment