Evaluation of event-free survival as a robust end point in untreated acute myeloid leukemia (Alliance A151614)

Jun Yin, Betsy LaPlant, Geoffrey L. Uy, Guido Marcucci, William Blum, Richard A. Larson, Richard M. Stone and Sumithra J. Mandrekar

Key Points

  • In randomized trials, treatment effect estimates based on the hazard ratios were unaffected by various EFS definitions.

  • In single-arm trials, incorrect conclusions about efficacy could be made if the EFS definition is not consistent with the historical control.


Event-free survival (EFS) is controversial as an end point for speeding approvals in newly diagnosed acute myeloid leukemia (AML). We aimed to examine the robustness of EFS, specifically timing of complete remission (CR) in defining induction failure and impact of hematopoietic cell transplantation (HCT). The study included 1884 untreated AML patients enrolled across 5 trials conducted through the Alliance for Clinical Trials in Oncology using anthracycline and cytarabine induction chemotherapy. EFS was defined as time from randomization/registration to induction failure, relapse, or death. Three definitions of induction failure were evaluated: failure to achieve CR by 60 days after randomization/registration, failure to achieve CR by the end of all protocol-defined induction courses, and failure to achieve CR by the end of all protocol-defined treatment. We considered either censoring or no censoring at time of non–protocol-mandated HCT. Although relapse and death are firm end points, the determination of induction failure was not consistent across studies. There was minimal impact of censoring at HCT on EFS estimates; however, median EFS estimates differed considerably based on the timing of CR in defining induction failure, with the magnitude of difference being large enough in most cases to lead to incorrect conclusions about efficacy in a single-arm trial, if the trial definition was not consistent with the definition used for the historical control. Timing of CR should be carefully examined in the historical control data used to guide the design of single-arm trials using EFS as the primary end point. Trials were registered at ClinicalTrials.gov as #NCT00085124, #NCT00416598, #NCT00651261, #NCT01238211, and #NCT01253070.


Acute myeloid leukemia (AML) is the most common acute leukemia in adults and is among the most lethal. In the United States, the annual incidence of AML is 19 000 cases, and the annual incidence of AML-associated deaths is 10 000.1 Although there has been significant research effort aimed at improving outcomes in AML, standard therapy for most subtypes of newly diagnosed AML remains suboptimal.1,2 Especially among patients age >60 years, outcomes are poor, with a 5-year overall survival (OS) of 10% to 20%; outcomes are even worse among older patients who are unfit for intensive chemotherapy, with a median OS of only 5 to 10 months.1,3,4

In parallel with research on new therapies, emphasis has been placed on new end points other than OS that may facilitate drug development and shorten the time to approval for use in AML.2,5 OS in comparative oncology clinical trials remains the gold-standard end point to assess efficacy of drugs for approval by the US Food and Drug Administration. However, use of OS as an end point requires following up participants until a sufficient number of deaths occur.2,6-8 For example, midostaurin was recently approved for patients with newly diagnosed FLT3-mutated AML based on a clinically significant improvement in OS.9 However, the protocol was amended to perform the primary OS analysis without waiting for the 509 originally planned OS events, because that number had not yet been reached, likely because of the higher-than-expected rate of hematopoietic cell transplantation (HCT). This protocol amendment also promoted consideration of event-free survival (EFS) as the most clinically meaningful secondary end point for assessment of the effect of midostaurin on the natural course of FLT3-mutated AML. The molecular, immunophenotypic, and biologic heterogeneity of AML has allowed the use of targeted agents tailored to individual patients, but these patient subsets are small, and trial recruitment is slow as a result of patient selection.10,11 This lengthens the clinical trial process. Therefore, there is an increased interest in surrogate end points that might provide faster assessments of novel agents to treat AML.

EFS is a common alternative end point in AML. It is similar to progression-free survival, which has been validated as a surrogate end point in pivotal randomized trials for solid tumors.12-14 In contrast to OS, where death is the only event of interest, EFS is a composite end point that includes death resulting from any cause, failure to achieve complete remission (CR), and relapse from CR as events. Most failures are from lack-of-remission events or from relapse from CR and occur within 1 year after treatment initiation.15,16 Therefore, EFS can be assessed much faster than OS. In addition, EFS is less affected by HCT or salvage therapy after relapse and takes into account the entire study population.17-19 These postfailure interventions are often unspecified in protocols and potentially bias or dilute the frontline treatment effect, because they can rescue some patients and prolong the survival of others. Therefore, EFS may be a more reliable end point than OS for assessment of the primary treatment effect.

Despite these attractive features, EFS is controversial as an end point for accelerating AML approvals. The controversy stems from potential confounding by the treatment failure components of EFS, such as induction failure and transplantation, and from inconsistent definitions of these components. As a result, many protocols use alternative definitions (revisions of the primary definition) for supportive or sensitivity analyses. The objective of this study was to better understand the different components of EFS and to examine the robustness of efficacy results using EFS as the primary end point, with varying induction failure definitions and censoring status at the occurrence of HCT.

Materials and methods

Trial identification

Since 2003, the Cancer and Leukemia Group B (CALGB; now part of the Alliance for Clinical Trials in Oncology) has completed 5 prospective trials using anthracycline and cytarabine chemotherapy for newly diagnosed patients with AML and reported their primary end point data. In the current study (A151614), we evaluated 1884 adults with newly diagnosed AML in these 5 Alliance (CALGB) trials (Table 1), including 2 randomized phase 3 trials (CALGB 10201, N = 506; CALGB 10603, N = 717) and 3 single-arm phase 2 trials (CALGB 10503, N = 546; CALGB 10801, N = 61; CALGB 11001, N = 54).9,20-23 CALGB 10201 enrolled AML patients age ≥60 years, and CALGB 11001 enrolled AML patients age ≥60 years with FLT3-mutated AML. CALGB 10503 enrolled AML patients age <60 years, and CALGB 10603 enrolled patients age <60 years with FLT3-mutated AML. CALGB 10801 enrolled patients age ≥18 years with favorable cytogenetic risk, core binding factor–positive AML. Cytarabine and daunorubicin were used for remission induction, together with other agents in some studies, and all 5 trials specified a second course of induction therapy for initial nonresponders.

Table 1. Trial characteristics

Statistical analysis

We evaluated 3 definitions of induction failure according to when CR was assessed: definition 1 (D1), failure to achieve CR by 60 days after registration/randomization; D2, failure to achieve CR by the end of all protocol-defined induction courses; and D3, failure to achieve CR by the end of all protocol-defined treatment. CR was defined, using the established response criteria for AML therapy,24 as <5% blasts in cellular marrow with recovery of >1000 neutrophils per µL (but >1500 neutrophils per µL in CALGB 10201; reference 25), >100 000 platelets per µL, and no red blood cell transfusion requirement. EFS was defined as the time from randomization/registration to induction failure using 1 of the 3 definitions, relapse, or death resulting from any cause. Patients who were last known to be alive without an EFS event were censored at the date of last contact. In addition to the 3 definitions of induction failure, we also considered either censoring or not censoring patients who underwent HCT at the time of HCT. EFS was separately estimated in each of the 5 Alliance (CALGB) trials using the Kaplan-Meier method,26 with data pooled from across the arms for the randomized trials. Medians and 95% confidence intervals (CIs) as well as EFS estimates at predetermined time points were reported. Treatment effects in the randomized trials were assessed by using the different EFS definitions to further understand how these definitions affected trial outcomes.
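The definitions above can be sketched as a small function. This is a minimal illustration, not the Alliance analysis code: the `Patient` record, its field names, and the example patient are hypothetical. It also assumes that an induction failure is dated at the definition's cutoff (day 60 for D1, end of induction for D2, end of protocol treatment for D3) and that a patient without CR is only scored as a failure if followed at least to that cutoff.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Patient:
    """Hypothetical record; all days count from randomization/registration."""
    last_contact_day: int
    induction_end_day: int
    treatment_end_day: int
    cr_day: Optional[int] = None
    relapse_day: Optional[int] = None
    death_day: Optional[int] = None
    hct_day: Optional[int] = None  # non-protocol-mandated HCT


def efs(p: Patient, definition: str, censor_at_hct: bool = False) -> Tuple[int, bool]:
    """Return (EFS time in days, True for an event or False if censored)."""
    # Cutoff by which CR must be achieved under each definition.
    cutoff = {"D1": 60, "D2": p.induction_end_day, "D3": p.treatment_end_day}[definition]

    candidate_events = []
    if p.death_day is not None:
        candidate_events.append(p.death_day)
    if p.cr_day is not None and p.cr_day <= cutoff:
        # CR achieved in time: relapse counts as an event.
        if p.relapse_day is not None:
            candidate_events.append(p.relapse_day)
    elif p.last_contact_day >= cutoff:
        # No CR by the cutoff: induction failure, dated at the cutoff (assumption).
        candidate_events.append(cutoff)

    censor_day = p.last_contact_day
    if censor_at_hct and p.hct_day is not None:
        censor_day = min(censor_day, p.hct_day)

    if candidate_events and min(candidate_events) <= censor_day:
        return min(candidate_events), True
    return censor_day, False


# Hypothetical patient: CR on day 75 (second induction), HCT on day 400, relapse on day 724.
late_cr = Patient(last_contact_day=724, induction_end_day=90, treatment_end_day=300,
                  cr_day=75, relapse_day=724, hct_day=400)
results = {d: efs(late_cr, d) for d in ("D1", "D2", "D3")}
results_censored = {d: efs(late_cr, d, censor_at_hct=True) for d in ("D1", "D2", "D3")}
```

For this hypothetical patient, D1 records an induction failure at day 60 whether or not HCT censoring is applied, because the event precedes the transplant. Under D2 and D3 the patient is in remission until relapse, so censoring at HCT changes the outcome from an event at day 724 to a censored observation at day 400.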

Each participant signed an institutional review board–approved, protocol-specific informed consent document in accordance with federal and institutional guidelines. Data collection and statistical analyses were conducted by the Alliance Statistics and Data Center. Results analyzed were available in our database as of 13 February 2018.

Results

In total, 1884 adult patients with newly diagnosed AML were included in this study from 3 single-arm and 2 randomized trials (Table 1), with a median follow-up of 57 months. Overall, 63% of patients died and 78% had an event (ie, induction failure within 60 days of treatment initiation, relapse, or death). Among the 5 studies, CALGB 10201 and CALGB 11001 enrolled older patients (median age, 67 and 69 years, respectively), whereas CALGB 10503 and CALGB 10603 enrolled younger patients (median age, 47 and 48 years, respectively; Table 2). Approximately half of the patients were men in all trials except for CALGB 10201 (61% male). Younger AML patients had better performance status. The percentage of deaths was higher in the AML trials with older patients (range, 70%-92%) compared with those with younger patients (range, 50%-56%), with varying length of follow-up on patients still alive. In addition, the percentage of EFS events was higher in the AML trials with older patients (range, 80%-94%) compared with younger AML patients (range, 72%-75%). The patients with favorable cytogenetic risk (core binding factor–positive AML) enrolled in CALGB 10801 had the lowest death rate (21%) and EFS event rate (36%).

Table 2. Summary of studies

Depending on when CR status was assessed, the induction failure rate differed across D1 to D3. Overall patterns were consistent among the 5 trials (Table 3): D1 (CR by 60 days from randomization/registration) yielded the highest induction failure rate, D3 (CR by end of all protocol-defined treatment) yielded the lowest, and D2 (CR by end of protocol-defined induction therapy) yielded an intermediate rate. Differences among definitions within individual trials ranged between 3% and 8% across the 5 trials. These differences carried through to the EFS estimates.

Table 3. EFS estimates determined by using different induction failure definitions with and without censoring at HCT

Median EFS estimates differed considerably according to the timing for CR used to define induction failure (Table 3). Consistently, EFS estimates determined with D3 (end-of-treatment definition) were the highest, EFS estimates determined with D2 (end-of-induction definition) were intermediate, and EFS estimates determined with D1 (60-day definition) were the most conservative. This was expected, because a later cutoff for achieving CR classifies fewer patients as induction failures and therefore yields longer event-free times. Furthermore, when examining the effect over time, we observed that the differences in EFS estimates using the 3 definitions diminished over time (Figure 1). At 1 year after treatment initiation, the estimated EFS rates ranged from 21% to 24% in CALGB 10201, 45% to 48% in CALGB 10503, 37% to 46% in CALGB 10603, and 78% to 83% in CALGB 10801; the estimated EFS rates were 35% for all definitions in CALGB 11001 (Figure 1). For D3 compared with D1, the difference in 1-year EFS reached as much as 9 percentage points.
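To see how the induction failure definition can move a median, the sketch below implements a bare-bones Kaplan-Meier estimator and applies it to a toy cohort. The cohort is hypothetical and not drawn from any of the 5 trials; the function names `km_curve` and `km_median` are illustrative. The median is taken as the first time at which the estimated survival falls to 50% or below, which is one common convention.

```python
from fractions import Fraction
from typing import List, Optional, Tuple


def km_curve(times: List[int], events: List[bool]) -> List[Tuple[int, Fraction]]:
    """Kaplan-Meier survival estimate at each distinct event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = Fraction(1)
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = censored = 0
        # Count events and censorings tied at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            else:
                censored += 1
            i += 1
        if deaths:
            surv *= Fraction(n_at_risk - deaths, n_at_risk)
            curve.append((t, surv))
        n_at_risk -= deaths + censored
    return curve


def km_median(times: List[int], events: List[bool]) -> Optional[int]:
    """First time at which the KM estimate is <= 1/2, or None if not reached."""
    for t, s in km_curve(times, events):
        if s <= Fraction(1, 2):
            return t
    return None


# Toy cohort of 6 patients (days). Under D1, three patients without CR by
# day 60 are scored as induction failures at day 60.
times_d1 = [60, 60, 60, 200, 500, 800]
events_d1 = [True, True, True, True, True, False]

# Under D2, two of those three achieved CR during a second induction course:
# one later relapsed on day 400 and one was still event free on day 900.
times_d2 = [60, 400, 900, 200, 500, 800]
events_d2 = [True, True, False, True, True, False]

median_d1 = km_median(times_d1, events_d1)
median_d2 = km_median(times_d2, events_d2)
```

In this toy cohort the same patients give a median EFS of 60 days under D1 and 400 days under D2, purely because of when CR is assessed. The late portions of the two curves are driven by the same relapses and deaths, which is in line with the observation that differences between definitions shrink over time.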

Figure 1. Kaplan-Meier (KM) curves of EFS by induction failure definition (fail def) D1 to D3, with and without censoring at HCT. KM curves of EFS in CALGB 10201 (A), CALGB 10503 (B), CALGB 10603 (C), CALGB 10801 (D), and CALGB 11001 (E). Green curves represent EFS estimates (est) by D1 (no CR by 60 days after registration/randomization), blue curves represent EFS estimates by D2 (no CR by the end of all protocol-defined induction courses), and red curves represent EFS estimates by D3 (no CR by the end of all protocol-defined treatment). Solid curves represent EFS analysis with censoring at non–protocol-specified HCT, and dashed curves represent EFS analysis without censoring at HCT.

The effect of HCT on EFS estimation was also investigated. The percentages of patients who underwent HCT in first CR or after relapse were 9% in CALGB 10201, 35% in CALGB 10503, 57% in CALGB 10603, 8% in CALGB 10801, and 41% in CALGB 11001 (Table 3). However, the percentage of patients who were actually censored at the time of HCT was approximately half of that (range, 3%-28%), indicating that an event of interest (relapse or death) occurred before transplantation in approximately half of the patients who underwent HCT. Given that a majority of patients were not affected by censoring at HCT (ie, their events happened before HCT), results on EFS were similar regardless of whether patients were censored at HCT (Table 3; Figure 1).

Additional sensitivity analyses were conducted by using D2 to calculate EFS, wherein the reported induction end date was replaced with the date of last clinical assessment during induction. Similarly, the last follow-up date was replaced with the clinical assessment date, which was defined as the last contact date and used to calculate EFS. Furthermore, to evaluate the effect of transplantation, only allogeneic HCT was considered instead of all types of HCT. All of these sensitivity analyses yielded similar results and consistent conclusions.

For the 2 randomized phase 3 trials (CALGB 10201, CALGB 10603), analyses using the 3 induction failure definitions led to the same conclusions about treatment effect, and comparisons between the arms in each trial were consistent across the definitions (Table 4). For example, in CALGB 10603, it was concluded that the addition of multitargeted kinase inhibitor midostaurin to standard chemotherapy significantly prolonged EFS among patients with AML and an FLT3 mutation. In our analysis, EFS estimates were significantly longer in the midostaurin arm than in the placebo arm, with a hazard ratio (HR) ranging from 0.71 to 0.79 depending on the induction failure definition used. In contrast, the addition of oblimersen to standard chemotherapy failed to improve the outcomes of older AML patients in CALGB 10201, regardless of induction failure definition.

Table 4. EFS estimates determined by using different induction failure definitions for randomized trials CALGB 10201 and CALGB 10603

Discussion

Appropriate sensitivity analyses for the primary efficacy end point and the key secondary efficacy end points are often required by regulatory agencies to evaluate the robustness of efficacy results.27 For example, the potential bias caused by timing and scheduling of disease progression assessments has received much attention and is well documented.28 However, specific to AML, no studies so far have systematically considered the potential confounding events; for example, non–protocol-mandated HCT and induction failure leading to changes in treatment. In this analysis, we examined the robustness of EFS in measuring clinical benefit in untreated AML using individual patient data across studies, and we provide recommendations on trial design using EFS as an end point.

Although relapse and death are firm end points, the determination of induction failure is not consistent across studies. Median EFS estimates differed considerably depending on the timing used to define induction failure, and the magnitude of the difference ranged from 14% to 115%. In all 5 studies of untreated AML patients who received standard intensive induction chemotherapies, EFS estimates determined by D3 (failure to achieve CR during the entire protocol treatment) were consistently the highest, because this definition allows the longest time in which to achieve CR. EFS estimates determined by D1 (failure to achieve CR by 60 days) were the most conservative, and EFS estimates determined by D2 (failure to achieve CR after all induction therapies) were intermediate. This suggests that incorrect conclusions about efficacy could be made in a single-arm trial if the definition of EFS used in the trial were inconsistent with the definition used to determine the historical control estimate. However, in the randomized trials, HRs were not susceptible to such bias. Therefore, more emphasis should be placed on using HRs to measure clinical benefit instead of the Kaplan-Meier estimates of EFS medians in trials with a concurrent control.
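The single-arm concern can be made concrete with a line of arithmetic. The sketch below assumes the relative difference is computed against the lower (more conservative) median, which is one plausible reading of the percentages quoted above rather than a documented formula; the medians used are hypothetical and not taken from Table 3.

```python
def relative_difference(reference_median: float, other_median: float) -> float:
    """Percentage by which other_median exceeds reference_median."""
    return 100.0 * (other_median - reference_median) / reference_median


# Hypothetical medians in months, for illustration only.
historical_median_d1 = 6.0  # historical control scored with the 60-day definition (D1)
same_regimen_d3 = 9.0       # same regimen, same patients, scored with D3

apparent_gain = relative_difference(historical_median_d1, same_regimen_d3)
```

Here `apparent_gain` is 50%, even though nothing about the treatment has changed. A single-arm trial powered to detect an improvement of that size over the historical median could declare success on the strength of the definition mismatch alone.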

To further explore the reason behind the discrepancy between EFS using D1 vs D2, we specifically studied patients who were considered to have induction failure by D1 but not by D2. For example, in CALGB 10603, 62 patients achieved a CR more than 60 days after randomization but before their reported end date of all induction courses. Of these 62 patients, only 1 achieved a CR during the first induction; the other 61 patients achieved a CR during their second induction therapy. Because of the timing of their CRs, using D1, these patients were considered to have induction failures with their EFS time equal to 60 days (1.97 months), whereas using D2, they were considered to be in remission until relapse or death, with a median EFS of 23.8 months (95% CI, 11.8 to not reached). These data are included in supplemental Table 3. Similarly, when comparing D2 vs D3, we identified 25 patients from CALGB 10603 who achieved a CR postinduction; the median time to CR in this group was 42 days postinduction (interquartile range, 23-85 days), suggesting that they were receiving protocol consolidation therapy when they achieved a CR.

Although we evaluated the effect of the timing of remission assessment, a limitation of the study is that we did not evaluate the effect of what constitutes a remission in the definition of EFS. For example, less stringent response criteria include CR without platelet recovery and CR with incomplete hematologic recovery. Another limitation of our study is that we did not take into account adherence to the protocol induction therapy. For example, in another AML trial (H049),6,29 >90% of patients who did not have a CR after the first course of induction received a second course as required by the protocol, whereas in trial S0106,30 the adherence rate was only ∼55%. This may relate to differences in clinical practice; in the United States, physicians may remove a patient from protocol therapy on day 14 if bone marrow biopsy results indicate a high burden of disease, but in Europe, 2 cycles of induction are customarily administered regardless of poor early response. However, because induction failure is delayed when a second course of induction is received by initial nonresponders, the EFS becomes longer just by improving protocol adherence. In addition, subsequent treatment information, including HCT, was not systematically recorded in the clinical trials considered in this analysis, and this is another limitation. Statistical approaches, such as competing risk analysis and multistate models, are more appropriate for investigation of the effect of off-protocol transplantation on these outcomes.

Defining the clinical benefit of therapy in AML is complicated by disease- and treatment-specific considerations. Although CR has been uniformly defined in AML and is a more tractable end point, several recent studies have indicated a dissociation between CR and OS (ie, improvement in CR but no improvement in OS compared with controls) under some conditions.31,32 Minimal (or measurable) residual disease has been used as an alternative end point to support the regulatory approval of agents for both chronic myeloid leukemia and acute lymphoblastic leukemia. The use of minimal residual disease in AML has been hampered by the lack of standardized assessments. In contrast, EFS reaches an end point sooner than OS and, unlike OS, is not influenced by subsequent treatments administered after failure to achieve or maintain a remission; it thus provides a more precise assessment of the efficacy of a particular drug.

The analysis presented here is different from a surrogacy analysis of EFS to OS, where 2 levels of correlations (ie, individual patient-level correlation and trial-level correlation) need to be formally established using a meta-analytical approach.33 Ongoing work is focused on collecting data from trials conducted across the National Cancer Institute National Clinical Trials Network to perform a formal surrogacy analysis of EFS to OS.

To conclude, this is by far the largest study (N = 1884) to evaluate the robustness of EFS and consider potentially confounding events using individual patient data collected from 5 CALGB trials. Although analysis of the randomized trials revealed that the trial conclusions remained unaffected, median EFS estimates differed considerably based on the timing of CR used in defining induction failure. The magnitude of difference in these estimates could be large enough in most cases to lead to incorrect conclusions about efficacy in a single-arm trial, if the trial definition were not consistent with the definition used for the historical control. Therefore, the timing of CR should be carefully examined in the historical control data used to guide the design of single-arm trials using EFS as the primary end point.


Supported by the National Cancer Institute, National Institutes of Health, under awards U10CA180821 and U10CA180882 (to the Alliance for Clinical Trials in Oncology), U10CA180833, U10CA180836, U10CA180850, and U10CA180867 and in part by funds from Bayer AG (CALGB 11001), Bristol-Myers Squibb (CALGB 10801), and Novartis (CALGB 10603).

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.


Contribution: J.Y. designed the study, performed the statistical analysis, and wrote the first draft of the manuscript; B.L. performed the statistical analysis and reviewed and revised the manuscript; G.L.U., G.M., W.B., R.A.L., and R.M.S. provided the data and reviewed and revised the manuscript; S.J.M. conceptualized and designed the study, coordinated data collection, and reviewed and revised the manuscript; and all authors participated in the editing of the manuscript and approved the final version.

Conflict-of-interest disclosure: The authors declare no competing financial interests.

Correspondence: Jun Yin, Mayo Clinic, 200 First St SW, Rochester, MN 55905; e-mail: yin.jun{at}


  • The full-text version of this article contains a data supplement.

  • Submitted September 14, 2018.
  • Accepted March 15, 2019.

