AIDS:
23 October 2010 - Volume 24 - Issue 16 - p 2423–2431
doi: 10.1097/QAD.0b013e32833dd0ec
Editorial Review
Survival analysis in infectious disease research: describing events in time
Cole, Stephen Ra,c; Hudgens, Michael Gb,c
aDepartment of Epidemiology, USA
bDepartment of Biostatistics, USA
cCenter for AIDS Research, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, North Carolina, USA.
Received 2 March, 2010
Revised 29 June, 2010
Accepted 30 June, 2010
Correspondence to Dr Stephen R. Cole, University of North Carolina, Chapel Hill, North Carolina, USA. Tel: +1 919 966 7415; fax: +1 919 966 2089; e-mail: cole@unc.edu
Survival analysis methods can be used in infectious disease research to describe the occurrence and timing of clinical or other events subject to censoring and truncation. Here, the survival, hazard, and cumulative hazard functions are defined and simple nonparametric estimators are provided using an illustrative example of survival after AIDS diagnosis. An understanding of these foundational measures is central for an informed use of the survival analysis methods common in infectious disease research.
Survival analysis is a set of methods that can be used to describe the occurrence and timing of clinical or other events [1–4]. The target of inference for survival analysis is the time between an origin and event. For instance, in the example below we are interested in the survival time from AIDS diagnosis until death. Survival analysis is crucial when observed data are censored or truncated. Censoring and truncation are common in infectious disease research. Censoring occurs when we do not know the exact time of an event, but we do know the event occurred before or after a known time, or within a given interval. Here we consider only censoring after a known time (i.e., right censoring), which is the most common form of censoring in biomedical research. Such censoring may be due to completing the study free of the event (i.e., administrative censoring) or loss to follow-up free of the event but before completing the study (i.e., drop out). Truncation occurs when we do not observe individuals with event times that are smaller or larger than certain values. Here we consider only left truncation (i.e., not observing individuals with small event times), which is the most common form of truncation in biomedical research. Such truncation occurs when we begin observation after some or all individuals have already been at risk for the event of interest. For example, in Fig. 1, we present left truncated and right censored data for 78 men followed from the later of AIDS diagnosis or 1995 through the earliest of death, drop out, or 1998.
Existing introductions to survival analysis (e.g., [5,6]) typically ignore truncation, move quickly to group comparisons, and do not concentrate on infectious diseases. The intent of this paper is to clarify foundations for survival analysis that are standard in modern statistics, but not universally understood by clinicians and (nonstatistician) scientists working in infectious diseases. We concentrate on clarifying foundations of survival analysis, a prerequisite for group comparisons. First, we provide a motivating example.
Motivating example: survival after AIDS
Say we are interested in describing survival after AIDS. In this example, the origin is AIDS diagnosis and the event is all-cause mortality. The time between the origin and event is AIDS duration. Suppose we conducted a cohort study and enrolled 42 men alive on 1 January 1995 with a prior clinical AIDS diagnosis. These 42 men constituted a ‘prevalent’ or left-truncated cohort, because the 42 men began observation after their origin (i.e., AIDS diagnosis) for the event death after AIDS. The median (quartiles) AIDS duration at study enrollment for these 42 men was 1.42 (0.34–2.18) years.
We then prospectively enrolled 36 additional men beginning at their clinical AIDS diagnosis between 1 January 1995 and 1 January 1998. We followed all 78 (= 42 + 36) men for all-cause mortality through 1 January 1998, the date of study completion. Typically, a minimum amount of time between study enrollment and the date of study completion is required, but here we made no such requirement. The 36 additional men constituted an ‘incident’ cohort, because each of the 36 additional men was observed from their origin (i.e., AIDS diagnosis) for the event death after AIDS. Here ‘prevalent’ and ‘incident’ are used in conjunction with the origin of the time at risk, rather than their typical use in conjunction with the event.
To carefully describe survival analysis methodology, we use the following mathematical notation. A symbol key is presented in Table 1. In general, uppercase letters denote random variables and lowercase letters denote possible realizations of random variables, or constants. Let W denote the years from AIDS diagnosis to study enrollment. For incident AIDS patients, who are enrolled at AIDS diagnosis, W is equal to 0. Let T denote the years from AIDS diagnosis to death and let C denote the years from AIDS diagnosis to right censoring. In practice, we only get to observe the minimum of T and C, which we denote by T* = min (T,C). Let D equal to 1 if death occurred before censoring and D equal to 0 otherwise. Finally, let the subscript i = 1 toN index the N = 78 men. For instance, Di = 1 if individual i died during follow-up (i.e., T*i = Ti).
Data for these 78 men are provided in Table 2. Reading from Table 2, individual 1 was diagnosed with AIDS on 1990.425, enrolled in the study W1 = 4.575 years after AIDS diagnosis, and remained alive (D1= 0) at study completion on 1998.0 at T*1 = 7.575 years after AIDS diagnosis. These data were obtained from version 11 of the Multicenter AIDS Cohort study [7] public use dataset. Dates in the public use data are given by month and year, so we randomly selected the day of the month to obtain exact dates. Randomly assigning dates in practice is not recommended in general. The intention here is to illustrate methods using data in the typical form, while both allowing public consumption of the example data and protecting anonymity. Dates will be available in most infectious disease research settings and are available in the nonpublic Multicenter AIDS Cohort Study data.
Each man's time on study is the difference between study entry and exit, that is, T*i – Wi years. Person-years at risk are defined as the sum of each man's time on study, or
Immortal person-time in a study occurs when the individual contributing that person-time could not, by study criteria, have died or been censored: immune person-time is similar, except it pertains to outcomes other than death [8,9]. In our example, there were
person-years at risk for death under observation and
immortal person-years contributed by the 42 men in the prevalent cohort.
Figure 1 presents a pair of time line diagrams for the 78 men, wherein the vertical axis is an individual identifier, which ranges from 1 to 78. InFig. 1(a), the horizontal axis is calendar year. A man's time line begins at the date of AIDS diagnosis. If this date is before 1 January 1995, the line is dashed, denoting immortal time as the number of years between AIDS diagnosis and study entry (W); after 1 January 1995, the line is solid, denoting time at-risk for death after AIDS during the study. The 27 deaths are denoted by time lines that end with dots. The seven drop outs are denoted by time lines that end before the administrative censoring date of 1 January 1998. For instance, reading from Table 2, individual 11 was diagnosed with AIDS on 1992.825, enrolled in the study W1 = 2.175 years after AIDS diagnosis, and remained alive (D1 = 0) when lost to follow-up on 1996.667 at T*1 = 3.842 years after AIDS diagnosis. In Fig. 1(b), the horizontal axis is years from AIDS diagnosis. Time line plots like those in Fig. 1 provide an important role in the exploratory analysis of survival data. Data quality is easily assessed using such simple data displays that convey a single visualization of pertinent aspects of the entire dataset. Next we review standard survival analysis methods that can be used to draw inferences from survival data such as our AIDS example.
The survival, hazard, and cumulative hazard functions
Survival is a function of time and is typically denoted by S(t). Survival is the probability that the random variable T is greater than some specified time t. A formal definition is provided in appendix A. In the AIDS example, survival at time t is the probability of not dying within tyears of AIDS diagnosis.
The hazard is the instantaneous rate of events at time t and is typically denoted by h(t). The hazard at time t is a ratio of the probability density of an event to the survival both at time t. In discrete time, the hazard at time t is the conditional probability of an event at, given survival to, time t. Returning to the AIDS example, the hazard is the rate at which men in the population die t years after AIDS diagnosis.
The cumulative hazard function is defined as a sum (or integral) of the hazard function over time and should not be confused with the complement of the survival function (see appendix A). Plots of the log of the cumulative hazard function are useful in choosing among candidate parametric survival regression models [10], as well as for assessing the proportional hazards assumption when using Cox regression models [11]. Next we describe how to obtain estimates of the survival, hazard, and cumulative hazard functions using the example AIDS data.
Estimators of the survival, hazard, and cumulative hazard functions
We concentrate on nonparametric estimators, which do not posit a parametric form for the survival function (more detail is provided in the Discussion). Again, formal definitions are provided in appendix A. To define nonparametric estimators, we rank-order and number the distinct observed event times Ti as shown in Table 3 in the leftmost columns. Let Rk denote the kth ranked event time. In our example, we have one tied event time T44 = T46 = 1.619 years from AIDS diagnosis, so whereas there are 27 events, there are only 26 rows inTable 3 (plus one row for k = 0). Let Yk be the number of individuals who died at the kth ranked event time. In Table 3, the number of events, Yk, is equal to 1, except for the 12th event time, wherein there are two deaths (i.e., Y12 = 2).
Let Nk be the number of individuals at-risk for mortality while under observation at the kth ranked event time. Nk is the size of the ‘risk set’ at the kth ranked event time. Note an individual is immortal for any event between the origin at time 0 and study entry at time Wi and is, therefore, not included in the risk set until they enter the study. However, individuals who are censored exactly coincident with an event at time Rk are considered to be at risk at that time and, therefore, are included in the risk set Nk. In Table 3, Nk ranges from 45 individuals at-risk at 0.962 years from AIDS diagnosis to 11 individuals at-risk at 4.688 years from AIDS diagnosis.
A nonparametric estimator of the survival function is the product-limit or Kaplan–Meier estimator [12], which is defined as a cumulative product of the estimated probability of not incurring an event (see appendix A). Note, because the size of the risk sets Nk account for late entries (i.e., Wi ≥ 0), this is sometimes called the extended Kaplan–Meier estimator [13]. This Kaplan–Meier estimator implicitly imputes unseen truncated events due to some individuals entering follow-up after the origin and imputes event times for individuals who are censored from follow-up without the event [14,15]. The Kaplan–Meier estimator steps at event times and is flat elsewhere. There are other nonparametric estimators of the survival function (see appendix A), but the Kaplan–Meier estimator is most commonly used in the biomedical literature. A variance estimator for the survival function is given in appendix B.
In Table 3, the Kaplan–Meier survival estimate ranges from 1 at AIDS diagnosis down to 0.425 at 4.688 years from AIDS diagnosis. For example, the survival at R2 = 0.791 years after AIDS is 0.954 and is obtained as:
In Fig. 2(a), we plot the Kaplan–Meier survival function estimates and point-wise 95% confidence limits. Reading from Table 3 or Fig. 2(a), the first quartile and median times from AIDS diagnosis to death were about 1.6 and 3 years, respectively. The estimated first quartile of mortality of 1.6 years is similar to the 1.78 years (95% confidence limits: 1.29–2.44) reported by Schneider et al. [16] for a similar time period using data from the Multicenter AIDS Cohort study. Figure 2(b) is the estimated cumulative probability of death (i.e., the complement of the estimated survival function), which is often presented in place ofFig. 2(a). Looking at Fig. 2(b), the estimated probability of death at 2 years from AIDS diagnosis is about 37%.
The hazard h(t) at the kth ranked event time can be estimated by a ratio of the number of deaths to the product of the number at risk and the time interval since the prior event. In Table 3, the hazard estimates range from 0.0435 at 0.791 years from AIDS diagnosis to 9.009 at 1.258 years from AIDS diagnosis. For example, the hazard estimate atR2 = 0.791 years after AIDS was obtained as:
A smooth function of the hazard is often preferred, because the individual hazard estimates tend to be unstable. In Fig. 2(c), we plot a kernel smooth estimate of the hazard [3,4,17]. The smoothed hazard appears to decrease between 1.5 and 4 years after AIDS diagnosis. Finally, in Fig. 2(d), we plot the estimator of the cumulative hazard and point-wise 95% confidence limits.
Discussion
In this paper, we discussed standard methods for description of survival data with particular attention to right censoring and left truncation. A series of important caveats are discussed below.
First, in randomized trials, all individuals begin observation at a natural origin, namely, randomization. In observational cohort studies, there often exist multiple possible origins. Some examples include study entry, birth, and disease onset or diagnosis. In such settings, time on study does not carry biologic meaning, unless study enrollment coincides with the origin of interest. Fortunately, the nonparametric estimators described above easily handle left truncation due to late entries. Note that when the data consist of a combination of incident (i.e., W = 0) and prevalent (i.e., W > 0) individuals, one might ignore the prevalent individuals and still obtain consistent estimates of the survival function. However, discarding data from the prevalent cohort may lead to a loss of precision because of the reduced sample size and a potential shortening of the time period over which the survival function can be estimated because prevalent individuals typically have longer survival times.
Second, selection bias may arise whenever we select a subset of individuals for analysis. In our example, we restrict to men alive 1 January 1995 and our results may or may not generalize to men alive in other calendar periods. In a substantive analysis, this and any other restriction would have to be scrutinized before drawing inference about a particular population. Selection bias may also arise if individuals who enter follow-up W years after AIDS diagnosis do so in an informative manner. For example, late entry is informative if the hazard of the event is associated with the entry time [18]. In our example of 36 incident (i.e., W = 0) and 42 prevalent (i.e., W > 0) individuals, one simple way to assess selection bias due to informative late entry is to restrict the analysis to the 36 incident individuals. If we do so in our example, the time to the first quartile of mortality (i.e., 25% dead) is about 1.47 years, which is similar to the 1.6 years in the complete sample (we cannot compare median survival times because by 1 January 1998, less than half of the 36 incident individuals had died). Therefore, the late entry does not appear to be informative in these data. Finally, selection bias may also arise whenever individuals are selected out of the study over time. In our example, 44 individuals are administratively censored on 1 January 1998 and seven individuals are censored at drop out. Administrative censoring may be unrelated to the hazard of death here and in many other examples, but drop out may be informative. For example, in the AIDS example, if men who were ill and more likely to die were also more likely to drop out, then we would obtain an upwardly biased estimate of the survival function. One simple way to assess the possible effects of informative drop out is to calculate bounds [19]. For instance, in the AIDS example, we can estimate the survival function if all seven individuals who dropped out were immediate events, as well as if all seven were still alive at the date of administrative censoring. In our example, the estimated median survival was about 3 years in the observed data, would have been about 2.5 years, had all seven drop outs immediately died, and about 3 years, had all seven been alive on 1 January 1998. Therefore, there is some sensitivity to drop out in these data. In the presence of substantial (e.g., > 20%) drop out, bounds may become uninformative due to their width and a formal sensitivity analysis [20] may be required.
Third, when the origin or event dates are known only up to an interval and the length of a typical interval is sizable compared with the typical time from the origin to event, then survival analysis methods that allow for interval censoring should be employed [21,22]. However, when intervals are small compared to the typical time from the origin to event, simply taking the midpoint of the interval may suffice [23].
Fourth, there may be events not of central interest that compete with the event of interest. Competing risks are events other than the event of interest that remove an individual from the risk set and preclude the event of interest from occurring. This is distinct from preclusion of the event being observed due to censoring. In our motivating example, the event of interest was all-cause mortality and there were no competing risks. However, if the event of interest had been AIDS-related mortality, then any non-AIDS deaths would have been competing risks. To obtain a valid estimate in the presence of competing risks, methods that explicitly allow for competing risks should be employed [24–26]. Failure to use methods tailored to competing risks will typically overestimate the probability of the event of interest occurring. Such overestimation results because standard methods, such as those presented here, would treat competing risks as censored events of interest.
Fifth, we did not present parametric estimators of the hazard or survival function. Parametric estimators assume that the survival time T follows a distribution that is defined using a finite number of parameters. Examples of common parametric survival distributions include the lognormal and Weibull. The widely used nonparametric estimators described here do not assume that the survival time T follows a particular specified distribution. To that end, the nonparametric estimators are robust, in that they allow the distribution to be general. However, when the data do cohere with the shape of a parametric distribution, nonparametric estimators may be less precise than parametric estimators.
Sixth, the assumption that individuals are independent is needed to obtain valid estimates of the variance as given in appendix B. In the infectious disease setting, the assumption that individuals are independent may not hold. For example, when studying the incidence of a respiratory virus, close contact with infected individuals is likely to increase the hazard of infection. Therefore, in such settings, clusters of individuals must be identified and methods that account for clusters should be used [27], and are akin to methods for repeated events within an individual, though repeated events have a natural time ordering [19], whereas infectious disease data often have a natural geographical space ordering in addition to time ordering.
Here we have presented an example of survival data pertinent to infectious disease research and illustrated how to describe event times using graphical displays and nonparametric estimators of the survival function, the hazard function, and cumulative hazard function. The methods presented have broad applicability in infectious disease research. For instance, plots of estimates of these functions are helpful tools when attempting to describe the occurrence and timing of events.
The analysis presented begs the question does survival differ by measured factors? The methods to answer important questions about differences in the survival or hazard function while continuing to account for censoring and truncation typically build upon the methods presented here. For instance, the log-rank test [12] compares the survival function for two or more groups, whereas the Cox proportional hazards regression model [11] compares the hazard function for two or more groups with or without adjustment for concomitant variables. In conclusion, knowledge of the methods presented is central to an understanding of the survival analysis methods used in clinical research generally, and infectious disease research in particular.
Acknowledgement
The present research was supported in part by National Institutes of Health grant P30 AI50410.
References
1. Cox D, Oakes D. Analysis of survival data. London: Chapman and Hall; 1984.
2. Kalbfleisch JD, Prentice RL. The statistical analysis of failure time data. New York: Wiley; 1980.
3. Klein JP, Moeschberger ML. Survival analysis: techniques for censored and truncated data. 2nd ed. New York: Springer; 2003.
4. Allison PD. Survival analysis using the SAS system: a practical guide. Cary, N.C: SAS Institute; 1995.
5. Altman DG, Bland JM. Time to event (survival) data. BMJ 1998; 317:468–469.
6. Clark TG, Bradburn MJ, Love SB, Altman DG. Survival analysis part I: basic concepts and first analyses. Br J Cancer 2003; 89:232–238.
7. Kaslow RA, Ostrow DG, Detels R, Phair JP, Polk BF, Rinaldo CR. The Multicenter AIDS Cohort study: rationale, organization, and selected characteristics of the participants. Am J Epidemiol 1987; 126:310–318.
8. Suissa S. Immortal time bias in pharmaco-epidemiology. Am J Epidemiol 2008; 167:492–499.
9. Lash TL, Cole SR. Immortal person-time in studies of cancer outcomes. J Clin Oncol 2009;27:e55–e56.
10. Cox C, Chu H, Schneider MF, Munoz A. Parametric survival analysis and taxonomy of hazard functions for the generalized gamma distribution. Stat Med 2007; 26:4352–4374.
11. Cox DR. Regression models and life tables. J R Statist Soc (B) 1972; 34:187–220.
12. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. JASA 1958; 53:457–481.
13. Lamarca R, Alonso J, Gomez G, Muñoz A. Left-truncated data with age as time scale: an alternative for survival analysis in the elderly population. J Gerontol A Biol Sci Med Sci 1998; 53:M337–343.
14. Efron B. The two sample problem with censored data. In: 5th Annual Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press; 1967. pp. 831–853.
15. Turnbull BW. The empirical distribution function with arbitrarily grouped, censored and truncated data. J Roy Statist Soc B 1976; 38:290–295.
16. Schneider MF, Gange SJ, Williams CM, Anastos K, Greenblatt RM, Kingsley L, et al. Patterns of the hazard of death after AIDS through the evolution of antiretroviral therapy: 1984–2004. AIDS 2005; 19:2009–2018.
17. Ramlau-Hansen H. Smoothing counting process intensities by means of kernel functions. Ann Stat 1983; 11:453–466.
18. Wang MC, Brookmeyer R, Jewell NP. Statistical models for prevalent cohort data. Biometrics 1993; 49:1–11.
19. Cain LE, Cole SR, Chmiel JS, Margolick JB, Rinaldo CR Jr, Detels R. Effect of highly active antiretroviral therapy on multiple AIDS-defining illnesses among male HIV seroconverters. Am J Epidemiol 2006; 163:310–315.
20. Scharfstein D, Robins JM, Eddings W, Rotnitzky A. Inference in randomized studies with informative censoring and discrete time to event endpoints. Biometrics 2001; 57:404–413.
21. Williamson JM, Satten GA, Hanson JA, Weinstock H, Datta S. Analysis of dynamic cohort data. Am J Epidemiol 2001; 154:366–372.
22. Sun J. The statistical analysis of interval-censored failure time data. Ney York: Springer; 2006.
23. Law CG, Brookmeyer R. Effects of mid-point imputation on the analysis of doubly censored data. Stat Med 1992; 11:1569–1578.
24. Prentice RL, Kalbfleisch JD, Peterson AV, Flournoy N, Farewell VT, Breslow NE. The analysis of failure times in the presence of competing risks. Biometrics 1978; 34:541–554.
25. Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. JASA 1999; 94:496–509.
26. Lau B, Cole SR, Gange SJ. Competing risk regression models for epidemiologic data. Am J Epidemiol 2009; 170:244–256.
27. Wei LJ, Lin DY, Weissfeld L. Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. JASA 1989; 84:1065–1078.
28. Nelson W. Theory and applications of hazard plotting for censored failure data. Technometrics 1972; 14:945–965.
29. Aalen OO. Nonparametric inference for a family of counting processes. Ann Stat 1978; 6:701–726.
30. Therneau TM, Grambsch PM. Modeling survival data: extending the Cox model. New York: Springer; 2000.
31. Tsai WY, Jewell NP, Wang MC. The product limit estimate of a survival curve under right censoring and left truncation. Biometrika 1987; 74:883–886.
Appendix A: formal definitions
The survival function is defined as S(t) = P(T>t), where P(•) is the probability of •. Because the survival function is a probability, it has bounds of S(0) = 1 and S(∞) = 0. S(t) is the complement of the cumulative distribution function F(t), which is the probability that T is less than or equal t, that is, F(t) =P(T ≤ t) = 1−S(t).
The hazard is defined as h(t) = f(t)/S(t), where f(t) is the probability density function, or the slope of the cumulative distribution function F(t) at time t, f(t) = dF(t)/dt. The cumulative hazard function is defined as
where the definite integral is taken from 0 to t.
Observed event times Ti are ranked as R1 < R2 <… <RD′, where D′ is the number of distinct (untied) event times. In our HIV example,
if there are no tied event times, otherwise
For k = 1,…,D′, let Yk be the number of individuals who died at each of the ranked times Rk, or formally
where 1(•) is the indicator function, such that it equals 1 if • is true and 0 otherwise.
Let Nk be the number of individuals at-risk for mortality while under observation at distinct ranked event time Rk, for k = 1,…,D′. Nk is the size of the ‘risk set’ at time Rk and is defined as
Note individuals are immortal for any events between the origin at time 0 and entry at time Wi and are, therefore, not included in the risk set Rk if Rk ≤ Wi. However, individuals who are censored exactly coincident with the event at time Rk (i.e., Ti = Rk and Di = 0) are considered to be at risk at time Rk, and, therefore, are included in the risk set Nk.
A nonparametric estimator of S(t) is the product-limit or Kaplan–Meier estimator [12], which is defined as
where at time t, the product is taken over all ordered events up to time t, or {k:Rk ≤ t}. The hazard h(t) at the distinct event time t = Rk can be estimated by
where Δk = Rk−Rk−1 and R0 = 0. Finally, the cumulative hazard is estimated simply as
An alternative nonparametric estimator of the survival function was given independently by Nelson [28]and Aalen [29] as
is the Nelson–Aalen estimator of the cumulative hazard function. Both the Kaplan–Meier and Nelson–Aalen estimators of the survival and cumulative hazard functions are nonparametric, consistent (i.e., converge in probability to the true value as the sample size tends to infinity), asymptotically normal and asymptotically equivalent [30]. These two approaches may differ in small sample sizes or when there are many ties. The Kaplan–Meier estimator is more commonly used in the biomedical literature.
Appendix B: variance estimators
The most commonly used estimator of the variance for SKM(t) is
which is attributed to Greenwood [3,31]. Approximate point-wise 95% confidence limits can be calculated by
A formula for the variance for SNA(t) is
which was given by Aalen [29]. A formula for the variance for HKM(t) is
which is obtained by the delta method. Alternative variance estimators exist [3].
Keywords:
censoring; cohort studies; survival analysis; time-to-event; truncation
© 2010 Lippincott Williams & Wilkins, Inc.
No hay comentarios:
Publicar un comentario