Survival analysis is a technique that is widely used in research and industry as studies or problems involving waiting time to an event(s) become frequent. Engineers might prefer to use the term reliability analysis while economists might prefer duration analysis or duration modeling for example.

In this article we begin a simple case where interest is in describing the survival experience of a set of cases without the intention of linking the time to event to explanatory variables. The extension, at least the programming level, by including explanatory variables is straightforward.

In this simple case of survival analysis one can either choose between a non-parametric and a parametric method to analyze the data. In a non-parametric case, as opposed to a parametric case, no statistical distributional assumptions are made about the time to event. The common non-parametric method for describing survival time is the Kaplan-Meier (KM) estimator.

The strength of parametric survival analysis, over a non-parametric method, is that the estimated survival curve is smoother because they draw information from the whole data. It is also possible and easier with a parametric approach to perform sophisticated analyses such as adding random effects and using Bayesian methods to pull sources of information.

One challenge that individuals have in survival analysis is to choose between the available parametric survival time distributions. We show how you can do this within the statistical programming language R. Conventionally, the choice is made by fitting candidate distributions and comparing their fit to the data by using some measure of fit plus a graphical assessment of the fit. Here we use a measure of fit referred to as the Akaike information Criterion (AIC), the lower the AIC the better the fit. It is important to note that the AIC allows us to assess relative model goodness of fit, but not absolute model goodness of fit. Just because the model A fits better than model B based on the AIC does not necessarily mean that model A adequately describes the data. For this reason we need to use a graphical method in addition.

There are several packages in R the can be used for survival analysis, we have picked the flexsurv package as it provides the AIC as part of its results, and readily plots survival curve from the KM-estimator overlaid on the selected parametric model which comes in handy for the graphical assessment.

We will explore the exponential, Weibull, log-normal, gamma, gompertz and finally the generalized F distribution on the lung data; this is a dataset in R on the survival in patients with advanced lung cancer from the North Central Cancer Treatment Group.

Let’s start with exponential model, we first load the flexsurv library into R workspace and create the survival object that entails the time to death (in days) and the status (dead or alive) at the end of the study. This object will be used across all models/survival time distributions.

library(flexsurv)
surv<-with(lung, Surv(time, status))

We then fit the exponential model, display the estimates, and plot the fit.

model<-flexsurvreg(surv ~ 1, dist="exp") # fit the exponential model
model

#output
Call:
flexsurvreg(formula = surv ~ 1, dist = "exp")
Estimates:
     est     L95%     U95% 
rate 0.00237 0.00204 0.00276

N = 228, Events: 165, Censored: 63
Total time at risk: 69593
Log-likelihood = -1162.338, df = 1
AIC = 2326.676

#plot
plot(model, ylab="Survival probability", xlab="Time")
legend("topright",legend=c("KM Plot","Fitted"), lty=c(1,1),col=c("black","red"), cex=0.5)

The exponential model has an AIC of 2326.68. The survival plot shows that it would underestimate survival before 400 days and overestimate it above this time (Figure 1).

survival analysis exponential model

Figure 1

 

Let’s fit the Weibull model:

model2<-flexsurvreg(surv ~ 1, dist="weibull") # fit weibull model
model2

#output
Call:
flexsurvreg(formula = surv ~ 1, dist = "weibull")
Estimates:
       est     L95%   U95%
shape   1.32   1.17   1.49
scale 418.00 372.00 469.00

N = 228, Events: 165, Censored: 63
Total time at risk: 69593
Log-likelihood = -1153.851, df = 2
AIC = 2311.702

#plot
plot(model2, ylab="Survival probability", xlab="Time")
legend("topright",legend=c("KM Plot","Fitted"), lty=c(1,1),col=c("black","red"), cex=0.5)

The Weibull model has an AIC of 2311.70. The survival plot shows a very close fit to the observed survival experience (Figure 2).

survival analysis weibull model

Figure 2

 

How about the log-normal model? Here we go:

model3<-flexsurvreg(surv ~ 1, dist="lnorm") # fit the log-normal model
model3

#output
Call:
flexsurvreg(formula = surv ~ 1, dist = "lnorm")
Estimates:
         est   L95%   U95%
meanlog 5.660 5.510 5.820
sdlog   1.100 0.983 1.230

N = 228, Events: 165, Censored: 63
Total time at risk: 69593
Log-likelihood = -1169.269, df = 2
AIC = 2342.538

#plot
plot(model3, ylab="Survival probability", xlab="Time")
legend("topright",legend=c("KM Plot","Fitted"), lty=c(1,1),col=c("black","red"), cex=0.5)

The AIC for the log-normal model is 2342.54, so far higher than the value for the exponential and the Weibull models. The survival plot (Figure 3) shows a poor performance especially after 400 days.

survival analysis log normal model

Figure 3

 

Let’s now test the gamma model for the survival time distribution:

model4<-flexsurvreg(surv ~ 1, dist="gamma") # fit the gamma model
model4

#output
Call:
flexsurvreg(formula = surv ~ 1, dist = "gamma")
Estimates:
       est     L95%     U95%
shape 1.48000 1.23000 1.78000
rate   0.00376 0.00293 0.00482

N = 228, Events: 165, Censored: 63
Total time at risk: 69593
Log-likelihood = -1154.735, df = 2
AIC = 2313.469

#plot
plot(model4, ylab="Survival probability", xlab="Time")
legend("topright",legend=c("KM Plot","Fitted"), lty=c(1,1),col=c("black","red"), cex=0.5)

The AIC is 2313.47. The survival plot (Figure 4) shows a close fit, rivaling that observed with Weibull model albeit with a higher AIC compared to 2311.70 under the Weibull model.

survival analysis gamma model

Figure 4

 

How will the gompertz model perform? Let’s see.

model5<-flexsurvreg(surv ~ 1, dist="gompertz") # fit the gompertz model
model5

#output
Call:
flexsurvreg(formula = surv ~ 1, dist = "gompertz")
Estimates:
       est       L95%     U95% 
shape 0.001390 0.000724 0.002050
rate   0.001670 0.001310 0.002130

N = 228, Events: 165, Censored: 63
Total time at risk: 69593
Log-likelihood = -1155.355, df = 2
AIC = 2314.711

#plot
plot(model5, ylab="Survival probability", xlab="Time")
legend("topright",legend=c("KM Plot","Fitted"), lty=c(1,1),col=c("black","red"), cex=0.5)

Apart from the underestimation of survival in the first 200 days, the Gompertz model fits almost as good as the Weibull model for the rest of the observation window. It has an AIC of 2314.71, slightly higher than the Weibull one and lower than the rest of the models explored so far, except the gamma model.

survival analysis gompertz model

Figure 5

 

Finally we explore the generalized F distribution. This is an umbrella distribution used in survival analysis and models such as the generalized gamma and log-logistic are its special cases.

model6<-flexsurvreg(surv ~ 1, dist="genf") # fit the GenF model
model6

#output
Call:
flexsurvreg(formula = surv ~ 1, dist = "genf")
Estimates:
         est       L95%     U95%
mu      6.04000   5.81000   6.27000
sigma   0.68300   0.48200   0.96700
Q       1.02000   0.40000   1.64000
P       0.54900   0.00773   38.90000

N = 228, Events: 165, Censored: 63
Total time at risk: 69593
Log-likelihood = -1153.577, df = 4
AIC = 2315.153

#plot
plot(model6, ylab="Survival probability", xlab="Time")
legend("topright",legend=c("KM Plot","Fitted"), lty=c(1,1),col=c("black","red"), cex=0.5)

The AIC is 2315.15, only lower relative to the exponential and log-normal models. The graphical assessment show a very close fit across the observation window, similar to that observed for the Weibull distribution. The model is however less imprecise after around 400 days, as observed by the wide confidence intervals (Figure 6).

survival analysis generalized F model

Figure 6

 

In summary, the Weibull model has the lowest AIC among those models explored the graphical assessment of it also supports the fact that it offers the closets fit to the data. It would be the model of choice. The other models that come closer are the gamma, gompertz and the generalized F respectively. As a rule of thumb, if a simpler model fits and describes the data as well as a more complex model, the simpler one should be preferred for the obvious reason of the additional advantage of ease of comprehension.

Did you find this article useful? Do you have any questions? Share your opinions below.