# Load Packages
library(tidyverse) |> suppressPackageStartupMessages()
library(knitr)
library(survival)
library(ggsurvfit)
library(gt)

Introduction to Survival Analysis
Modeling Medical Time-to-Event Data Using R Software Packages
1 Survival Analysis
1.1 Survival Analysis Background
Survival analysis is a set of statistical methods for analyzing the time until an event of interest occurs for individuals. In medical applications, the most common event analyzed is death, but many other events can be analyzed, such as the time it takes to begin recovering from treatment or the time until a disease is contracted. These analyses are especially helpful when comparing groups of people, such as treatment groups in a clinical trial. Survival analysis can reveal whether certain treatments are more effective than others in helping individuals live longer or avoid certain outcomes of interest, such as heart attacks.
Survival analysis can be used for any time-to-event data, not just medical data. Disciplines that use it include epidemiology, finance, engineering, marketing, insurance, and more. For example, in marketing, survival analysis can be used to predict how long a customer will remain a customer. In epidemiology, it can be used to predict the time until disease recurrence. In engineering, it can be used to predict how long a machine will last. We will focus on medical applications in this paper. Let’s begin by creating a simple data set.
1.2 Example data set in R
To demonstrate survival analysis, an example data set was created including ten observations with the following columns:
- \(id\): number of each observation ranging from 1 to 10
- \(time\): time variable ranging from 1 to 10
- \(status\): 1 if the event occurs at that time for that individual, 0 if no event occurs
For those assigned a status of 1, a survival time in the range of 1 to 9 was assigned, representing the time that individual survived until the event occurred for them. For those assigned a status of 0, a time of \(t = 10\) was assigned, indicating that they lasted until the end of the hypothetical study without the event occurring. Table 1.1 shows these data.
# Create a simple data set
id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
time <- c(1, 2, 4, 5, 6, 8, 9, 10, 10, 10)
status <- c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0)
surv <- data.frame(id, time, status)
surv %>% gt(caption = "Example Data Set with ID, Status, and Time") %>%
  cols_label(id = "ID", time = "Time", status = "Status")

| ID | Time | Status |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 4 | 1 |
| 4 | 5 | 1 |
| 5 | 6 | 1 |
| 6 | 8 | 1 |
| 7 | 9 | 1 |
| 8 | 10 | 0 |
| 9 | 10 | 0 |
| 10 | 10 | 0 |
To perform survival analysis on these data, the probability of survival at each time from \(t = 0\) to \(t = 10\) can be calculated by dividing the number of people surviving until that time by the total number of people in the study. This simple proportion is valid here because no participant leaves the study before \(t = 10\). At time \(t = 0\), the probability of survival is equal to 1, or 100%, because every participant would presumably begin the study alive. To calculate the probability of survival for time \(t = 1\), the number of participants who had not experienced the outcome of interest by \(t = 1\) is divided by 10, the total number of observations in the study. In this case, 9 participants survived until time \(t = 1\), because only one observation experienced the event (a status of 1) between times \(t = 0\) and \(t = 1\). Dividing this count by the 10 total observations in the study shows that the probability of surviving to time 1 is 0.9, or 90%. As time increases, the probability of survival for the group can only stay the same or decrease, moving from 1 toward 0, because more people will have experienced the outcome. This is why survival curves start at 1 and step downward over time. Table 1.2 shows the survival probabilities at each time point.
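The same calculation can be written compactly. As a sketch, let \(n\) be the number of participants and \(d(t)\) the number of events observed up to and including time \(t\); this notation is introduced here for illustration only and does not come from the original text. Assuming nobody leaves the study before time \(t\), the estimated survival probability is

\[
\hat{S}(t) = \frac{n - d(t)}{n}, \qquad \text{for example} \qquad \hat{S}(1) = \frac{10 - 1}{10} = 0.9 .
\]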
# Calculate survival probabilities
probabilities <- data.frame(
  # Events occur at times 1, 2, 4, 5, 6, 8, and 9, so the count drops
  # by one at each of those times and is unchanged at times 3, 7, and 10
  Number_alive = c(10, 9, 8, 8, 7, 6, 5, 5, 4, 3, 3),
  Time = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Probability = c(1, 0.9, 0.8, 0.8, 0.7, 0.6, 0.5, 0.5, 0.4, 0.3, 0.3))
probabilities %>% gt(caption = "Probability of Survival Based on Time") %>%
  cols_label(Number_alive = "Number alive",
             Time = "Time", Probability = "Probability")

| Number alive | Time | Probability |
|---|---|---|
| 10 | 0 | 1.0 |
| 9 | 1 | 0.9 |
| 8 | 2 | 0.8 |
| 8 | 3 | 0.8 |
| 7 | 4 | 0.7 |
| 6 | 5 | 0.6 |
| 5 | 6 | 0.5 |
| 5 | 7 | 0.5 |
| 4 | 8 | 0.4 |
| 3 | 9 | 0.3 |
| 3 | 10 | 0.3 |
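Typing these counts by hand is error-prone, so it may help to derive them from the `time` and `status` columns instead. The following is a minimal sketch in base R. The names `grid`, `number_alive`, and `probabilities_check` are illustrative and not part of the original analysis, and the approach assumes that no participant is censored before the end of the study, as is the case in this example.

```r
# Example data from Table 1.1
surv <- data.frame(
  id     = 1:10,
  time   = c(1, 2, 4, 5, 6, 8, 9, 10, 10, 10),
  status = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0)
)

# Time points at which the survival probability is evaluated
grid <- 0:10

# For each time t, count the participants who have NOT yet had the event,
# i.e. everyone except those with status 1 and an event time at or before t
number_alive <- sapply(grid, function(t) {
  sum(!(surv$status == 1 & surv$time <= t))
})

# Survival probability = number still event-free / total number of participants
probabilities_check <- data.frame(
  Time         = grid,
  Number_alive = number_alive,
  Probability  = number_alive / nrow(surv)
)
probabilities_check
```

Because the counts are computed from the data, a change to the `time` or `status` values carries through to the probabilities automatically.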
1.3 Censoring Background
One issue with survival analysis data is that an individual may be lost to follow-up, meaning data could not be collected for them after some point during the study. The outcome of interest is then not recorded for that individual, so their exact survival time is unknown. When this happens, the survival times for those individuals are recorded as censored, and standard analysis techniques become inappropriate for these data. There are three types of censoring that can occur.

The first type is called right censoring, which happens if an individual begins the study but is lost to follow-up before it ends. The censored survival time for these individuals is equal to the total time they were known to be alive in the study’s observation period before they were lost to follow-up. Common examples include an individual moving away and no longer being able to participate, or an individual dying from an unrelated cause after the study begins.

Another type is called left censoring. Here the individual experiences the event of interest before the observation period begins, so the event time is known only to be earlier than the start of observation. This would happen, for example, in a study that tracks patient recovery from a surgery but whose observation period begins one month after the surgery took place. If a patient died less than a month after surgery, their survival time would need to be left censored.

The final type is called interval censoring, which happens when a patient comes in and out of the study, making it possible for them to experience the outcome of interest during a period when they are not being observed. This often happens when recurrence is being tracked. For example, suppose recurrence of cancer is being tracked and the study checks in with patients every month. If an individual does not have cancer after the first month but does after the second month, the recurrence time is somewhere between one and two months, and it therefore needs to be interval censored (Collett 2003). Of the three types, right censoring is the most common and will be demonstrated in the next example data set, used in Section 2.3.
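To make the idea of a censored time concrete, the sketch below shows one way censoring can be encoded with `Surv()` from the survival package, using the `time` and `status` values from Table 1.1. The interval-censored values (`last_clear`, `first_seen`) are hypothetical numbers made up for illustration; they are not data from this paper.

```r
library(survival)

# Example data: status = 1 means the event was observed at that time,
# status = 0 means the time is right censored (no event seen by then)
time   <- c(1, 2, 4, 5, 6, 8, 9, 10, 10, 10)
status <- c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0)

# Surv() pairs each time with its event indicator; with two arguments
# it treats the data as right censored by default
right_cens <- Surv(time, status)
print(right_cens)  # right-censored times are printed with a trailing "+"

# Hypothetical interval-censored data (illustration only): recurrence was
# absent at the `last_clear` check-in and present at the `first_seen` check-in
last_clear <- c(1, 2, 3)
first_seen <- c(2, 3, 5)
interval_cens <- Surv(last_clear, first_seen, type = "interval2")
print(interval_cens)
```

The three participants who reached \(t = 10\) without the event are the ones stored as censored in `right_cens`.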
1.4 Paper Outline
The rest of the paper will be divided into the following sections:
- Kaplan-Meier Survival Curves: This section will introduce how to model survival probabilities using the Kaplan-Meier survival estimate. It will walk through how to plot survival curves in R and how to run a log-rank test for a difference in survival curves, and will then apply these methods to two sets of case study data.
- Cox Proportional Hazards Regression: This section will introduce modeling survival hazard rates using Cox proportional hazards regression. It will walk through how to run a Cox proportional hazards regression in R and how to interpret the output, and will then apply these ideas to the same two case studies.
- Discussion: This section will summarize the paper findings.