Kaplan-Meier Survival Analysis in Python

Holly Emblem
Jan 3, 2021

Survival analysis is a relatively under-utilised set of statistical methods that applies to many fields, including marketing analytics. Broadly speaking, survival analysis is used to analyse the expected amount of time until an event happens.

Within the field of biostatistics, survival analysis is typically used to model time to death or another patient outcome. However, the same methods can model the time to other events, such as a purchase, churn or even a friend request.

Until around 2010–2015, Python implementations of survival analysis methods were hard to come by. That has now changed with the addition of the excellent lifelines Python library. Today, we’ll walk through a brief introduction to why we perform survival analysis, and then code our own Kaplan-Meier model.

Why use survival analysis?

Suppose we work on a platform which encourages users to create content. However, we notice that some users aren’t creating content and are merely browsing. We might want to make some inferences about how long it will take users to create content. Here, the event we are interested in modelling is ‘created content’, and we will have a time variable, which we can consider as hours on platform.

So far so good, but why can’t we just take the average hours on platform before first content creation, across all users who created content? This is a common ‘trap’ that analysts fall into when modelling time to event. We shouldn’t throw away data about users who haven’t made their first post, as it still tells us they have gone at least that long without posting. More worryingly, throwing away that data would bias our results, as we’d be under-estimating the true average time to event.

We can consider the individuals for whom we don’t have a recorded event time as right censored: either they had not experienced the event by the end of the study, or they withdrew before the end of the study. In marketing analytics, we typically think more about the first kind of right censoring; the event could happen in the future, but for the data period we have, it hasn’t happened.
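
In practice, right censoring shows up as two columns per user: a duration and an event flag. The snippet below is a minimal sketch of deriving them from raw timestamps. The user records, the `OBSERVATION_END` cutoff and the helper name are all hypothetical, and wall-clock hours since signup stand in for "hours on platform".

```python
from datetime import datetime

# Hypothetical end of the data period; anyone who has not posted by now is censored.
OBSERVATION_END = datetime(2021, 1, 3, 0, 0)

# Hypothetical raw records: (user_id, signup_time, first_post_time or None).
raw_users = [
    ("u1", datetime(2021, 1, 1, 9, 0), datetime(2021, 1, 1, 12, 0)),
    ("u2", datetime(2021, 1, 2, 18, 0), None),  # has not posted yet
    ("u3", datetime(2021, 1, 1, 0, 0), datetime(2021, 1, 2, 0, 0)),
]


def to_duration_and_event(signup, first_post, observation_end):
    """Return (hours observed, event flag) for one user.

    The flag is 1 if the first post was observed and 0 if the user is
    right censored, in which case the duration runs to the end of the data.
    """
    end = first_post if first_post is not None else observation_end
    hours = (end - signup).total_seconds() / 3600
    return hours, int(first_post is not None)


rows = [to_duration_and_event(s, p, OBSERVATION_END) for _, s, p in raw_users]
print(rows)
```

The censored user still contributes a duration: we know they went at least that long without posting.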

Survival analysis gives us a way of using the information from right censored individuals to inform our model. This makes it more reliable than simply averaging time to event over only those who experienced the event.
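
To see the bias from dropping censored users, here is a small simulation sketch. Everything in it is an assumption made for illustration: times to first post are drawn from an exponential distribution with a mean of 10 hours, and we stop observing after 8 hours.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_MEAN_HOURS = 10.0  # assumed average time to first post
CUTOFF_HOURS = 8.0      # assumed length of the observation window

# Simulate each user's true time to first post.
true_times = [random.expovariate(1 / TRUE_MEAN_HOURS) for _ in range(10_000)]

# We only record an event time for users who post before the window closes.
observed_event_times = [t for t in true_times if t <= CUTOFF_HOURS]

# Naive approach: average only the event times we happened to observe.
naive_mean = sum(observed_event_times) / len(observed_event_times)

# Average over every simulated user, which we could never compute on real data.
sample_mean = sum(true_times) / len(true_times)

print(f"mean over all users:            {sample_mean:.2f}")
print(f"naive mean over observed only:  {naive_mean:.2f}")
```

The naive mean can never exceed the cutoff, however long users really take, so it is pulled below the true average.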

Kaplan-Meier in Python

The Kaplan-Meier estimator, named after its developers Edward L. Kaplan and Paul Meier, is a non-parametric method for estimating the survival function from lifetime data. With just a few lines of code, we can fit our own Kaplan-Meier curve and review the results.
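
For reference, the estimator is usually written as the product below. The symbols are notation introduced here for illustration: t_i are the distinct times at which at least one event occurred, d_i is the number of events at t_i, and n_i is the number of individuals still at risk (neither censored nor having experienced the event) just before t_i.

```latex
\hat{S}(t) = \prod_{i:\, t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)
```

Censored individuals never add to d_i, but they do count towards n_i until the time they are censored, which is how their information enters the estimate.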

We’ll create some dummy data, using our content platform example, and walk through the inputs and results.

Kaplan-Meier Code

First, we need to import our relevant libraries. We’ll be using Numpy, Pandas and Lifelines.

As we’re using Jupyter notebooks, we’ll also include ‘%matplotlib inline’ to print our plots in Jupyter:

# Import relevant libraries
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter
# Useful for printing plots in Jupyter
%matplotlib inline

Next we’ll create some dummy data. You can also access some sample datasets via lifelines.

# Create dummy time and event status, with 1 being event happened
te = np.array([
    [1,0],[1,0],[1,1],[3,1],[3,1],[4,0],[4,1],[4,1],[5,1],[5,1],
    [6,1],[6,0],[7,1],[7,1],[7,1],[8,1],[8,1],[9,1],[9,0],[10,1],
    [10,1],[14,1],[14,1],[14,1],[21,0],[21,1],[21,1],[20,0],[20,1],[18,1],
    [18,1],[25,1],[25,0],[25,1],[25,1],[26,1],[26,1],[27,1],[23,1],[23,1],
])
# 1 is posted, 0 is did not post (1 = event happened)
# For legibility, move to Pandas dataframe
df = pd.DataFrame({'T': te[:, 0], 'E': te[:, 1]})
# Pull out time and event data
T = df['T']
E = df['E']

We’ll then call the Kaplan-Meier Fitter and fit the model to our variables:

# Call fitter and fit with time and event data
kmf = KaplanMeierFitter()
kmf.fit(T, E)

We’ll also create our survival function and cumulative density properties, as well as plotting them:

# Create survival function and cumulative density data
kmf.survival_function_
kmf.cumulative_density_
# Plot survival function
kmf.plot_survival_function()
# Plot cumulative density
kmf.plot_cumulative_density()

[Figures: survival function plot; cumulative density (CDF) plot]

We can also interpret these as Pandas dataframes:

#View cumulative density as a Pandas dataframe
kmf.cumulative_density_.head()

Survival Function and Cumulative Density Function (CDF)

In this post we haven’t delved too deeply into the inner workings of survival analysis, but it is important to summarise what these properties offer us. The survival function gives us the probability that the event has not occurred by time t. In our example it is the probability that the user has not posted (in our event column, 1 means posted and 0 means not posted).

The cumulative density is the complement of the survival function and tells us the probability that the event has occurred by time t. Reading our plots, we can see from the CDF that by 10 hours there is roughly a 44% probability that the event (posting) has occurred. Of course, this is an estimate with some uncertainty, as shown by the shaded bands around the blue lines. The complement of this, the survival function, shows us there is roughly a 56% probability that the event has not occurred by 10 hours.
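
To get the exact value at 10 hours rather than reading it off the plot, the estimate can be recomputed by hand. The snippet below is a from-scratch sketch of the Kaplan-Meier product in plain Python, not the lifelines implementation; the helper names `kaplan_meier` and `survival_at` are made up for this example, and the data is copied from the dummy dataset above.

```python
from collections import Counter

# (hours, event) pairs copied from the dummy data; 1 = posted, 0 = censored.
te = [
    (1, 0), (1, 0), (1, 1), (3, 1), (3, 1), (4, 0), (4, 1), (4, 1),
    (5, 1), (5, 1), (6, 1), (6, 0), (7, 1), (7, 1), (7, 1), (8, 1),
    (8, 1), (9, 1), (9, 0), (10, 1), (10, 1), (14, 1), (14, 1), (14, 1),
    (21, 0), (21, 1), (21, 1), (20, 0), (20, 1), (18, 1), (18, 1), (25, 1),
    (25, 0), (25, 1), (25, 1), (26, 1), (26, 1), (27, 1), (23, 1), (23, 1),
]


def kaplan_meier(pairs):
    """Return [(time, survival probability)] at each distinct event time."""
    events = Counter(t for t, e in pairs if e == 1)
    leaving = Counter(t for t, _ in pairs)  # events and censorings both leave the risk set
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = events.get(t, 0)
        if d:
            # Multiply by the share of at-risk users who did not post at time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Users censored at t were still at risk at t, so remove them afterwards.
        at_risk -= leaving[t]
    return curve


def survival_at(curve, t):
    """Step-function lookup: the value at the last event time <= t."""
    s = 1.0
    for time, value in curve:
        if time > t:
            break
        s = value
    return s


curve = kaplan_meier(te)
s10 = survival_at(curve, 10)
print(f"S(10) = {s10:.3f}, 1 - S(10) = {1 - s10:.3f}")
# prints: S(10) = 0.560, 1 - S(10) = 0.440
```

Because the curve is a step function, the value at 10 hours already includes the two posts recorded at exactly 10 hours.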

Taking Survival Analysis Further

This is quite a simple introduction to survival analysis, and as with any topic there is much more to discover. In particular, you might be interested in regression methods, which help understand how different variables can impact survival rates.

This post originally appeared at ApCoder.com, which offers Python, data science and analytics code tutorials and tips.

Resources

https://tinyheero.github.io/2016/05/12/survival-analysis.html

https://data.princeton.edu/wws509/notes/c7s1

https://www.statsdirect.com/help/survival_analysis/kaplan_meier.htm

Image: https://www.flickr.com/photos/paul2807/23204737370/

Holly Emblem

Head of Insights at Rare, an Xbox Game Studio. Previous experience as a data scientist and lead. Interested in deep learning, quantum computing and statistics.