How long do cancer patients survive after treatment? When do machine components fail? How quickly do customers churn from a subscription service? These questions involve time-to-event data, the time until something happens, which requires specialized statistical methods. Standard regression handles such data poorly because, for some subjects, the event has not yet occurred when observation ends, so only a lower bound on the event time is known (a right-censored observation). Dropping those subjects or treating their last observed time as an event time biases the results. Survival analysis is built to use this partial information, and its tools are standard in clinical medicine and reliability engineering and increasingly common in business analytics.
The Survival Function
The survival function S(t) = P(T > t) gives the probability of surviving (not experiencing the event) past time t. For a newly enrolled clinical trial participant, with t measured in years, S(t) is the probability of surviving beyond t years. S(0) = 1 (everyone starts event-free), and S(t) never increases with t, typically falling toward 0 as t grows. The hazard function h(t) = −S'(t)/S(t) measures the instantaneous event rate given survival to time t: the rate at which events occur among those who haven't yet experienced them. The hazard completely characterizes the survival distribution, because integrating it recovers the survival function: S(t) = exp(−∫₀ᵗ h(u) du).
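The link between hazard and survival can be checked numerically. The following is a minimal sketch, assuming the hazard is available as a Python function; the two hazards and the helper name survival_from_hazard are made up for illustration. It integrates the hazard with the midpoint rule and exponentiates the negative cumulative hazard.

```python
import math

def survival_from_hazard(hazard, t, steps=10_000):
    """Approximate S(t) = exp(-integral from 0 to t of h(u) du) with the midpoint rule."""
    width = t / steps
    cumulative_hazard = sum(hazard((k + 0.5) * width) for k in range(steps)) * width
    return math.exp(-cumulative_hazard)

# Hypothetical hazards, chosen only for illustration.
def constant_hazard(u):
    return 0.2          # 0.2 events per unit time, regardless of age

def rising_hazard(u):
    return 0.1 * u      # risk grows linearly with time

s_const = survival_from_hazard(constant_hazard, 5.0)   # closed form: exp(-0.2 * 5)
s_rising = survival_from_hazard(rising_hazard, 4.0)    # closed form: exp(-0.05 * 4**2)

print(f"constant hazard, S(5) ~ {s_const:.4f}")
print(f"rising hazard,   S(4) ~ {s_rising:.4f}")
```

Both results can be compared against the closed forms noted in the comments, which follow from integrating each hazard by hand.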
The Kaplan-Meier Estimator
The Kaplan-Meier (KM) estimator constructs a nonparametric survival curve directly from data, handling censoring correctly. At each distinct event time t_i, let d_i be the number of events at t_i and n_i the number at risk just before t_i (subjects who have neither experienced the event nor been censored before t_i). The KM estimate is the running product of the conditional survival fractions: Ŝ(t) = Π_{t_i ≤ t} (1 − d_i/n_i). Censored observations contribute to the risk set up to their censoring time and are excluded afterward, so the curve steps down only at event times. Comparing KM curves for two treatment groups using the log-rank test is one of the most common analyses in clinical trials.
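The product formula is short enough to compute by hand. Below is a minimal sketch in plain Python, not a replacement for a tested library; the six-subject data set is made up, and the code assumes the usual convention that a subject censored at time t is still counted as at risk for events occurring at t.

```python
def kaplan_meier(durations, observed):
    """Return [(event_time, survival_estimate)] for right-censored data.

    durations: follow-up time for each subject
    observed:  True if the event occurred at that time, False if censored
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # n_i: everyone whose follow-up reaches t (ties with censoring count as at risk)
        at_risk = sum(1 for d in durations if d >= t)
        # d_i: events occurring exactly at t
        events = sum(1 for d, e in zip(durations, observed) if e and d == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve

# Made-up data: six subjects, two of them censored (observed=False).
durations = [2, 3, 3, 5, 7, 8]
observed = [True, True, False, True, False, True]

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.3f}")
```

In this example the subject censored at time 7 never produces a step in the curve, but does shrink the risk set for the final event at time 8.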
The Cox Proportional Hazards Model
The Cox model is survival analysis's workhorse regression tool. It models the hazard for individual i as: h_i(t) = h_0(t) · exp(β₁x₁ + β₂x₂ + …), where h_0(t) is an unspecified baseline hazard and x₁, x₂, … are that individual's covariates. The proportional hazards assumption means covariates scale the hazard multiplicatively: two individuals with different (time-fixed) covariates have hazard functions that are constant multiples of each other at all times. Each exp(β_k) is therefore a hazard ratio, the factor by which the hazard changes for a one-unit increase in x_k with the other covariates held fixed. Crucially, h_0(t) need not be specified. Cox's partial likelihood allows estimating the β coefficients without modeling the baseline hazard, making the model semiparametric.
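To see why the baseline hazard drops out, it helps to write the partial likelihood down. The sketch below assumes a single covariate and made-up data, and it uses the Breslow form (each tied event is compared against the full risk set at that time). The grid search is a deliberately crude stand-in for the Newton-type optimizers that statistical packages typically use.

```python
import math

def cox_partial_log_likelihood(beta, durations, observed, x):
    """Log partial likelihood for one covariate.

    Each observed event contributes
        beta * x_i - log(sum of exp(beta * x_j) over subjects still at risk),
    so the baseline hazard h_0(t) cancels and never appears.
    """
    total = 0.0
    for t_i, e_i, x_i in zip(durations, observed, x):
        if not e_i:
            continue  # censored subjects appear only in other subjects' risk sets
        risk_sum = sum(math.exp(beta * x_j)
                       for t_j, x_j in zip(durations, x) if t_j >= t_i)
        total += beta * x_i - math.log(risk_sum)
    return total

# Made-up data: x = 1 marks a hypothetical higher-risk group.
durations = [1, 2, 3, 4, 5, 6, 7, 8]
observed = [True, True, True, True, False, True, True, False]
x = [1, 1, 0, 1, 0, 1, 0, 0]

# Crude grid search over beta in [-3, 3]; illustration only.
grid = [k / 100 for k in range(-300, 301)]
best_beta = max(grid, key=lambda b: cox_partial_log_likelihood(b, durations, observed, x))
hazard_ratio = math.exp(best_beta)

print(f"beta ~ {best_beta:.2f}, hazard ratio ~ {hazard_ratio:.2f}")
```

Because the x = 1 subjects tend to fail earlier in this toy data set, the estimated coefficient is positive and the hazard ratio exceeds 1.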
Parametric Models
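When the hazard function's shape is known or assumed, parametric survival models offer efficiency. The exponential distribution assumes a constant hazard, h(t) = λ, which makes it memoryless. The Weibull distribution allows an increasing (β > 1) or decreasing (β < 1) hazard: h(t) = λβt^{β−1}, where β here is the Weibull shape parameter rather than a Cox regression coefficient, and β = 1 recovers the exponential model. The log-normal distribution has a hazard that rises to a peak and then declines, and the log-logistic distribution does the same when its shape parameter exceeds 1 (otherwise its hazard decreases monotonically). Both are appropriate where risk peaks and then declines, as with postoperative mortality. Parametric models enable extrapolation beyond observed follow-up time, which is essential for health economic models requiring lifetime projections, although the extrapolated values depend heavily on the assumed distribution.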
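The effect of the Weibull shape parameter is easy to tabulate. This is a small sketch using the parameterization given above, with a hypothetical rate of 0.1; the survival function S(t) = exp(−λ t^shape) follows from integrating the hazard.

```python
import math

def weibull_hazard(t, rate, shape):
    """h(t) = rate * shape * t**(shape - 1)."""
    return rate * shape * t ** (shape - 1)

def weibull_survival(t, rate, shape):
    """S(t) = exp(-rate * t**shape), obtained by integrating the hazard."""
    return math.exp(-rate * t ** shape)

rate = 0.1                   # hypothetical value for illustration
times = (1.0, 2.0, 4.0)

hazards_by_shape = {
    shape: [weibull_hazard(t, rate, shape) for t in times]
    for shape in (0.5, 1.0, 2.0)
}

for shape, hazards in hazards_by_shape.items():
    formatted = ", ".join(f"{h:.3f}" for h in hazards)
    print(f"shape = {shape}: h(t) at t = 1, 2, 4 -> {formatted}")
```

A shape of 1 gives the flat exponential hazard, a shape below 1 gives a hazard that falls over time, and a shape above 1 gives one that rises.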
Business Applications
Survival analysis has moved beyond its medical origins. Customer churn modeling uses Cox regression to estimate how quickly customers cancel subscriptions and which covariates speed cancellation up or slow it down; the 'event' is cancellation, and customers who are still active at the analysis date are treated as censored. Time-to-purchase analysis in e-commerce models the time between a customer's first visit and first purchase. Employee retention analysis identifies covariates associated with faster departure. A/B testing of subscription features uses log-rank tests to compare churn curves between groups. The business adaptation requires reinterpreting clinical concepts: 'survival' becomes 'retention,' and 'censored' becomes 'still active.'
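In practice, the first step in a churn analysis is turning subscription records into (duration, event) pairs. The sketch below assumes each customer has a signup date and an optional cancellation date, and that the analysis uses a fixed snapshot date; the function name, field layout, and dates are all hypothetical.

```python
from datetime import date

def to_survival_record(signup, cancelled, snapshot):
    """Return (duration_in_days, event_observed) for one subscriber.

    A cancellation on or before the snapshot date is an observed event.
    Anyone else is censored at the snapshot: all we know is that they
    stayed at least that long.
    """
    if cancelled is not None and cancelled <= snapshot:
        return (cancelled - signup).days, True
    return (snapshot - signup).days, False

snapshot = date(2024, 6, 30)  # hypothetical analysis date

# Hypothetical customers: (signup date, cancellation date or None).
customers = [
    (date(2024, 1, 1), date(2024, 1, 31)),   # cancelled before the snapshot
    (date(2024, 6, 1), None),                # still active
    (date(2024, 6, 10), date(2024, 7, 5)),   # cancels after the snapshot
]

records = [to_survival_record(s, c, snapshot) for s, c in customers]
for duration, event in records:
    print(f"{duration} days, {'cancelled' if event else 'censored'}")
```

The third customer shows why the snapshot date matters: a cancellation that happens after the analysis date must be treated as censoring, not as an event.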
Conclusion
Survival analysis solves a fundamental statistical challenge: how to analyze time-to-event data when many observations are incomplete. The Kaplan-Meier estimator handles censoring nonparametrically; the Cox model extends regression to survival outcomes while avoiding distributional assumptions about the baseline hazard. These tools have become indispensable in clinical research, reliability engineering, and business analytics wherever the question is not just 'did the event happen?' but 'when did it happen, and what predicts the timing?'