Maximum Likelihood Estimation: Fitting Models to Data

A quality control engineer at a battery factory wants to know the average lifespan of their product. She can't test every battery: the factory produces millions. So she tests 200 batteries and records when each one dies. From those 200 numbers, she needs to estimate the true average lifespan of all batteries the factory will ever produce. Maximum likelihood estimation (MLE) is a mathematical method that gives her a principled estimate, along with a clear justification for why that estimate is a good one.

The Core Idea: Most Likely Explains the Data

Instead of asking "what is the true parameter?" — a question we can't answer directly — maximum likelihood asks: "given the data we observed, which parameter value would make this data most likely to occur?" The best estimate is the parameter that maximizes the probability of seeing exactly what we saw.

For the battery example: if battery lifespans follow an exponential distribution (a common model for failure times), that distribution has one parameter — the average lifespan μ. Some values of μ make our observed data very probable; others make it nearly impossible. MLE finds the μ that makes the observed data most probable.

The Likelihood Function

The likelihood function L(μ) measures how probable our entire dataset would be if the true average lifespan were μ. Suppose the 200 observed lifespans are x₁, x₂, ..., x₂₀₀, each follows the exponential distribution, and the batteries fail independently of one another. Then the probability of seeing all 200 values is the product of their individual probabilities (strictly speaking, probability densities, since lifespan is a continuous quantity):

Likelihood: L(μ) = P(x₁|μ) × P(x₂|μ) × ... × P(x₂₀₀|μ)

For the exponential distribution: P(xᵢ|μ) = (1/μ) · e^(-xᵢ/μ)

In plain English: L(μ) is high when the chosen μ makes all the observed lifespans feel "expected." It's low when μ makes the observed data seem like a bizarre coincidence.
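To make this concrete, here is a minimal Python sketch that evaluates the likelihood at a few candidate values of μ. The five lifespans are made up for illustration (they are not the engineer's data), and the function names are hypothetical.

```python
import math

def exponential_density(x, mu):
    """Density of an exponential distribution with mean mu, evaluated at x."""
    return (1.0 / mu) * math.exp(-x / mu)

def likelihood(data, mu):
    """L(mu): the product of the individual densities."""
    result = 1.0
    for x in data:
        result *= exponential_density(x, mu)
    return result

# Hypothetical lifespans in hours, invented for this example.
lifespans = [310.0, 520.0, 450.0, 700.0, 420.0]

# Compare a mean that is too small, one near the data, and one that is too large.
for mu in (100.0, 480.0, 2000.0):
    print(f"mu = {mu:6.0f}  L(mu) = {likelihood(lifespans, mu):.3e}")
```

Of the three candidates, the one closest to the observed lifespans should produce the largest likelihood, while the values that are far too small or far too large should score noticeably lower.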

In practice, we maximize the log of the likelihood — taking the log turns the product into a sum, which is much easier to work with mathematically. Since log is a monotonically increasing function, the μ that maximizes log L(μ) is the same μ that maximizes L(μ).
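There is also a practical, numerical reason to prefer the log. The sketch below uses simulated data as a stand-in for the 200 tested batteries (the true mean of 480 hours and the random seed are assumptions chosen for illustration). Multiplying 200 small densities underflows ordinary floating-point numbers, while the sum of logs stays well behaved.

```python
import math
import random

def likelihood(data, mu):
    """Direct product of exponential densities."""
    result = 1.0
    for x in data:
        result *= (1.0 / mu) * math.exp(-x / mu)
    return result

def log_likelihood(data, mu):
    """log L(mu) = -n*log(mu) - sum(x)/mu for the exponential model."""
    return -len(data) * math.log(mu) - sum(data) / mu

random.seed(0)
# Simulated lifespans; expovariate takes the rate, which is 1 / mean.
lifespans = [random.expovariate(1.0 / 480.0) for _ in range(200)]

print(likelihood(lifespans, 480.0))      # the product underflows to 0.0
print(log_likelihood(lifespans, 480.0))  # a finite negative number
```

Because every factor in the product is at most 1/480, the product of 200 of them is far smaller than the smallest positive double, so it rounds to zero and carries no information about μ. The log-likelihood does not have this problem.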

The Answer: Just the Average

For exponential lifespans, taking the derivative of log L(μ) with respect to μ, setting it to zero, and solving gives a beautifully simple answer: the maximum likelihood estimate of μ is the sample mean, the average of all 200 observed lifespans. If the batteries averaged 480 hours in the test, the estimate is 480 hours. The result matches intuition, but it is derived from the model rather than assumed. For other distributions, the MLE can be a more complex formula, and sometimes it has no closed form at all and must be found numerically.
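For readers who want the intermediate steps, the derivation can be written out as follows, with n observations and the exponential density (1/μ)·e^(-xᵢ/μ) used earlier. The last line checks that the stationary point is a maximum rather than a minimum.

```latex
\begin{align*}
\log L(\mu)
  &= \sum_{i=1}^{n} \log\!\left( \frac{1}{\mu}\, e^{-x_i/\mu} \right)
   = -n \log \mu - \frac{1}{\mu} \sum_{i=1}^{n} x_i \\
\frac{d}{d\mu} \log L(\mu)
  &= -\frac{n}{\mu} + \frac{1}{\mu^{2}} \sum_{i=1}^{n} x_i = 0 \\
\hat{\mu}
  &= \frac{1}{n} \sum_{i=1}^{n} x_i \\
\left. \frac{d^{2}}{d\mu^{2}} \log L(\mu) \right|_{\mu = \hat{\mu}}
  &= \frac{n}{\hat{\mu}^{2}} - \frac{2 n \hat{\mu}}{\hat{\mu}^{3}}
   = -\frac{n}{\hat{\mu}^{2}} < 0
\end{align*}
```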

MLE for exponential distribution: μ̂ = (x₁ + x₂ + ... + xₙ) / n = sample mean

For our engineer: μ̂ = (sum of 200 battery lifespans) / 200
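The closed-form answer can be checked numerically. The following sketch simulates 200 lifespans (again assuming a true mean of 480 hours, purely for illustration) and searches a grid of candidate means by brute force for the one with the highest log-likelihood. The grid range and step size are arbitrary choices.

```python
import math
import random

def log_likelihood(data, mu):
    """log L(mu) for the exponential model with mean mu."""
    return -len(data) * math.log(mu) - sum(data) / mu

random.seed(1)
lifespans = [random.expovariate(1.0 / 480.0) for _ in range(200)]
sample_mean = sum(lifespans) / len(lifespans)

# Candidate means from 1.0 to 2000.0 hours in steps of 0.1 hours.
candidates = [k / 10.0 for k in range(10, 20001)]
best_mu = max(candidates, key=lambda mu: log_likelihood(lifespans, mu))

print(f"sample mean    = {sample_mean:.2f}")
print(f"grid maximizer = {best_mu:.1f}")
```

The grid maximizer should agree with the sample mean to within the grid spacing, which is what the formula predicts. For models without a closed-form MLE, a numerical search like this one (usually a proper optimizer rather than a grid) is how the estimate is found in practice.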

Why MLE Is the Right Method

MLE has three properties that make it a standard choice for estimation, all of which hold under mild regularity conditions on the model. It's consistent: as the sample size increases, the MLE converges to the true parameter. It's asymptotically efficient: for large samples, its variance approaches the Cramér-Rao lower bound, the smallest variance any unbiased estimator can achieve, so it extracts as much information from the data as possible. And it's invariant: if μ̂ is the MLE of μ, then g(μ̂) is automatically the MLE of g(μ), a convenient property when transforming between parameterizations. These are mostly large-sample guarantees; in small samples an MLE can be biased.
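Two of these properties are easy to see in a small simulation. The sketch below assumes a true mean of 480 hours to generate data, then computes the MLE at increasing sample sizes (consistency) and plugs it into g(μ) = e^(-500/μ), the probability that a battery lasts longer than 500 hours under the exponential model (invariance). The 500-hour threshold and the sample sizes are arbitrary illustrative choices.

```python
import math
import random

TRUE_MEAN = 480.0  # assumed "true" average lifespan, used only to simulate data

def mle_mean(data):
    """MLE of the exponential mean: the sample mean."""
    return sum(data) / len(data)

random.seed(2)
estimates = {}
for n in (20, 200, 2000, 200000):
    sample = [random.expovariate(1.0 / TRUE_MEAN) for _ in range(n)]
    mu_hat = mle_mean(sample)
    # Invariance: the MLE of g(mu) is g(mu_hat), with no separate fitting step.
    survival_hat = math.exp(-500.0 / mu_hat)
    estimates[n] = (mu_hat, survival_hat)
    print(f"n = {n:6d}  mu_hat = {mu_hat:7.2f}  P(lifespan > 500h) = {survival_hat:.4f}")

print(f"true values: mu = {TRUE_MEAN:.2f}, "
      f"P(lifespan > 500h) = {math.exp(-500.0 / TRUE_MEAN):.4f}")
```

Individual runs fluctuate, but the estimates at the largest sample size should sit very close to the true values, while those at n = 20 can be noticeably off.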

Other Applications

MLE is the estimation method behind logistic regression (predicting whether an email is spam), survival analysis (modeling time to failure in medical studies), phylogenetic tree reconstruction (finding the most likely evolutionary history from DNA sequences), and speech recognition (fitting acoustic models to recorded speech). Whenever you need to fit a probabilistic model to observed data, MLE is typically the first method to reach for.

Conclusion

Maximum likelihood estimation answers the question "which model parameters best explain what we observed?" by finding the values that make the observed data most probable. For the battery engineer, it shows that, under the exponential model, the sample mean is the right estimate, not by intuition but by derivation. For more complex models, MLE provides equally principled answers. It's the foundation of much of modern statistics: a systematic way to extract as much information as possible from limited data.