A new drug appears to lower blood pressure in a clinical trial. A website redesign seems to increase sales. Students taught with a new method score higher on tests. In each case, the key question is the same: is the observed effect real, or could it have arisen by chance? Hypothesis testing is the formal statistical framework for answering this question—the procedure that separates genuine effects from random noise, and the foundation of evidence-based decision making in science, medicine, and business.

The Null and Alternative Hypotheses

Every hypothesis test begins with two competing hypotheses. The null hypothesis H₀ is the skeptical default—typically that there is no effect, no difference, no relationship. The alternative hypothesis Hₐ is the claim for which we seek evidence. We never directly prove the alternative hypothesis. Instead, we ask: if the null hypothesis were true, how surprising would our data be? If sufficiently surprising, we reject the null in favor of the alternative. This asymmetry—innocent until proven guilty—protects against false positives by requiring strong evidence before accepting a claim.
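A minimal sketch of this setup, assuming SciPy is available and using made-up numbers: suppose we test whether a coin is fair. H₀ says p = 0.5; Hₐ says p ≠ 0.5. An exact binomial test asks how surprising the observed count would be if H₀ were true.

```python
from scipy.stats import binomtest

# H0: the coin is fair (p = 0.5); Ha: the coin is biased (p != 0.5).
# Hypothetical observation: 60 heads in 100 flips.
result = binomtest(60, n=100, p=0.5, alternative="two-sided")
print(f"p-value: {result.pvalue:.4f}")
```

Here the p-value comes out around 0.057: 60 heads is somewhat surprising under a fair coin, but not surprising enough to reject H₀ at the conventional 0.05 level.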

The p-value

The p-value quantifies how surprising the data is under the null hypothesis. Formally, it is the probability of observing data at least as extreme as what we actually observed, assuming H₀ is true. A p-value of 0.03 means: if the null hypothesis were true, there would be only a 3% chance of seeing data this extreme or more extreme. By convention, p < 0.05 is called statistically significant. But this threshold is arbitrary, and the p-value is widely misunderstood. It is not the probability that the null hypothesis is true—it is a statement about the data given the null hypothesis.

p-value = P(data this extreme or more | H₀ is true)
Reject H₀ if p < α (typically α = 0.05)
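The definition above can be made concrete by simulation. This sketch (assuming NumPy, with hypothetical numbers) estimates a two-sided p-value for an observed sample mean of 0.5, under a null hypothesis that the population mean is 0 with known standard deviation 1: it simulates many datasets where H₀ is true and counts how often a result at least as extreme appears.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical example: a sample of n = 25 with observed mean 0.5,
# testing H0: population mean = 0, with known sd = 1.
n, observed_mean, sd = 25, 0.5, 1.0

# Simulate the sampling distribution of the mean under H0.
null_means = rng.normal(0, sd, size=(100_000, n)).mean(axis=1)

# Two-sided p-value: fraction of null-world results at least as extreme.
p_value = np.mean(np.abs(null_means) >= abs(observed_mean))
print(f"simulated p-value: {p_value:.4f}")
```

The simulated value lands near the analytic answer (z = 2.5, p ≈ 0.012): if the null were true, data this extreme would occur only about 1% of the time.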

Type I and Type II Errors

Two types of errors are possible. A Type I error (false positive) is rejecting a true null hypothesis—concluding a drug works when it doesn't. The significance level α is the probability of a Type I error; setting α = 0.05 means accepting a 5% false positive rate. A Type II error (false negative) is failing to reject a false null hypothesis—missing a real effect. The probability of a Type II error is β; statistical power = 1 − β is the probability of correctly detecting a real effect. At a fixed sample size the two error rates trade off—lowering α raises β—so increasing the sample size is the only way to reduce both at once.

Type I error (false positive): α = P(reject H₀ | H₀ true)
Power = 1 − β = P(reject H₀ | H₀ false)
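Both quantities can be estimated by Monte Carlo. This sketch (assuming NumPy and SciPy, with illustrative parameters: two groups of n = 30, α = 0.05) repeatedly simulates experiments and records how often a t-test rejects H₀—first when there is no real effect (estimating the Type I error rate) and then when the true difference is 0.8 standard deviations (estimating power).

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 2000

def rejection_rate(true_diff):
    """Fraction of simulated experiments in which H0 is rejected."""
    rejections = 0
    for _ in range(trials):
        a = rng.normal(0, 1, n)              # control group
        b = rng.normal(true_diff, 1, n)      # treatment group
        if ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / trials

type1_rate = rejection_rate(0.0)   # H0 true: should be close to alpha
power = rejection_rate(0.8)        # H0 false: probability of detection
print(f"Type I error rate: {type1_rate:.3f}")
print(f"Power (true diff = 0.8 sd): {power:.3f}")
```

The Type I rate comes out near 0.05, as designed, while power for this effect size and sample is roughly 0.85—meaning about 15% of real effects of this magnitude would still be missed.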

Common Tests

Different data types call for different tests. The t-test compares means between one or two groups—is the average blood pressure lower in the treatment group? ANOVA extends this to multiple groups. The chi-square test examines relationships between categorical variables. Correlation tests assess linear relationships between continuous variables. Non-parametric tests like the Mann-Whitney U test make fewer distributional assumptions and are used when normality cannot be assumed. Each test has specific assumptions that must be checked—applying the wrong test produces meaningless p-values.
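The tests above all share a common interface in scientific libraries. A brief tour, assuming SciPy and using fabricated data (the group means and the 2×2 counts are illustrative, not from any real study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1, 40)   # hypothetical control group
b = rng.normal(0.7, 1, 40)   # hypothetical treatment group

# t-test: compares the means of two groups.
t_p = stats.ttest_ind(a, b).pvalue

# Mann-Whitney U: non-parametric alternative when normality is doubtful.
u_p = stats.mannwhitneyu(a, b).pvalue

# Chi-square: association between two categorical variables,
# here a hypothetical 2x2 table of outcome counts by group.
table = np.array([[30, 10], [20, 20]])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)

print(f"t-test p={t_p:.4f}, Mann-Whitney p={u_p:.4f}, chi-square p={chi_p:.4f}")
```

Each function reports a p-value, but the assumptions behind them differ—which is exactly why choosing the test to match the data type and distribution matters.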

The Replication Crisis

Despite its widespread use, hypothesis testing has contributed to a replication crisis in science. Researchers publish positive results preferentially (publication bias), run multiple tests but report only significant ones (p-hacking), and collect data until p < 0.05 appears (optional stopping). When other researchers attempt to replicate published findings, many fail. A landmark 2015 study found that only 36% of psychology findings replicated. Solutions include pre-registration, larger samples, reporting effect sizes alongside p-values, and Bayesian alternatives that directly quantify evidence.
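The damage done by p-hacking is easy to demonstrate by simulation. In this sketch (assuming NumPy and SciPy), every experiment runs 20 independent comparisons in which the null hypothesis is true for all of them, then "reports" only whether any test cleared p < 0.05—mimicking a researcher who tests many outcomes and publishes the significant one.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
n_experiments, n_tests, n = 1000, 20, 30

false_positive_runs = 0
for _ in range(n_experiments):
    # 20 independent comparisons; H0 is true for every one.
    pvals = [ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue
             for _ in range(n_tests)]
    if min(pvals) < 0.05:   # report only the "significant" result
        false_positive_runs += 1

rate = false_positive_runs / n_experiments
print(f"chance of at least one p < 0.05 across 20 null tests: {rate:.2f}")
```

The simulated rate lands near the analytic value 1 − 0.95²⁰ ≈ 0.64: with 20 tries, a "significant" result appears by chance alone nearly two-thirds of the time—which is why pre-registration and corrections for multiple comparisons matter.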

Conclusion

Hypothesis testing provides a principled framework for drawing conclusions from noisy data. Its logic—quantifying how surprising observations are under a skeptical null—is sound when applied carefully. But the p-value threshold has been fetishized into a binary verdict on truth, leading to widespread misuse. Understanding what p-values actually measure—and what they don't—is essential for anyone who reads, conducts, or evaluates research in any empirical field.