The Chi-Square Test: Checking Fit Between Data and Models

A casino manager suspects one of her roulette wheels is biased — landing on certain numbers more than others. She records 3,800 spins. Each of the 38 numbers should appear about 100 times if the wheel is fair. Some numbers appeared 115 times, others only 83. Is the wheel rigged, or is this just the normal variation you'd expect from random chance? This is exactly the question the chi-square test was designed to answer: is the gap between what you observed and what you expected due to chance, or is something systematically different?

The Chi-Square Statistic

The chi-square test measures the total discrepancy between observed and expected counts. For each category, compute the difference between observed and expected counts, square it (to make all differences positive), and divide by the expected count (to normalize — a deviation of 10 from an expected 100 is less surprising than a deviation of 10 from an expected 15). Sum across all categories.

χ² = Σ (Observed - Expected)² / Expected

For the roulette wheel with 38 numbers:

χ² = (115-100)²/100 + (83-100)²/100 + ... (38 terms)
   = 225/100 + 289/100 + ...
   = 2.25 + 2.89 + ...

A small χ² → observed data matches expected well (fair wheel)
A large χ² → data departs significantly from expected (possibly biased)

The squaring step is important: it means large deviations contribute disproportionately more than small ones, making the test sensitive to big discrepancies anywhere in the data.
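The computation described above can be sketched in a few lines of Python. The function name chi_square_statistic is hypothetical, chosen for illustration, and the call uses only the two counts quoted in the text (115 and 83), so it yields a partial sum, not the wheel's full statistic.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Only the two counts quoted in the text; a real test would pass all 38.
partial = chi_square_statistic([115, 83], [100, 100])
print(round(partial, 2))  # 2.25 + 2.89 = 5.14
```

With all 38 observed counts in the first list and 100 repeated 38 times in the second, the same call would return the full χ² for the wheel.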

Is the Value "Big Enough" to Be Suspicious?

When the null hypothesis (fair wheel) is true, the chi-square statistic approximately follows a known probability distribution, the chi-square distribution. The approximation is good when the expected counts are reasonably large, as they are here at 100 per number. The shape of this distribution depends on one number: the degrees of freedom, which equals the number of categories minus one. For a roulette wheel with 38 numbers, degrees of freedom = 37.

Degrees of freedom = (number of categories - 1) = 38 - 1 = 37

For 37 degrees of freedom:

χ² > 52.2 → p-value < 0.05 (suspicious at the 5% significance level)
χ² > 59.9 → p-value < 0.01 (very suspicious)

The p-value answers: if the wheel were truly fair, what fraction of the time would we see a χ² at least this large by chance? A p-value below 0.05 means this would happen less than 5% of the time, so we reject the hypothesis of a fair wheel at the 5% significance level.

For the casino manager's data, suppose χ² = 58.3. The p-value would be about 0.014: a fair wheel would produce this much variation only about 1.4% of the time. She should investigate the wheel.
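To see where such a p-value comes from, here is a minimal sketch that evaluates the upper tail of the chi-square distribution using only the Python standard library. It assumes a series expansion of the regularized lower incomplete gamma function, which is adequate for moderate values like the ones in this example but loses precision for extremely small p-values. The function name chi2_upper_tail is hypothetical. In practice, a statistics library would normally be used instead.

```python
import math


def chi2_upper_tail(x, df):
    """Approximate P(X >= x) for a chi-square variable with df degrees of freedom.

    Uses the series gamma(a, z) = z^a e^-z * sum_n z^n / (a (a+1) ... (a+n))
    with a = df / 2 and z = x / 2, then divides by Gamma(a).
    """
    if x <= 0:
        return 1.0
    a = df / 2.0
    z = x / 2.0
    term = 1.0 / a          # n = 0 term of the series
    total = term
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= z / (a + n)  # ratio of consecutive series terms
        total += term
    # Work in logs to avoid overflow in z**a and Gamma(a).
    log_prefactor = a * math.log(z) - z - math.lgamma(a)
    lower = total * math.exp(log_prefactor)
    return max(0.0, 1.0 - lower)


# The example value from the text: chi-square of 58.3 with 37 degrees of freedom.
p_value = chi2_upper_tail(58.3, 37)
print(f"p-value: {p_value:.4f}")
```

For the example value the result falls between 0.01 and 0.05, which matches the statistic lying above the 5% critical value of 52.2.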

Testing Independence

The chi-square test has a second major application: testing whether two categorical variables are independent of each other. Are smoking and lung cancer independent? Are gender and voting preference independent? You construct a table of counts (a contingency table), compute expected counts under independence, and apply the same χ² formula. A large χ² indicates the variables are not independent — they're associated.

Independence test (contingency table):

Expected count in cell (i,j) = (row i total × column j total) / grand total

χ² = Σ over all cells: (Observed - Expected)² / Expected

Degrees of freedom = (rows - 1) × (columns - 1)
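The independence calculation can be sketched the same way. The function name independence_test is hypothetical, and the 2×2 table in the usage lines holds made-up counts chosen only so the arithmetic is easy to check by hand. They are not real data.

```python
def independence_test(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts.

    `table` is a list of rows, each row a list of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Made-up counts for illustration: every expected cell is 50 * 50 / 100 = 25.
chi2, df = independence_test([[20, 30], [30, 20]])
print(chi2, df)  # 4.0 1
```

Each cell differs from its expected count of 25 by 5, so each contributes 25/25 = 1 and the four cells sum to 4, with (2 - 1) × (2 - 1) = 1 degree of freedom.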

Mendel's Peas — And a Puzzle

Gregor Mendel's famous genetics experiments produced ratios of plant traits that matched his predicted 3:1 ratios remarkably well. In 1936, the statistician R. A. Fisher applied the chi-square test to Mendel's published data and found something suspicious: the results fit the theory too well, better than chance would predict even if the theory is correct. Mendel's chi-square values were consistently too small, suggesting the data may have been selectively recorded or adjusted. The chi-square test flagged potential manipulation in 19th-century data using nothing more than the expected-versus-observed comparison.

Conclusion

The chi-square test answers one question: is the gap between observed data and expected values large enough to be suspicious, or within the range of normal random variation? By computing χ² = Σ(O-E)²/E and comparing it to a known distribution, it provides a p-value: the probability of seeing data at least this discrepant if random chance alone were at work. Applied to the roulette wheel, it can flag a biased game. Applied to genetics data, it flagged results that were suspiciously clean. Applied to medical or social research, it tests whether two variables are related. It's one of the most widely used statistical tests in science precisely because the question it answers arises in so many contexts.