Decision Trees and Random Forests: Learning from Data

A doctor examining a patient with chest pain must decide: is this a heart attack, or indigestion? They run through a mental checklist — age, blood pressure, electrocardiogram results, whether the pain radiates to the arm — and each answer narrows the possibilities. This branching, question-by-question process is exactly how decision trees work. The difference is that a computer can learn which questions to ask and in what order directly from thousands of past cases.

What Is a Decision Tree?

A decision tree is a flowchart built from data. At each node, it asks a yes-or-no question about one variable. Based on the answer, it branches left or right. At the end of each branch — a leaf — it makes a prediction. For the chest pain example, the tree might start by asking "Is the patient over 60?" Then "Does the ECG show ST elevation?" Then "Is systolic blood pressure above 140?" After a few questions, it reaches a leaf that says "High probability of heart attack" or "Low probability."
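To make the flowchart idea concrete, here is a minimal sketch of the chest pain tree as a nested data structure with a small traversal function. The three questions come from the text above, but the tree's shape, the leaf labels, and the names TREE and predict are assumptions invented for illustration. Nothing here was learned from real patient data, and it is not medical guidance.

```python
# Hypothetical tree: the structure and leaf labels are made up for illustration.
# Internal nodes are dicts holding a yes/no question; leaves are plain strings.
TREE = {
    "question": lambda p: p["age"] > 60,
    "yes": {
        "question": lambda p: p["st_elevation"],
        "yes": "High probability",
        "no": {
            "question": lambda p: p["systolic_bp"] > 140,
            "yes": "High probability",
            "no": "Low probability",
        },
    },
    "no": "Low probability",
}


def predict(node, patient):
    """Follow yes/no answers from the root until a leaf (a string) is reached."""
    while isinstance(node, dict):
        node = node["yes"] if node["question"](patient) else node["no"]
    return node


patient = {"age": 70, "st_elevation": False, "systolic_bp": 150}
print(predict(TREE, patient))
```

Each prediction visits only one path from root to leaf, so the cost of a prediction grows with the depth of the tree, not with the total number of nodes.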

How a Tree Gets Built

A decision tree learns by finding the question that best separates your data at each step. "Best" means the question that, when answered, creates the purest groups — groups that are most homogeneous in terms of the outcome you're predicting.

Purity is commonly measured by a metric called Gini impurity. A perfectly pure group (all heart attacks, or all non-heart-attacks) has Gini impurity of 0. A perfectly mixed two-class group (50% each) has Gini impurity of 0.5. The algorithm tries every candidate question on every variable and picks the one whose resulting groups have the lowest size-weighted average Gini impurity, which is the same as the biggest drop in impurity from the parent group.

Gini impurity = 1 - Σ pᵢ², where pᵢ is the fraction of examples in class i.

Example: a group with 90% heart attacks and 10% not:
Gini = 1 - (0.9² + 0.1²) = 1 - (0.81 + 0.01) = 0.18 (fairly pure)

A group with a 50/50 split:
Gini = 1 - (0.5² + 0.5²) = 1 - 0.50 = 0.50 (maximally impure)
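The same arithmetic can be checked in a few lines of Python. This is a minimal sketch: the function name gini_impurity and the label strings are assumptions chosen for illustration, and the two groups mirror the 90/10 and 50/50 examples above.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2), where p_i is the fraction of labels in class i."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention for an empty group
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


group_a = ["attack"] * 9 + ["none"] * 1   # 90% / 10%
group_b = ["attack"] * 5 + ["none"] * 5   # 50% / 50%

print(round(gini_impurity(group_a), 2))   # 0.18
print(round(gini_impurity(group_b), 2))   # 0.5
```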

The algorithm builds the tree greedily: at each node, pick the best question now, without worrying about later nodes. It stops splitting when groups become pure enough, or when groups become too small to split further.
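One greedy step can be sketched as a search over thresholds on a single numeric variable. This is an illustrative simplification, not how any particular library implements it: the helper names gini and best_threshold are assumptions, the ages and outcomes are toy data invented for the example, and a real tree would repeat this search over every variable and then recurse into each resulting group.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold?' question.

    Candidate thresholds are midpoints between consecutive distinct values.
    If no split beats the unsplit group, the threshold is None.
    """
    n = len(labels)
    best = (None, gini(labels))  # baseline: impurity without splitting
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Weight each side's impurity by its share of the examples.
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (t, weighted)
    return best


# Toy data: 1 = heart attack, 0 = not.
ages = [35, 42, 50, 58, 63, 67, 71, 80]
outcome = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_threshold(ages, outcome))  # (60.5, 0.0)
```

In this toy data the question "Is age at most 60.5?" separates the two outcomes perfectly, so the weighted impurity falls from 0.5 to 0. Real data is rarely this clean, which is why the tree keeps asking further questions inside each group.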

The Problem: Overfitting

A single decision tree has a weakness: it can memorize the training data. Given enough splits, it creates so many specific rules that it perfectly classifies every past case — but fails on new cases because it's learned the noise in the data rather than the true patterns. A tree that asks "Is the patient's age exactly 47, and their last name starts with B?" is too specific to generalize.

Random Forests: Wisdom of the Crowd

The solution is to build hundreds of trees instead of one: a random forest. Each tree is trained on a random sample of the training data, drawn with replacement, and is allowed to consider only a random subset of the variables at each split. Individual trees still overfit, but in different directions. When you combine their predictions (a majority vote for classification, an average for numeric targets), many of their individual errors cancel out, and the collective prediction is usually more accurate than that of any single tree.
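A short calculation suggests why voting helps. The sketch below assumes an idealized case in which every tree is right with the same probability and the trees err independently of one another. Real trees in a forest are correlated because they learn from overlapping data, so the actual gain is smaller than these numbers. The function name majority_vote_accuracy is an assumption made for this example, and math.comb requires Python 3.8 or later.

```python
from math import comb


def majority_vote_accuracy(n_voters, p_correct):
    """Probability that more than half of n independent voters are correct.

    Assumes n_voters is odd (no ties) and that each voter is correct with
    probability p_correct, independently of the others.
    """
    need = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(need, n_voters + 1)
    )


for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

With three voters that are each right 60% of the time, the majority is right 64.8% of the time, and the figure keeps rising as more independent voters are added. The randomness in a forest (different samples, different variable subsets) is there to push the trees closer to this independent ideal.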

Random forest prediction = majority vote of N trees
N = 100 to 1000 trees is typical
Each tree sees about 63% of the distinct training examples (a random sample with replacement, the same size as the training set)
Each split considers about √(total variables) randomly chosen variables (a common default for classification)
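The 63% figure follows from sampling with replacement: when n examples are drawn n times, the chance that a given example is never picked is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 for large n, leaving about 63.2% of examples seen at least once. The sketch below checks this by simulation and also draws a random variable subset of size √(total variables). The function names, the seed, and the sizes are assumptions chosen for illustration.

```python
import math
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]


def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) distinct feature indices for one split."""
    k = max(1, int(math.sqrt(n_features)))
    return rng.sample(range(n_features), k)


rng = random.Random(0)  # fixed seed so repeated runs give the same draw
n = 10_000
sample = bootstrap_indices(n, rng)
unique_fraction = len(set(sample)) / n

print(round(unique_fraction, 3))        # close to 1 - 1/e, about 0.632
print(random_feature_subset(16, rng))   # 4 distinct indices out of 16
```

The examples a tree never sees (roughly 37%) are often called out-of-bag examples, and they can serve as a built-in validation set for that tree.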

Other Applications

Random forests are used to predict loan defaults (which applicants are most likely to miss payments?), detect fraud (does this transaction pattern look unusual?), diagnose plant disease from leaf photographs, and predict equipment failures in manufacturing. They're popular because they handle mixed data types, don't require scaling or normalization, and naturally indicate which variables matter most — a useful feature for medical diagnosis where understanding the reasoning matters as much as the prediction.

Conclusion

Decision trees mimic the branching logic of expert reasoning: they ask the most informative questions first and narrow down the possibilities with each answer. The math selects questions using Gini impurity, which measures how much each question clarifies the outcome. Random forests counter the single tree's tendency to overfit by combining hundreds of diverse trees, so that many individual mistakes are outvoted. From medical diagnosis to fraud detection, this combination of simple logic and statistical averaging remains one of the most widely used and dependable tools in machine learning.