Reliability Engineering: Predicting When Things Fail

On January 28, 1986, the Space Shuttle Challenger broke apart 73 seconds after launch, killing all seven crew members. The cause: an O-ring seal failed in the cold weather. NASA engineers had data showing O-rings behaved poorly at low temperatures — but they didn't have a mathematical framework for translating that data into a probability of failure on that specific cold morning. Reliability engineering provides that framework: the mathematics of predicting when components will fail and what that means for systems that depend on them.

The Bathtub Curve

Most manufactured components follow a characteristic failure rate pattern over their lifetime called the bathtub curve. Early in life, failure rates are high: these are "infant mortality" failures caused by manufacturing defects, and the defective units are weeded out quickly. In the middle period, failure rates are low and roughly constant; this is the normal useful life. Late in life, failure rates rise again as components wear out. Knowing which phase a component is in determines the appropriate maintenance and replacement strategy.

Failure rate λ(t) over time:
  Early life:  λ(t) high (manufacturing defects surfacing)
  Middle life: λ(t) ≈ constant = λ (random failures)
  End of life: λ(t) rising (wear-out failures)

During middle life (constant λ):
  Reliability R(t) = P(component survives past time t) = e^(-λt)
  Mean Time Between Failures (MTBF) = 1/λ

The exponential reliability function R(t) = e^(-λt) describes the probability of surviving past time t during the useful life phase. If λ = 0.001 failures per hour (MTBF = 1000 hours), then after 500 hours: R(500) = e^(-0.5) ≈ 0.61. There's a 61% chance the component is still working — and a 39% chance it's already failed. Engineers use this to schedule preventive maintenance before the failure probability becomes unacceptably high.
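The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes a constant failure rate (the useful-life phase only); the function names are illustrative and do not come from any particular reliability library.

```python
import math


def reliability(failure_rate: float, hours: float) -> float:
    """R(t) = e^(-λt): probability of surviving past `hours` at constant λ."""
    return math.exp(-failure_rate * hours)


def mtbf(failure_rate: float) -> float:
    """Mean Time Between Failures for a constant failure rate: 1/λ."""
    return 1.0 / failure_rate


def hours_until_reliability(failure_rate: float, target: float) -> float:
    """Solve e^(-λt) = target for t: the time at which R(t) falls to `target`."""
    return -math.log(target) / failure_rate


failure_rate = 0.001  # failures per hour, as in the example above

print(round(reliability(failure_rate, 500), 2))               # 0.61
print(round(mtbf(failure_rate)))                              # 1000
print(round(hours_until_reliability(failure_rate, 0.90), 1))  # 105.4
```

The last line shows how a maintenance interval might be chosen: if a 10% failure probability is the most that can be tolerated, this component would need attention after roughly 105 hours, long before its 1000-hour MTBF.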

System Reliability: Series and Parallel

Real systems combine many components. How their individual reliabilities combine into system reliability depends on the architecture.

Series system (all components must work):
  R_system = R₁ × R₂ × R₃ × ...
  (weakest link: any failure kills the system)

Parallel system (at least one must work):
  R_system = 1 - (1-R₁)(1-R₂)(1-R₃)...
  (redundancy: system only fails if ALL components fail)
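A short Python sketch makes the contrast between the two architectures concrete. It assumes component failures are independent, which both formulas require; the function names and the three 0.90 reliabilities are illustrative values, not data from a real system.

```python
from math import prod


def series_reliability(reliabilities):
    """All components must work: multiply the individual reliabilities."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """At least one must work: 1 minus the probability that all fail."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Three hypothetical components, each 90% reliable, failing independently.
components = [0.90, 0.90, 0.90]

print(round(series_reliability(components), 3))    # 0.729
print(round(parallel_reliability(components), 3))  # 0.999
```

The same three components give about 73% reliability in series and about 99.9% in parallel, which is why redundancy is the usual remedy for a critical component that cannot be made more reliable on its own.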

A Space Shuttle had thousands of components in series, and any single critical failure could be catastrophic. The O-ring seal was one such component. Suppose, hypothetically, that its reliability at cold temperatures was 0.90 (a 10% chance of failure). Because the shuttle required it to work, that factor of 0.90 multiplied directly into the system reliability: the system could be no more than 90% reliable, however good every other component was.

The Challenger Lesson

The data on O-ring damage at low temperatures existed. An analysis that included every prior flight, not only the flights that showed damage, would have revealed a clear trend: O-ring damage incidents were far more frequent at low launch temperatures than at high ones. A logistic regression model, a standard statistical tool in reliability analysis, would have predicted a dramatically elevated probability of O-ring failure at the near-freezing temperatures of launch morning, well below the 53°F of the coldest previous launch. The mathematical model that wasn't built might have stopped the launch.
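The shape of such a model can be sketched in Python. The logistic form below is standard, but the coefficients are hypothetical round numbers chosen only to show how predicted failure probability climbs as temperature falls; they are not fitted to the shuttle flight data or to any other real data set.

```python
import math


def failure_probability(temp_f: float, intercept: float, slope: float) -> float:
    """Logistic model: P(failure) = 1 / (1 + exp(-(intercept + slope * temp_f)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * temp_f)))


# HYPOTHETICAL coefficients for illustration only (not estimated from real data).
# A negative slope means colder temperatures give a higher failure probability.
INTERCEPT = 12.0
SLOPE = -0.2

for temp_f in (30, 50, 60, 70, 80):
    p = failure_probability(temp_f, INTERCEPT, SLOPE)
    print(f"{temp_f} F: P(failure) = {p:.3f}")
```

In a real analysis the intercept and slope would be estimated from the flight record, and the important caveat is that predicting at a temperature far colder than any previous launch is an extrapolation beyond the data.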

Fault Tree Analysis

Modern reliability engineering uses fault tree analysis: start with an undesired event (system failure) and work backward through all possible causes, building a tree of "AND" gates (all sub-causes must occur together) and "OR" gates (any one sub-cause is sufficient). The probability of the top event is computed by combining the component failure probabilities through the tree: for independent events, an AND gate multiplies the input probabilities, and an OR gate gives one minus the product of their complements. The technique predates Challenger, but accident investigations like that one pushed probabilistic methods further into practice, and this visual, mathematical method is now widely used, and often required, for safety-critical aerospace and nuclear systems.
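A small fault tree can be evaluated in a few lines of Python. This sketch assumes all basic events are independent; the tree itself (two redundant pumps plus a controller) and its failure probabilities are hypothetical, invented only to show how the gates combine.

```python
from math import prod


def and_gate(probabilities):
    """Output event occurs only if ALL inputs occur (independent events)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if ANY input occurs (independent events)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical basic-event failure probabilities.
pump_a_fails = 0.01
pump_b_fails = 0.02
controller_fails = 0.001

# Cooling is lost only if both redundant pumps fail (AND gate).
both_pumps_fail = and_gate([pump_a_fails, pump_b_fails])

# Top event: the system fails if cooling is lost OR the controller fails.
system_fails = or_gate([both_pumps_fail, controller_fails])

print(f"P(both pumps fail) = {both_pumps_fail:.6f}")
print(f"P(system fails)    = {system_fails:.6f}")
```

In this made-up tree the controller dominates the top-event probability even though each pump is individually far less reliable, because the pumps sit behind an AND gate. Finding that kind of dominant contributor is much of what the method is for.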

Conclusion

Reliability engineering translates component failure data into probabilities, then combines those probabilities to predict system behavior. The bathtub curve describes how failure rates evolve over a component's lifetime. Exponential reliability functions quantify survival probability. Series and parallel reliability formulas propagate component probabilities to system level. Fault tree analysis maps how failures combine into catastrophic events. Challenger showed what happens when this mathematics isn't applied. Modern aerospace, nuclear, and medical device engineering applies it rigorously — predicting failure before it happens rather than investigating it after.