Link Prediction: Forecasting New Connections

Twitter might suggest you follow someone you've never heard of—and be right. LinkedIn might predict a professional connection before you've exchanged a business card. Amazon recommends products bought by people with similar purchase histories. All of these involve link prediction: given the current structure of a network, which edges are likely to form next? This problem sits at the intersection of network science, machine learning, and probability, with applications from drug discovery to intelligence analysis.

Why Networks Have Missing Links

Real networks are almost always incomplete. Social networks have connections not yet formed between people who would become friends if introduced. Protein interaction networks have interactions not yet discovered experimentally. Knowledge graphs have relationships between entities that researchers haven't yet verified. Collaboration networks have potential collaborations between researchers who haven't met. Link prediction addresses both future link formation (social networks) and missing link discovery (biological networks)—the mathematical formulation is identical even though the interpretation differs.

Similarity-Based Methods

The simplest link prediction methods score each pair of non-connected nodes by a similarity measure. Common neighbors: two people likely to connect if they share many friends. Formally, the score is |N(u) ∩ N(v)|, where N(u) is the set of neighbors of u. Jaccard coefficient normalizes by union: |N(u) ∩ N(v)| / |N(u) ∪ N(v)|. The Adamic-Adar index weights common neighbors by the log-inverse of their degree—rare common neighbors count more than hubs. These methods are intuitive, computationally cheap, and surprisingly effective for social networks.

Common neighbors: |N(u) ∩ N(v)| Adamic-Adar: Σ_{w ∈ N(u)∩N(v)} 1/log|N(w)|

Path-Based Methods

Katz centrality scores potential links by the number of paths between nodes, with longer paths weighted less: score(u,v) = Σ_l β^l |paths_l(u,v)|, where β < 1 is a decay factor. The matrix form is (I − βA)^{−1} − I where A is the adjacency matrix. SimRank captures the idea that two nodes are similar if they are connected to similar nodes—a recursive definition solved iteratively. These path-based methods capture global network structure rather than just local neighborhoods, improving predictions in networks where short paths are less informative.

Katz score: score(u,v) = [(I − βA)⁻¹ − I]_{uv}

Matrix Factorization and Embeddings

Modern link prediction uses latent factor models. Matrix factorization decomposes the adjacency matrix A ≈ UV^T, where U and V are low-rank matrices of node embeddings. The predicted score for edge (u,v) is the dot product of their embedding vectors. Node2Vec and DeepWalk learn embeddings using random walks on the graph—nodes frequently co-occurring in walks get similar embeddings. Graph Neural Networks (GNNs) compute embeddings by aggregating neighborhood information through multiple layers, achieving state-of-the-art performance on benchmark link prediction tasks.

Evaluation

Evaluating link prediction requires care. Splitting networks into train/test sets temporally—using past edges to predict future ones—is the most realistic evaluation. Measuring performance with AUC (area under the ROC curve) or average precision accounts for the severe class imbalance: non-edges vastly outnumber edges in any sparse network. Random baseline performance is not 50%—it depends on network density. The right evaluation strategy depends on whether you're predicting future links (use temporal splits) or discovering missing links (use random holdout of known edges).

Conclusion

Link prediction demonstrates how network structure encodes probabilistic information about future connections. From simple heuristics counting common neighbors to deep learning models capturing complex structural patterns, the field has developed a rich toolkit. The core insight—that networks are not random but structured by principles that make some future links far more likely than others—enables useful predictions in social networks, biology, and recommendation systems, wherever connections emerge from underlying social or physical processes.