How AF2 Thinks

Jan. 2026

Since the 1960s, biologists have been haunted by an infuriating question: how do you predict what a protein looks like just from its amino acid sequence? In December 2020, AlphaFold2 (mostly) solved this problem and, in doing so, rewrote the computational rulebook of drug discovery.

ChatGPT revolutionized language and search by turning words into vectors that machines understand. AF2 is its biological equivalent. With AF2 and AF3, you can design a new protein, predict how drugs work, spot dangerous side effects, and engineer sustainable biomaterials—all from your laptop.

But there's a catch. Letters are easy; they're one-dimensional and lie flat on a page. Amino acids, on the other hand, are three-dimensional troublemakers. They twist, rotate, crash into each other, and form bonds like kids in a chaotic soccer match. So, how do we tackle this 3D chaos?

How do we turn a protein into something a computer understands?

This is a question I've been thinking about for the past month. Early explanations from mentors in the lab and in class helped me build some understanding, but they also motivated me to dig deeper into the details. While there are many excellent articles and papers on this topic (linked below in the references), my goal here is to understand the architecture by explaining it in a way that's accessible to a non-technical audience without losing the underlying ideas.

The Big Picture

There are two main stages in AF2: the Evoformer and the Structure Module. Shown below is a summary of the entire process (Jumper et al., 2021). Both stages work with two core representations: a multiple sequence alignment (MSA) representation and a pairwise representation. Think of the MSA as a way to capture genetic and evolutionary information, while the pairwise representation captures structural relationships between residues.

[Figure: the AlphaFold2 architecture]
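To make those two objects concrete, here are their shapes in a toy NumPy sketch (the sizes are made up; the real model uses far more channels):

```python
import numpy as np

# Toy sizes, purely illustrative.
n_seq, n_res, c_m, c_z = 4, 10, 8, 6   # sequences in the MSA, residues, channel dims

# MSA representation: one feature vector per (sequence, residue position).
msa_rep = np.zeros((n_seq, n_res, c_m))

# Pairwise representation: one feature vector per (residue i, residue j) pair.
pair_rep = np.zeros((n_res, n_res, c_z))

print(msa_rep.shape, pair_rep.shape)   # (4, 10, 8) (10, 10, 6)
```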

Part 1: The Evoformer

The Evoformer is the brain of AF2: it processes evolutionary and structural information with attention, playing the role that the Transformer stack plays in ChatGPT.

How it thinks in two directions at once:

Dual Attention Mechanism:

Imagine a spreadsheet where rows are different protein sequences from various species, and columns are positions along the protein chain. The Evoformer reads this spreadsheet by alternating between:

  • Horizontal reading (row attention): within each sequence, positions attend to other positions along the protein chain, capturing the structural patterns (and, as we'll see, it gets a nudge from the pairwise representation)
  • Vertical reading (column attention): at each position, the different sequences attend to one another, capturing how that position has varied across species, aka the evolutionary patterns

This dual approach is quite clever: the model integrates both genetic database searches (finding related sequences in nature) and structural template searches (finding known structures) into a unified representation, shown below (ML4FG slides).

[Figure: MSA and template search]

You might notice the additional bias term in the attention equation. It comes from the pairwise representation, so the row attention is steered by the model's current picture of which residues sit close together: a cheat sheet of plausible residue–residue relationships.
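Here's a minimal NumPy sketch of the two attention directions and the pair bias. It's heavily simplified (single head, no gating, no learned projections), with toy sizes I made up:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, bias=0.0):
    """Plain scaled dot-product attention; bias is added to the logits."""
    d = q.shape[-1]
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(d) + bias
    return softmax(logits) @ v

rng = np.random.default_rng(0)
n_seq, n_res, c = 4, 10, 8
msa = rng.normal(size=(n_seq, n_res, c))        # toy MSA representation
pair_bias = rng.normal(size=(n_res, n_res))     # toy bias derived from the pair representation

# Row attention: within each sequence, positions attend to positions,
# nudged by the pair bias toward residue pairs the model already believes are related.
row_out = attention(msa, msa, msa, bias=pair_bias)            # (n_seq, n_res, c)

# Column attention: at each position, sequences attend to sequences,
# mixing evolutionary information across species.
msa_t = np.swapaxes(msa, 0, 1)                                # (n_res, n_seq, c)
col_out = np.swapaxes(attention(msa_t, msa_t, msa_t), 0, 1)   # back to (n_seq, n_res, c)
```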

Pairwise Feature Updates:

While the MSA representation tracks evolutionary patterns, the pairwise representation tracks geometric relationships between every pair of amino acids. The Evoformer constantly updates both representations, checking for consistency. If the evolutionary clues suggest two amino acids should be close together, but the geometric features disagree, the model applies additional refinement steps.

In other words, AF2 is a professional overthinker—the model is always double-checking itself, ensuring that evolutionary and structural signals match up.
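How does evolutionary signal actually flow into the pairwise representation? One key mechanism in the paper is an "outer product mean" over the MSA. Here's a stripped-down sketch of that idea (the real block adds learned projections and normalization, and there's a reverse pair-to-MSA path via the attention bias above):

```python
import numpy as np

rng = np.random.default_rng(1)
n_seq, n_res, c = 4, 10, 8
msa = rng.normal(size=(n_seq, n_res, c))

# For every residue pair (i, j), take the outer product of their features
# within each sequence, then average over the sequences in the MSA.
# einsum indices: s = sequence, i/j = residue positions, a/b = channels.
outer = np.einsum('sia,sjb->ijab', msa, msa) / n_seq   # (n_res, n_res, c, c)
pair_update = outer.reshape(n_res, n_res, c * c)       # flattened, then (in AF2) projected

print(pair_update.shape)   # (10, 10, 64)
```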

Part 2: The Structure Module

After 48 Evoformer blocks, the Structure Module takes over. This is where abstract understanding becomes concrete 3D atomic coordinates (actual X, Y, and Z positions).

Now here's where things get interesting: it uses geometry you learned in elementary school.

Bag of Triangles

Remember the triangle inequality from math class? The one that says no side of a triangle can be longer than the other two sides combined? (ML4FG slides)

[Figure: the triangle inequality]

AF2 uses this simple rule as a soft constraint to build entire proteins (it's baked into the triangle attention and triangle multiplicative updates that refine the pairwise representation).

Pick any three amino acids in your protein. Let's call them i, j, and k. The distances between them must satisfy the triangle inequality. Now do this for thousands upon thousands of overlapping triangles throughout the entire protein. Every triplet of amino acids forms a triangle, and every triangle must be geometrically valid. (ML4FG slides)

[Figure: overlapping triangles of residues]

This prevents nonsense predictions. Without this constraint, the model could predict that amino acids A and B are both right next to amino acid C, but somehow A and B are miles apart from each other. That's physically impossible—and the triangle inequality catches it.

Think of it as a "bag of triangles"—thousands of small geometric constraints that, when satisfied simultaneously, create a realistic protein structure. It's like a complex 3D jigsaw puzzle where every piece has to fit with every other piece perfectly.
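AF2 doesn't literally loop over every triplet and check the rule; its triangle updates encourage consistency softly during learning. But a brute-force check makes clear what "geometrically valid" means; here's a toy version:

```python
import numpy as np

def triangle_violations(dist, tol=1e-6):
    """Count residue triplets (i, j, k) whose pairwise distances
    violate the triangle inequality d(i,j) <= d(i,k) + d(k,j)."""
    n = dist.shape[0]
    violations = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if dist[i, j] > dist[i, k] + dist[k, j] + tol:
                    violations += 1
    return violations

# Distances measured from actual 3D coordinates can never violate the rule...
coords = np.random.default_rng(2).normal(size=(20, 3))
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
print(triangle_violations(dist))   # 0

# ...but an arbitrary "predicted" distance matrix easily can.
bad = dist.copy()
bad[0, 1] = bad[1, 0] = 100.0      # claim residues 0 and 1 are far apart
print(triangle_violations(bad))    # > 0
```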

Invariant Point Attention

Now the Structure Module still needs to think in 3D space. It does this using a specialized attention mechanism called invariant point attention (IPA). The key property: no matter how you rotate or move a protein, its internal relationships stay the same, similar to how a cup in your hand is still the same cup when you flip it over. IPA respects this by construction. It is "SE(3)-invariant" (math-speak for "global rotations and translations don't change the answer"): it compares 3D points only after mapping them through each residue's local frame, so the attention weights come out identical no matter how the whole structure is posed. The model doesn't have to learn this from seeing many orientations during training; the math guarantees it.

The attention formula has three parts (ML4FG slides):

[Figure: the IPA attention equation]

The green term is the standard sequence attention, the blue term is the learned bias (derived from the pairwise representation), and the last pink term compares 3D query and key points after mapping them through each residue's rotation (R) and translation (t). Because that comparison is a distance, global rotations and translations cancel out; that's where the invariance comes from.
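The invariance itself is just geometry: if you rotate and translate a whole set of points, the distances between them don't change, and distances are what the pink term is built on. A quick NumPy sanity check (toy points, not real residues):

```python
import numpy as np

rng = np.random.default_rng(3)
points = rng.normal(size=(10, 3))              # toy 3D points ("residue positions")

# A random global rotation (via QR decomposition) plus a translation.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:                       # make it a proper rotation (det = +1)
    Q[:, 0] *= -1
t = rng.normal(size=3)
moved = points @ Q.T + t

def pairwise_dists(x):
    return np.linalg.norm(x[:, None] - x[None, :], axis=-1)

# The internal geometry (pairwise distances) is identical before and after.
print(np.allclose(pairwise_dists(points), pairwise_dists(moved)))   # True
```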

Building the Protein: Backbone and Sidechain Prediction

The Structure Module is like a sculptor: rough shape first, and then finer details.

Step 1: The Backbone

Every protein has a backbone: a repeating chain of amino acids whose atoms follow an N–Cα–C′ pattern. The model predicts this first. Just like NLP has its "bag of words," we now have a "bag of triangles": each residue's backbone triangle (N, Cα, C′) defines a rigid frame, a rotation plus a position, and these frames are the tokens that get processed through the attention mechanism, always checking: "Does this violate physics? Does this match what the Evoformer learned?"
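To make the "triangle as a rigid frame" idea concrete, here's a minimal NumPy sketch of how a rotation-plus-translation frame can be built from the three backbone atoms of one residue, in the spirit of the Gram–Schmidt construction the AF2 paper describes (the axis conventions and numbers here are my own simplification):

```python
import numpy as np

def frame_from_backbone(n, ca, c):
    """Build a rigid frame (rotation R, translation t) from one residue's
    three backbone atoms, Gram-Schmidt style. Simplified sketch."""
    v1 = c - ca
    v2 = n - ca
    e1 = v1 / np.linalg.norm(v1)
    u2 = v2 - np.dot(e1, v2) * e1
    e2 = u2 / np.linalg.norm(u2)
    e3 = np.cross(e1, e2)
    R = np.stack([e1, e2, e3], axis=-1)   # 3x3 rotation matrix
    t = ca                                # frame origin sits on the alpha carbon
    return R, t

# Toy coordinates for one residue's backbone triangle.
R, t = frame_from_backbone(np.array([1.0, 0.0, 0.0]),   # N
                           np.array([0.0, 0.0, 0.0]),   # Calpha
                           np.array([0.0, 1.5, 0.0]))   # C'
print(np.allclose(R @ R.T, np.eye(3)))   # True: R is a valid rotation
```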

After establishing the backbone, the model builds the sidechains (the R groups that make each amino acid unique) by predicting torsion angles (how much each bond can twist) and applying them to idealized amino-acid geometry. (ML4FG slides)

[Figure: sidechain prediction]

Side chains are important because they determine what the protein does, e.g., binding to a drug molecule or interacting with other proteins.
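To see what "applying a torsion angle" means geometrically, here's a toy sketch: rotating a downstream atom around a bond axis by a chosen angle, using Rodrigues' rotation formula. The atom names and numbers are made up; the real model composes rigid frames per torsion group rather than rotating atoms one by one:

```python
import numpy as np

def rotate_about_axis(point, axis_point, axis_dir, angle):
    """Rotate `point` around the line through `axis_point` along `axis_dir`
    by `angle` radians (Rodrigues' rotation formula)."""
    k = axis_dir / np.linalg.norm(axis_dir)
    v = point - axis_point
    v_rot = (v * np.cos(angle)
             + np.cross(k, v) * np.sin(angle)
             + k * np.dot(k, v) * (1 - np.cos(angle)))
    return axis_point + v_rot

# Toy example: swing a sidechain atom 60 degrees around the Calpha-Cbeta bond.
ca, cb = np.array([0.0, 0.0, 0.0]), np.array([1.5, 0.0, 0.0])
atom = np.array([2.0, 1.0, 0.0])
print(rotate_about_axis(atom, ca, cb - ca, np.deg2rad(60.0)))
```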

Starting from a "Black Hole"

Here's a question worth pausing on: where does the Structure Module even start? Conceptually, it begins with all residues collapsed at the exact same point in space. Effectively, it's a "black hole" state.

Then, over 8 iterative refinement blocks (which share weights), it gradually expands this collapsed structure into realistic 3D coordinates. Each iteration takes the Evoformer outputs as input, applies IPA, nudges every residue's frame, and asks: "Am I getting closer to physical reality?"
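As a cartoon of that loop, here's what the "black hole" initialization and repeated refinement look like in code (refinement_block below is a made-up placeholder, not the real IPA update):

```python
import numpy as np

n_res = 10

# "Black hole" initialization: every residue frame starts as the identity
# rotation sitting at the origin.
rotations = np.tile(np.eye(3), (n_res, 1, 1))   # (n_res, 3, 3)
translations = np.zeros((n_res, 3))             # (n_res, 3)

def refinement_block(rotations, translations):
    """Stand-in for one Structure Module iteration (IPA + frame update).
    Here it just nudges residues apart so the loop does something visible."""
    return rotations, translations + 0.1 * np.arange(n_res)[:, None]

# Eight iterations of refinement gradually "expand" the collapsed structure.
for _ in range(8):
    rotations, translations = refinement_block(rotations, translations)

print(translations[:3])
```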

Part 3: FAPE Loss

Every machine learning model needs a way to know if it's doing well. AlphaFold2 uses Frame Aligned Point Error (FAPE) as its main grading system.

Contrary to its very long name, it's pretty intuitive. Compare where predicted atoms are versus where they should be. By doing this comparison after aligning local reference frames, FAPE loss is robust to global rotations and translations. Mathematically, you're looking at the mean distance between predicted and true atom positions in these aligned frames, rather than a raw Cartesian MSE (ML4FG slides).

[Figure: the FAPE loss]
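Here's a toy NumPy version to make that concrete. It omits the paper's normalization constant and other bookkeeping, so treat it as a sketch of the idea rather than the real loss:

```python
import numpy as np

def fape(pred_xyz, true_xyz, pred_frames, true_frames, clamp=10.0, eps=1e-8):
    """Toy Frame Aligned Point Error.
    pred_xyz, true_xyz: (n_atoms, 3) atom positions.
    pred_frames, true_frames: lists of (R, t) rigid frames, one per residue.
    Every atom is expressed in every local frame; errors are clamped and averaged."""
    errors = []
    for (Rp, tp), (Rt, tt) in zip(pred_frames, true_frames):
        # Map global coordinates into each frame's local coordinates.
        local_pred = (pred_xyz - tp) @ Rp
        local_true = (true_xyz - tt) @ Rt
        d = np.sqrt(np.sum((local_pred - local_true) ** 2, axis=-1) + eps)
        errors.append(np.minimum(d, clamp))
    return float(np.mean(errors))

# Identical structures give (near) zero error regardless of global pose.
rng = np.random.default_rng(4)
xyz = rng.normal(size=(10, 3))
frames = [(np.eye(3), xyz[i]) for i in range(10)]
print(fape(xyz, xyz, frames, frames))   # ~0
```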

FAPE isn't the only training signal, though; the other losses still matter, so here's the full training objective (note: for initial training, not fine-tuning) (Jumper et al., 2021):

[Figure: AF2's full training loss]

FAPE anchors the structural part of the objective, while Laux is the auxiliary loss from the Structure Module (averaged FAPE and torsion losses on the intermediate structures), Ldist is the distogram loss (a cross-entropy over binned residue–residue distances from the pair representation), LMSA is the masked-MSA prediction loss (a BERT-style cross-entropy), and Lconf is the model confidence loss.

Together, these ensure the model gets both the big picture right (global structure) and the small details right (local geometry).

Part 4: Recycling

As one might expect, running AF2 once isn't enough. So we recycle: we run the entire pipeline again, feeding the previous prediction (the pair representation and the predicted structure) back in as extra input; the released model does three recycling iterations. This iterative refinement loop lets the model catch and fix inconsistencies, improving structure quality.

To put it simply, the first pass gives us a rough sketch, and each later pass inks the drawing and polishes the final details.
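As a skeleton, the recycling loop looks something like this (alphafold_pass is a hypothetical stand-in for one full Evoformer + Structure Module run, not a real API):

```python
import numpy as np

def alphafold_pass(sequence_features, prev_prediction=None):
    """Hypothetical stand-in for one full Evoformer + Structure Module pass.
    Real recycling feeds back the previous pair representation and predicted
    coordinates; here we just return toy coordinates."""
    n_res = sequence_features.shape[0]
    coords = np.zeros((n_res, 3)) if prev_prediction is None else prev_prediction
    return coords + 0.1   # pretend each pass refines the structure a little

features = np.zeros((10, 4))                     # toy per-residue input features
prediction = alphafold_pass(features)            # initial pass
for cycle in range(3):                           # three recycling iterations
    prediction = alphafold_pass(features, prediction)

print(prediction[0])
```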

Why This Matters Beyond Proteins

Understanding AF2's architecture opens a box that biologists have been trying to crack for decades. Instead of prying at it with a single hammer or pick, AF2 brings a whole toolkit: evolutionary context from MSAs (millions of years of natural selection), structural information from template searches (known protein structures), geometric constraints from the bag of triangles (physics), and spatial attention from IPA (3D reasoning).

The Evoformer and Structure Module together form a pipeline that predicts protein structures with accuracy that was, at the time, unprecedented.

Since its release, AF2 has been used not only by biologists, but also by biochemists, chemists, and physicists to predict the structures of hundreds of millions of proteins. As of Dec. 2025, roughly 3 million researchers have used it for drug discovery, understanding disease mechanisms, protein design, and creating novel biomaterials. AF3 extends these capabilities further, predicting protein–ligand interactions and multi-component complexes.

Of course, limitations exist. AF2 and AF3 are by no means perfect. They're still computationally expensive to run, and there isn't an effortless way to use them (you can't just type in a sequence the way you chat with ChatGPT). Proteins are dynamic and shape-shifting, adopting multiple conformations (otherwise your cells couldn't do much of anything), and AF2 predicts only a single static structure. Protein design remains challenging without additional tools.

Even so, breaking down biology into smaller mathematical units (triangles, attention mechanisms, and geometric constraints) is powerful. AF2 represents a fundamental advance in how we teach computers to understand our cellular machinery. It's attention 2.0 for structural biology, and that's all you need.

Coming next: ESM, RoseTTAFold maybe? Mech interp for PLMs/bio models

Please let me know if you catch any typos/technical errors/want to chat more! My email: bridget.g.liu@gmail.com.

References