Despite the immense utility of AlphaFold2 in predicting protein structure, the official implementation excludes the code for its training procedure and the data that training requires. This makes it difficult to study the model's learning behavior and to create variants that perform new tasks. Writing in Nature Methods, AlQuraishi and colleagues now report OpenFold, a trainable, open-source implementation of AlphaFold2 that provides insights into the model's learning mechanisms and capacity for generalization.
OpenFold was trained from scratch using OpenProteinSet — an open-source reproduction of AlphaFold2's training dataset — and was shown to match AlphaFold2 in accuracy. To probe specific properties of the architecture, such as data efficiency, the authors trained OpenFold in a series of runs on progressively smaller datasets, showing that it can achieve high accuracy with as few as 1,000 protein chains. They then evaluated its capacity for generalization by testing on out-of-distribution data, which revealed that the model appears to learn from local patterns in multiple sequence alignments and/or sequence–structure correlations rather than from patterns at the global fold level. Analysis of intermediate structures further revealed that although the model ultimately predicts global structure almost as accurately as local structure, it learns local structure first.
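The data-efficiency experiment described above — training separate models on progressively smaller subsets of the chain set — can be sketched as follows. This is a minimal, hypothetical illustration: `train_fn` stands in for a full OpenFold training-plus-evaluation run, and the function names are not from the OpenFold codebase.

```python
import random

def data_ablation(chains, sizes, train_fn, seed=0):
    """Train one model per subset size and record its score.

    `chains`   : list of protein-chain identifiers.
    `sizes`    : subset sizes to test (e.g. down to ~1,000 chains).
    `train_fn` : hypothetical stand-in for training a model on a
                 subset and returning a validation accuracy.
    """
    rng = random.Random(seed)  # fixed seed so subsets are reproducible
    results = {}
    for n in sorted(sizes, reverse=True):
        subset = rng.sample(chains, n)  # random subset without replacement
        results[n] = train_fn(subset)
    return results

# Toy usage: the "score" here is just the subset size, standing in
# for a real validation metric such as lDDT-Cα.
chains = [f"chain_{i}" for i in range(10_000)]
scores = data_ablation(chains, sizes=[10_000, 1_000, 100],
                       train_fn=lambda subset: len(subset))
```

In the actual study each run is a full training of the network, so the loop above would be distributed across many GPUs rather than executed serially.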