See the Supplementary Methods for additional details throughout.
Data curation
Round 1 CuSOD
UniProt74 sequences containing exactly one Sod_Cu Pfam75 domain were downloaded. Hmmsearch (Hmmer, http://hmmer.org/; ref. 76) identified the Sod_Cu domain envelopes. Sequences were truncated to remove extraneous sequence beyond the boundaries of the Sod_Cu match. Additional quality filtering was performed. Sequence duplicates were removed using CD-HIT77 at an identity threshold of 80%, and 80% and 20% of the remaining sequences were randomly sorted into a ‘training’ and a ‘test’ set, respectively. A training MSA was generated by an iterative process using MUSCLE (v3.8)78.
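The random 80%/20% partition described above can be sketched as follows. This is a minimal illustration rather than the authors' script: the function name `split_train_test`, the fixed seed and the example identifiers are assumptions, and deduplication with CD-HIT is assumed to have been run beforehand.

```python
import random


def split_train_test(seq_ids, train_fraction=0.8, seed=0):
    """Shuffle sequence identifiers and split them into (train, test) lists.

    The seed is an illustrative choice so that the split is reproducible.
    """
    ids = list(seq_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Hypothetical identifiers standing in for deduplicated sequences.
example_ids = [f"seq{i}" for i in range(10)]
train_ids, test_ids = split_train_test(example_ids)
```

The same function covers the 90%/10% split used in later rounds by passing `train_fraction=0.9`.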
Round 1 MDH
All UniProt sequences containing an Ldh_1_N Pfam domain followed by an Ldh_1_C domain were downloaded. LDH and MDH enzymes, based on Enzyme Commission number79, 1.1.1.27 for LDH and 1.1.1.37 for MDH, were downloaded from SwissProt. MUSCLE and hmmbuild were used to build profile hidden Markov models of both sets. Hmmsearch was used to score each UniProt Ldh_1_N/Ldh_1_C sequence against the MDH and LDH profiles, and sequences that had a stronger match to the MDH profile were retained. Additional processing was performed exactly as with the round 1 CuSOD data curation.
Quantification of domain architectures
See the Supplementary Methods.
Round 2 CuSOD pretest
UniProt CuSOD proteins were obtained as described above (round 1 CuSOD). The kingdom of origin for each sequence was obtained from the UniProt annotation. Transmembrane domains and signal peptides were predicted using Phobius66. Sequences with transmembrane domains were discarded. Signal peptides were removed from sequences predicted to contain them. A set of 14 representative CuSOD and 2 FeSOD proteins was manually selected for experimental screening, including eukaryotic, viral and bacterial proteins predicted not to contain signal peptides, and bacterial proteins with predicted signal peptides removed.
Rounds 2 and 3 CuSOD
All eukaryotic transcriptomes available from the NCBI Transcriptome Shotgun Assembly (TSA) sequence database80 were downloaded. Transdecoder (https://github.com/TransDecoder/TransDecoder) was used to extract the protein sequences from transcriptomes. Hmmsearch76 was used to identify proteins with exactly one Sod_Cu domain. This set of proteins was combined with the list of eukaryotic and viral CuSOD proteins from UniProt. Additional quality filtering was performed. Sequences that were more than 85% identical (based on usearch81 search_global) to a sequence screened in a previous round were discarded. The remaining sequences were deduplicated at 90% using CD-HIT and then split 90% and 10% into training and test groups, respectively. A training MSA was generated.
Rounds 2 and 3 MDH
Hmmsearch76 was used to search Mgnify82 for sequences containing exactly one Ldh_1_C and one Ldh_1_N domain. The list of Mgnify proteins was added to the list of UniProt proteins (curation described above). Additional quality filtering was performed. Sequences were deduplicated at 90% using CD-HIT. Sequences with identity greater than 85%, based on usearch search_global, to a sequence experimentally screened in round 1 were discarded. The remaining sequences were split into training (90%) and test (10%) sets. A training MSA was generated.
Phylogenetic trees
Trees were built using FastTree from MSAs generated by MAFFT83. Trees were midpoint-rooted and rendered using ETE3 (ref. 84).
Chorismate mutase and lysozymes
See the Supplementary Methods.
Generative models
ESM-MSA-1b sampling
Sequences were generated by iterative masking and sampling using the ESM-MSA-1b model48. ESM-MSA-1b is a neural network model trained to fill in the wild-type amino acids in masked positions of a protein MSA. The model can be used to generate new sequences by running MSA masking and prediction iteratively, each time replacing the wild-type amino acids at the masked positions with an amino acid drawn from the probability distribution returned by the model. The use of masked language models to generate new sequences was first proposed by Wang and Cho50, and the strategy has been applied to protein sequences in at least three prior works22,49,85.
See the Supplementary Methods for more detail on the parameters used.
ProteinGAN
Generative adversarial models were trained using the training sets for CuSOD and MDH. Then, for each family, sequences were generated by sampling vectors from the latent space using a truncated normal distribution. For rounds 1 and 2, 10,048 sequences were generated for each family. For round 3, 560,016 and 160,064 sequences were generated for CuSOD and MDH, respectively.
Ancestral sequence reconstruction
Maximum-likelihood trees were generated from the training set reference MSAs using FastTree86. Ancestral sequence reconstructions were generated from the trees using the joint reconstruction function of the GRASP28 command line tool. Metrics were calculated, and candidates were selected from the entire set of reconstructed sequences.
ProGen
See the Supplementary Methods.
Computational metrics
AlphaFold2
AlphaFold2 (ref. 44) was used to predict the structures of test sequences and all generated sequences that passed the first filtering step.
Phobius
The jphobius66 (https://phobius.sbc.su.se/data.html) executable was used to predict the presence of signal peptides or transmembrane domains.
ESM-1v and CARP-640M
Scores calculated from the ESM-1v39 and CARP-640M68 models were the average of the log probabilities of the amino acid in each position. Without masking, this calculation can be done with a single forward pass over each sequence. With partial masking, it can be done in a number of passes equal to 1/masked_fraction.
ESM-MSA
Scores from the ESM-MSA-1b48 model were calculated in a manner similar to that for ESM-1v scores, using the average log probability across the whole sequence. The metric was calculated using phmmer76 to find the 31 closest training sequences to each query, align the 32 sequences with MAFFT and calculate the average log probabilities from six passes with a masking interval of six.
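One way to read "six passes with a masking interval of six" is that each pass masks every sixth position at a different offset, so that every position is masked exactly once. The sketch below illustrates that reading only; it is an assumption, not the published implementation. The callable `log_prob_fn` is hypothetical and stands in for a model call that returns the log probability of the true residue at each masked position.

```python
def masking_passes(seq_len, interval=6):
    """Return one list of masked positions per pass.

    Pass `offset` masks positions offset, offset + interval, ...,
    so each position is masked in exactly one pass.
    """
    return [list(range(offset, seq_len, interval)) for offset in range(interval)]


def average_log_prob(sequence, log_prob_fn, interval=6):
    """Average per-residue log probability over all masked passes.

    log_prob_fn(sequence, masked_positions) is a hypothetical placeholder
    that must return one log probability per masked position.
    """
    total = 0.0
    for masked in masking_passes(len(sequence), interval):
        if not masked:
            continue  # sequences shorter than the interval leave some passes empty
        total += sum(log_prob_fn(sequence, masked))
    return total / len(sequence)


# Dummy scorer used only to show the call pattern: every residue scores -1.0.
def dummy_scorer(sequence, masked_positions):
    return [-1.0 for _ in masked_positions]


example_score = average_log_prob("MKTAYIAK", dummy_scorer)
```

With a masked fraction of 1/6, this schedule needs six forward passes per sequence, in line with the pass count stated for the partially masked scores.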
ProteinMPNN, ESM-IF and MIF-ST
The ProteinMPNN45 and ESM-IF46 scores are the average log likelihood of the query residues using the AlphaFold2-predicted structure. The MIF-ST47 score was calculated using the extract_mif.py script from the protein sequence models repository (https://github.com/microsoft/protein-sequence-models).
Rosetta-relax
The Rosetta (v2020.08.61146)43 relax program was used to relax the AlphaFold2 structures.
Distance to the closest training sequence
The most similar training sequence was found using ggsearch36 from the FASTA package87, the BLOSUM62 scoring matrix, a gap open penalty of 10 and a gap extend penalty of 2. The Hamming distance was then calculated from the gapped alignment between the query and the top hit sequence. Identity was calculated as 1 − Hamming_distance.
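The identity calculation can be illustrated with a short sketch. It assumes that the Hamming distance is normalized by the length of the gapped alignment and that a gap aligned to a residue counts as a mismatch; neither detail is stated explicitly in the text, and the function name is illustrative.

```python
def alignment_identity(aligned_query, aligned_hit):
    """Identity = 1 - normalized Hamming distance over a gapped pairwise alignment.

    Both strings must come from the same alignment and have equal length.
    Gap-versus-residue columns are counted as mismatches (an assumption).
    """
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    mismatches = sum(q != h for q, h in zip(aligned_query, aligned_hit))
    hamming_distance = mismatches / len(aligned_query)
    return 1.0 - hamming_distance


# Toy alignment: 5 columns, one substitution (D/E) and one gap column.
example_identity = alignment_identity("ACD-F", "ACEGF")
```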
BLOSUM62 and PFASUM15 mutant position mean
The closest training sequence was found using ggsearch36 as described above. From the alignment to the closest training sequence, the mean BLOSUM62 score37 across all mismatched positions was calculated, ignoring positions where either the query or the reference had a gap. We also calculated the alignments and scores using an alternative matrix, the PFASUM15 matrix67.
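A minimal sketch of the mismatch-mean calculation follows. The substitution matrix is passed in as a dictionary keyed by residue pairs; the three-entry `toy_matrix` is an illustrative excerpt only and should be replaced by a full BLOSUM62 or PFASUM15 matrix in practice. Function names are assumptions.

```python
def pair_score(matrix, a, b):
    """Look up a substitution score, treating the matrix as symmetric."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def mean_mismatch_score(aligned_query, aligned_ref, matrix, gap="-"):
    """Mean substitution score over mismatched, ungapped alignment columns.

    Returns None when the alignment has no mismatched columns.
    """
    scores = []
    for q, r in zip(aligned_query, aligned_ref):
        if q == gap or r == gap or q == r:
            continue  # skip gapped columns and identical residues
        scores.append(pair_score(matrix, q, r))
    if not scores:
        return None
    return sum(scores) / len(scores)


# Illustrative excerpt of a substitution matrix (not a complete matrix).
toy_matrix = {("D", "E"): 2, ("K", "R"): 2, ("A", "S"): 1}
example_mean = mean_mismatch_score("ACD-EKA", "ACEGDRS", toy_matrix)
```

In the toy alignment, the gapped column and the two identical columns are skipped, leaving four mismatches with scores 2, 2, 2 and 1.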
Longest repeat
Scores were calculated for the longest single-amino acid repeat and the longest 2-mer, 3-mer and 4-mer repeat in each sequence. The scores were calculated as −1 ⨯ the number of repeat units. Therefore, the sequence AAAAAA would have a single-amino acid repeat score of −6, a 2-mer score of −3, a 3-mer score of −2 and a 4-mer score of −1. The sequence LALALALA would have a 1-mer score of −1, a 2-mer score of −4, a 3-mer score of −1 and a 4-mer score of −2.
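The repeat scores in the worked examples above can be reproduced with the following sketch, which counts consecutive (tandem) copies of each k-mer starting at every position. The function name is illustrative, and the handling of sequences shorter than k (score 0) is an assumption.

```python
def longest_repeat_score(sequence, k):
    """Return -1 x the largest number of consecutive copies of any k-mer.

    Every start position is tried, so repeats that are out of frame
    relative to the start of the sequence are also found.
    """
    best = 0
    for start in range(len(sequence) - k + 1):
        unit = sequence[start:start + k]
        count = 1
        pos = start + k
        # Extend while the next k residues equal the repeat unit.
        while sequence[pos:pos + k] == unit:
            count += 1
            pos += k
        best = max(best, count)
    return -best


scores_poly_a = [longest_repeat_score("AAAAAA", k) for k in (1, 2, 3, 4)]
scores_la = [longest_repeat_score("LALALALA", k) for k in (1, 2, 3, 4)]
```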
SASA
SASA, polar SASA and apolar SASA were calculated from the AlphaFold2-predicted structures using the freesasa package (https://freesasa.github.io/). The percentage of polar SASA was calculated using the formula 100 ⨯ polar SASA/SASA.
Net charge, Abs(net charge) and charged fraction
Charges were calculated by summing the numbers of glutamate and aspartate residues and lysine and arginine residues for negative and positive charges, respectively.
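A minimal sketch of the three charge metrics is given below. The residue counting follows the description above; the definitions of net charge as positive minus negative counts and of charged fraction as charged residues divided by sequence length are assumptions, as is the omission of histidine.

```python
def charge_metrics(sequence):
    """Count D/E as negative and K/R as positive and derive charge metrics."""
    if not sequence:
        raise ValueError("sequence is empty")
    negative = sequence.count("D") + sequence.count("E")
    positive = sequence.count("K") + sequence.count("R")
    net = positive - negative  # assumed sign convention
    return {
        "net_charge": net,
        "abs_net_charge": abs(net),
        "charged_fraction": (positive + negative) / len(sequence),  # assumed definition
    }


example_charges = charge_metrics("MKRKDA")
```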
Avg(phmmer top 30)
The phmmer top 30 average score was calculated by running a phmmer search of the experimentally tested sequences against the training sequences and averaging the scores of the top 30 hits.
Selection of sequences for in vitro assays
Round 1
The selected sequences had between 70% and 80% identity to the closest training sequence and diverse scores on the ESM-1v metric.
Round 2 pretest
CuSOD sequences were selected on the basis of the kingdom of origin (eukaryotic, viral or bacterial) and the presence of Phobius-predicted signal peptides. Sequences with predicted signal peptides were truncated at the predicted signal peptide cleavage site. Two bacterial FeSOD proteins, both lacking a predicted signal peptide, and the previously characterized63 E. coli FeSOD (as a positive control) were also assayed.
Round 2
The selected sequences had between 80% and 90% identity to the closest training set sequence and diverse scores on the ESM-1v and ESM-MSA metrics. Sequences were also filtered by manual inspection to remove those with large insertions or deletions compared to the closest reference sequences or with long repeats, and a methionine was added to the start of some of the sequences.
Round 3
Sequences were selected on the basis of a series of filters. The first filter removed sequences having (1) less than 50% or greater than 80% identity to the closest training sequence; (2) an ESM-1v score below the top tenth percentile threshold compared to the test sequences; (3) no starting methionine; (4) a predicted transmembrane domain; or (5) a single-amino acid repeat longer than three amino acids or an amino acid pair repeat longer than four amino acids, as repeats were more common in ESM-MSA-generated sequences than in natural sequences (Supplementary Fig. 35). For each enzyme family, 200 ESM-MSA-generated sequences and 200 GAN-generated sequences were randomly selected from the sequences that passed the first filter, and their structures were predicted with AlphaFold2. ProteinMPNN scores were calculated for each structure, and the 40 sequences with the highest scores from each model−enzyme combination were retained. Of the top 40 sequences, 18 were randomly selected for expression and functional characterization. For each passing sequence that was selected for functional characterization, a corresponding control sequence was selected from the list of sequences that failed the sequence filter. Control sequences were matched so that their identity to the closest training sequence was within 1% of that of the passing sequence.
Newly generated ProGen lysozyme sequences
See the Supplementary Methods.
Experimental assays
Bacterial strains, plasmids and growth conditions
E. coli BL21(DE3) was used as the host strain for MDH and SOD expression in this study. Cells were grown on LB medium at 37 °C supplemented with 100 μg ml−1 ampicillin (cat. no. 171254, Merck).
Sequences were optimized based on E. coli-preferred codons using the Twist Bioscience web interface (www.twistbioscience.com). A 30-bp sequence (TTTGTTTAACTTTAAGAAGGAGATATACAT) composed of ribosomal binding site sequences and a spacer was added at the 5′ terminus of all genes. Genes were ordered from Twist Bioscience as clones in pET-21(+) between the EcoRI and NotI sites.
The pET21b plasmid harboring the MDH4 gene from a previous study17 was used as a positive control for MDH enzymes. Human SOD1 (ref. 61) (hSOD, GenBank: NP_000445.1), Potentilla atrosanguinea CuSOD62 (paSOD, GenBank: AFN42318.1) and E. coli SOD63 (E.SOD, GenBank: NP_416173.1) were codon optimized, synthesized as described above and used as positive controls for SOD enzymes. Blank plasmid pET21b was used as a negative control for both MDH and SOD enzymes.
Plasmid construction for truncated control sequences
See the Supplementary Methods and Supplementary Table 7.
Competent cell preparation and plasmid transformation
Competent cells of E. coli BL21(DE3) were prepared using the calcium chloride method88.
See the Supplementary Methods for details.
Protein expression and purification
Protein expression was achieved by diluting the overnight cultures 1:30 into 2.5 ml autoinduction Terrific Broth (TB) medium including trace elements (cat. no. AIMTB0210, Formedium) and supplemented with 100 µg ml−1 ampicillin in a 24-well format. All cells were cultivated in 24-well plates in an Eppendorf ThermoMixer C. For MDH expression, cells were grown for 4 h at 37 °C, followed by overnight growth at 16 °C with shaking at 200 rpm. For SOD expression, cells were grown for 4 h at 37 °C, followed by another 3 h at 25 °C with shaking at 200 rpm.
Cells were collected by centrifugation at 3,000g for 10 min. Cell pellets were suspended in 200 μl BugBuster reagent (cat. no. 70584, Merck) supplemented with 1 μl 2,000 U ml−1 DNase I (cat. no. 79254, Qiagen) and incubated at 37 °C with shaking at 200 rpm for 30 min. After incubation, 10 μl of each mixture was aliquoted and stored at −20 °C as the total protein (T) sample for gel electrophoresis. The mixture was centrifuged at maximum speed for 10 min and the pellets were discarded. Then, 10 μl of the supernatant was aliquoted and stored at −20 °C as the soluble protein (S) sample for gel electrophoresis. The supernatants were used for protein purification using the following procedures.
Talon resins (cat. no. 635653, Takara Bio) were washed twice with a binding buffer (50 mM NaH2PO4, 300 mM NaCl, 10 mM imidazole, pH 7.4) and then suspended in the same volume of binding buffer as the resin bed volume. Talon resin (50 µl) was loaded into Pierce microspin columns (cat. no. 89879, ThermoFisher). Each supernatant sample was added to the loaded column and incubated at 4 °C for 30 min in a thermomixer.
The columns were then centrifuged at 20g for 30 s and the flow-through was discarded. Resins were washed three times with 600 µl of wash buffer (50 mM NaH2PO4, 300 mM NaCl, 20 mM imidazole, pH 7.4) and centrifuged at 20g for 30 s each time. Finally, the resins were incubated with 100 µl of elution buffer at 4 °C for 30 min in a thermomixer and proteins were then eluted by centrifugation at 20g for 1 min. Another 100 µl of elution buffer was added to repeat the elution steps, and each of the two eluate fractions was mixed separately. The two eluate fractions were then combined and transferred to a 96-well desalting plate (cat. no. 89807, Thermo Scientific), which was pre-equilibrated with the sample buffer (50 mM NaH2PO4, 300 mM NaCl, pH 7.4). Protein samples were stored at −80 °C after adding 1⨯ protein-stabilizing cocktail (cat. no. 89806, Thermo Scientific). Then, 10 μl of each purified protein was aliquoted and stored at −20 °C as the purified protein (P) sample for gel electrophoresis.
For enzymes from round 2 and round 3 and the truncated enzymes from the round 2 pretest, protein concentrations were measured by Qubit Protein Assay (cat. no. Q33211, Thermo Scientific).
Gel electrophoresis
Total, soluble and purified proteins of each sample were mixed with 1⨯ loading buffer (4⨯ loading buffer recipe: 0.2 M Tris-HCl, 0.4 M DTT, 277 mM SDS, 6 mM bromophenol blue, 4.3 M glycerol) and then heated at 85 °C for 5 min in a PCR cycler. Denatured proteins were analyzed by SDS–PAGE with precast gels (cat. no. WG1403A, Thermo Scientific), followed by Coomassie staining with InstantBlue (cat. no. ISB1L-53, Kem-en-tec). Spectra multicolor broad-range protein ladder (cat. no. 26634, Thermo Scientific) was also loaded to analyze the protein sizes.
Enzymatic assay
To test for MDH activity, 2 μl or 100 μg ml−1 of purified protein in round 1 was added to a reaction mixture containing approximately 1.5 mM NADH (cat. no. 10128023001, Merck), 2.0 mM oxaloacetic acid (cat. no. O4126, Sigma) and 20 mM HEPES buffer (pH 7.4). Assays were performed in triplicate in a 96-well format. All components were added using multichannel pipettes to avoid a reaction time lag between wells. The final reaction volume was 100 µl, and the reaction was performed at room temperature in a clear 96-well microplate (cat. no. 0020821, Sarstedt). MDH activity was measured in triplicate by following NADH oxidation to NAD+, with an absorbance reading at 340 nm performed in kinetics mode for 15 min in a BMG Labtech SPECTROstar nano spectrophotometer. Unspecific oxidation of NADH was monitored in the no-substrate controls, and these values were subtracted from the other samples. Conversion from absorbance values to NADH concentration was performed using the Beer−Lambert law, c = A/(d ⨯ ε), in which the extinction coefficient ε is 6.22 mM−1 cm−1 and the path length for 100 μl in a 96-well plate (d) is 0.29 cm. For samples that did not show any catalytic activity, a tenfold amount, that is, 20 μl of purified protein, was used to perform the assay a second time.
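The Beer−Lambert conversion can be written as a short helper. The constants are the ones stated above; the function and variable names are illustrative, and subtraction of the no-substrate control is assumed to have been done before the call.

```python
EPSILON_NADH_PER_MM_CM = 6.22  # extinction coefficient of NADH at 340 nm (mM^-1 cm^-1)
PATH_LENGTH_CM = 0.29          # path length for 100 µl in a 96-well plate (cm)


def nadh_concentration_mm(absorbance,
                          path_length_cm=PATH_LENGTH_CM,
                          epsilon=EPSILON_NADH_PER_MM_CM):
    """Convert background-corrected A340 to NADH concentration in mM: c = A / (d * epsilon)."""
    return absorbance / (path_length_cm * epsilon)


# An absorbance equal to d * epsilon corresponds to 1 mM NADH.
example_concentration = nadh_concentration_mm(0.29 * 6.22)
```

With these constants, one absorbance unit corresponds to roughly 0.55 mM NADH.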
For MDH in round 2 and round 3, 20 μg ml−1 enzymes together with the positive-control MDH4 were used in the assay as described above for quantitative comparison of catalytic activities, except for samples 1564 and 1546 from round 2, for which a concentration of 0.2 μg ml−1 was used due to low protein yields.
SOD activity was measured with a SOD assay kit (cat. no. 19160, Sigma) in a 96-well format, and all components were added using multichannel pipettes to avoid a reaction time lag between wells. For SOD from round 1, an aliquot (2 µl) of purified protein was added to each well containing 98 µl working solution. Assays of each sample were performed in triplicate and in a single ‘No XO’ well. Xanthine oxidase working solution (10 μl) was added to each well at the end, except for the ‘No XO’ wells. ‘No SOD’ and ‘blank’ assays were also performed in triplicate. ‘No SOD’ wells contained 10 µl dilution buffer, 80 µl working solution and 10 μl xanthine oxidase working solution, whereas ‘blank’ wells contained 20 µl dilution buffer and 80 µl working solution. Plates were incubated in the plate reader, which was preset at 37 °C. Absorbance at 450 nm was measured in the kinetics mode for 30 min. For proteins that did not show any catalytic activity, a tenfold amount (20 μl) of purified protein was used to perform the assay a second time.
For SOD from round 2 and round 3, 5 μg ml−1 of enzymes were used in the assay as described above for quantitative comparison of catalytic activity.
To assay the truncated proteins, 85 μg ml−1 of all samples were used in the enzymatic assay.
For details on the lysozyme assays see the Supplementary Methods.
Data analysis
For MDH, the absorbance value was plotted over time. The absorbance values of all samples at the endpoint of the assay were compared to the negative control by t-test analysis. Samples were considered active if the end absorbance value was significantly lower than that of the negative control (P ≤ 0.05).
For SOD, enzyme activity was measured as the percentage inhibition of the rate of WST-1 formazan formation and calculated using the following equation with the absorbance value at 20 min. The inhibition rate was compared to the negative control by t-test, and samples with activity significantly higher than the negative control (P ≤ 0.05) were considered active.
SOD activity (inhibition rate %) = ((A − B) − (C − D))/(A − B) ⨯ 100, where A is the absorbance value of the ‘no SOD’ control, B is the absorbance value of the blank, C is the absorbance value of the sample and D is the absorbance value of the ‘no XO’ control.
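The inhibition-rate equation translates directly into code. The sketch below uses the same symbols A to D as the equation; the function name and the example absorbance values are illustrative, not measured data.

```python
def sod_inhibition_percent(a_no_sod, b_blank, c_sample, d_no_xo):
    """SOD activity as inhibition rate (%): ((A - B) - (C - D)) / (A - B) * 100."""
    uninhibited = a_no_sod - b_blank
    if uninhibited == 0:
        raise ValueError("'no SOD' and blank absorbances are equal; rate is undefined")
    return (uninhibited - (c_sample - d_no_xo)) / uninhibited * 100


# Made-up absorbances: the sample signal is half of the uninhibited signal.
example_inhibition = sod_inhibition_percent(a_no_sod=1.0, b_blank=0.2,
                                            c_sample=0.5, d_no_xo=0.1)
```

A sample whose background-corrected signal equals that of the ‘no SOD’ control gives 0% inhibition, and a sample with no signal above its ‘no XO’ well gives 100%.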
Assay data were analyzed using GraphPad Prism v8.0.0 for Windows, GraphPad Software (www.graphpad.com).
Semiquantitative comparisons of enzyme activities
Data from round 3 enzyme assays using 20 µg ml−1 MDH or 5 µg ml−1 SOD, as described above, were used for semiquantitative comparisons of enzyme-specific activity (Fig. 3d).
For MDH, MDH4 was used as a wild-type positive control, and for SOD, hSOD, paSOD and E.SOD were used as wild-type positive controls.
For MDH, absorbance at 340 nm was converted to NADH concentration and the average difference in the concentration between the 0 and 90 s time points of the assay was used as a measure of enzyme activity. Some enzymes, including the MDH4 control, converted substrate very quickly, such that much of the substrate was converted before the first time point. Therefore, we replaced any values below 275 µM at time 0 with the mean value from the negative control. Values were averaged over three technical replicates and divided by the average of the MDH4 samples.
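The MDH normalization can be sketched as below. Several details are assumptions made for illustration: the replacement value is taken to be the mean time-0 NADH concentration of the negative control, each replicate is represented as a (time 0, 90 s) concentration pair in µM, and all names and example numbers are hypothetical.

```python
from statistics import mean


def mdh_activity(replicates, neg_control_t0_mean_um, floor_um=275.0):
    """Mean drop in NADH concentration (µM) between 0 and 90 s.

    replicates: list of (conc_t0_um, conc_t90_um) pairs, one per technical replicate.
    Time-0 values below floor_um are replaced by the negative-control mean,
    to account for substrate converted before the first reading.
    """
    drops = []
    for conc_t0, conc_t90 in replicates:
        if conc_t0 < floor_um:
            conc_t0 = neg_control_t0_mean_um
        drops.append(conc_t0 - conc_t90)
    return mean(drops)


def relative_activity(sample_replicates, mdh4_replicates, neg_control_t0_mean_um):
    """Sample activity divided by the activity of the MDH4 positive control."""
    return (mdh_activity(sample_replicates, neg_control_t0_mean_um)
            / mdh_activity(mdh4_replicates, neg_control_t0_mean_um))


# Made-up numbers: MDH4 has already consumed most NADH by the first reading.
mdh4 = [(200.0, 100.0)] * 3
sample = [(1500.0, 800.0)] * 3
example_relative = relative_activity(sample, mdh4, neg_control_t0_mean_um=1500.0)
```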
For SOD, the inhibition rate (%), calculated as described above, was used as a measure of enzyme activity. Values were averaged over three technical replicates and divided by the average of the hSOD, paSOD and E.SOD samples.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.