From 05dc76efa1b0af1387d6d6b234914aa10fea96e9 Mon Sep 17 00:00:00 2001 From: Augustin Zidek Date: Fri, 31 Jan 2020 14:26:08 +0000 Subject: [PATCH] Add more details about AlphaFold features. PiperOrigin-RevId: 292534089 --- alphafold_casp13/README.md | 215 ++++++++++++++++++++++++++++++++++++- 1 file changed, 214 insertions(+), 1 deletion(-) diff --git a/alphafold_casp13/README.md b/alphafold_casp13/README.md index ee07c6d..64cece0 100644 --- a/alphafold_casp13/README.md +++ b/alphafold_casp13/README.md @@ -169,4 +169,217 @@ train/test split can be found in the `train_domains.txt` and `test_domains.txt` files in this repository. The split is based on the [CATH 2018-03-16](https://www.cathdb.info/) database. -Disclaimer: This is not an official Google product. +## Features + +There is currently no plan to open source the feature generation code as it is +tightly coupled to our internal infrastructure as well as external tools which +we cannot open source. + +Some features are needed only as placeholders to construct the model. These can +be set to all zeros when running the inference. Such features are marked in the +table below as not needed and you can just fill them with zeros when running +inference. + +The table below provides an overview of the features we used to make it possible +to reconstruct our feature generation code. Some features that require more +thorough explanation are explained in the section below the table. Note that +`NR` stands for number of residues, i.e. the length of the amino acid sequence: + +| Name | Needed | TF DType | Shape | Description | +|-----------------------------------|:------:|----------|-----------------|------------------------------------------------------------------------------------------------------------------------------| +| `aatype` | ✔️ | float32 | `(NR, 21)` | One hot encoding of amino acid types. The mapping is `ARNDCQEGHILKMFPSTWYVX -> range(21)`. See below. | +| `alpha_mask` | ❌ | int64 | `(NR, 1)` | Mask for `alpha_positions`. | +| `alpha_positions` | ❌ | float32 | `(NR, 3)` | `(x, y, z)` Carbon Alpha coordinates. | +| `beta_mask` | ❌ | int64 | `(NR, 1)` | Mask for `beta_positions`. | +| `beta_positions` | ❌ | float32 | `(NR, 3)` | `(x, y, z)` Carbon Beta coordinates. | +| `between_segment_residues` | ❌ | int64 | `(NR, 1)` | The number of between segment residues (BSR) at the next position. E.g. `ABCXXD` (`XX` is BSR) would be `[0,0,2,0]`. | +| `chain_name` | ❌ | string | `(1)` | The chain name. E.g. 'A', 'B', ... | +| `deletion_probability` | ✔️ | float32 | `(NR, 1)` | The fraction of sequences that had a deletion at this position. See below. | +| `domain_name` | ❌ | string | `(1)` | The domain name. | +| `gap_matrix` | ✔️ | float32 | `(NR, NR, 1)` | Covariation signal from the gapped states, this gives an indication of the variance induced due to gapped states. See below. | +| `hhblits_profile` | ❌ | float32 | `(NR, 22)` | A profile (probability distribution over amino acid types) computed using HHBlits MSA. Encoding: 20 amino acids + 'X' + '-'. | +| `hmm_profile` | ✔️ | float32 | `(NR, 30)` | The HHBlits HHM profile (from the `-ohhm` HHBlits output file). Asterisks in the output are replaced by 0.0. See below. | +| `key` | ❌ | string | `(1)` | The unique id of the protein. | +| `mutual_information` | ✔️ | float32 | `(NR, NR, 1)` | The average product corrected mutual information. See https://doi.org/10.1093/bioinformatics/btm604. | +| `non_gapped_profile` | ✔️ | float32 | `(NR, 21)` | A profile from amino acids only (discounting gaps). See below. | +| `num_alignments` | ✔️ | int64 | `(NR, 1)` | The number of HHBlits multiple sequence alignments. Has to be repeated `NR` times. See below. | +| `num_effective_alignments` | ❌ | float32 | `(1)` | The number of effective alignments (neff at 62 % sequence similarity). | +| `phi_angles` | ❌ | float32 | `(NR, 1)` | The phi angles. | +| `phi_mask` | ❌ | int64 | `(NR, 1)` | Mask for `phi_angles`. | +| `profile` | ❌ | float32 | `(NR, 21)` | A profile (probability distribution over amino acid types) computed using PSI-BLAST. Equivalent to the output of ChkParse. | +| `profile_with_prior` | ✔️ | float32 | `(NR, 22)` | Profile which takes into account priors and Blosum matrix. See equation 5 in https://doi.org/10.1093/nar/25.17.3389. | +| `profile_with_prior_without_gaps` | ✔️ | float32 | `(NR, 21)` | Same as `profile_with_prior` but without gaps included. | +| `pseudo_bias` | ✔️ | float32 | `(NR, 22)` | The bias computed in the MSA pseudolikelihood computation. | +| `pseudo_frob` | ✔️ | float32 | `(NR, NR, 1)` | Frobenius norm of `pseudolikelihood` (gaps not included). Similar to the output of CCMPred. | +| `pseudolikelihood` | ✔️ | float32 | `(NR, NR, 484)` | The weights computed in the MSA pseudolikelihood computation. | +| `psi_angles` | ❌ | float32 | `(NR, 1)` | The psi angles. | +| `psi_mask` | ❌ | int64 | `(NR, 1)` | Mask for `psi_angles`. | +| `residue_index` | ✔️ | int64 | `(NR, 1)` | Index of each residue giong from 0 to `NR - 1`. See below. | +| `resolution` | ❌ | float32 | `(1)` | The protein structure resolution. | +| `reweighted_profile` | ✔️ | float32 | `(NR, 22)` | Profile where sequences are reweighted to weight rarer sequences higher. See below. | +| `sec_structure` | ❌ | int64 | `(NR, 8)` | Secondary structure generated by DSSP and one-hot encoded by the mapping `-HETSGBI -> range(8)`. | +| `sec_structure_mask` | ❌ | int64 | `(NR, 1)` | Mask for `sec_structure_mask`. | +| `seq_length` | ✔️ | int64 | `(NR, 1)` | The length of the amino acid sequence. Has to be repeated `NR` times. See below. | +| `sequence` | ✔️ | string | `(1)` | The amino acid sequence (1-letter amino acid encoding). See below. | +| `solv_surf` | ❌ | float32 | `(NR, 1)` | Relative solvent accessible area computed using DSSP and then normalized by amino acid maximum accessibility. | +| `solv_surf_mask` | ❌ | int64 | `(NR, 1)` | Mask for `solv_surf`. | +| `superfamily` | ❌ | string | `(1)` | The superfamily CATH code. | + +### More details on needed features + +#### `aatype` + +One hot encoding of amino acid types. The following code converts an amino acid +string into the one-hot encoding: + +```python +def sequence_to_onehot(sequence): + """Maps the given sequence into a one-hot encoded matrix.""" + mapping = {aa: i for i, aa in enumerate('ARNDCQEGHILKMFPSTWYVX')} + num_entries = max(mapping.values()) + 1 + one_hot_arr = np.zeros((len(sequence), num_entries), dtype=np.int32) + + for aa_index, aa_type in enumerate(sequence): + aa_id = mapping[aa_type] + one_hot_arr[aa_index, aa_id] = 1 + + return one_hot_arr +``` + +#### `deletion_probability` + +The fraction of sequences that had a deletion (denoteby by a lowercase letter +in the A3M format) at this position. Example: + +``` +MSA = A c r k + A C r k + A C R k +deletion_probability = [[0], [1/3], [2/3], [1]] +``` + +#### `gap_matrix` + +Covariation signal from the gapped states, this gives an indication of the +variance induced due to gapped states. Example: + +``` +MSA = A A C D B D F J G B M A + - - C D B D F J G B M A + A A C D B - - J G B M A +gap_count = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]] +gap_matrix = np.matmul(gap_count.T, gap_count) +``` + +#### `hmm_profile` + +The HHBlits HHM profile (from the `-ohhm` HHBlits output file). Asterisks in the +output are replaced by 0.0. The following code parses the HHM file: + +```python +def extract_hmm_profile(hhm_file, sequence, asterisks_replace=0.0): + """Extracts information from the hmm file and replaces asterisks.""" + profile_part = hhm_file.split('#')[-1] + profile_part = profile_part.split('\n') + whole_profile = [i.split() for i in profile_part] + # This part strips away the header and the footer. + whole_profile = whole_profile[5:-2] + gap_profile = np.zeros((len(sequence), 10)) + aa_profile = np.zeros((len(sequence), 20)) + count_aa = 0 + count_gap = 0 + for line_values in whole_profile: + if len(line_values) == 23: + # The first and the last values in line_values are metadata, skip them. + for j, t in enumerate(line_values[2:-1]): + aa_profile[count_aa, j] = ( + 2**(-float(t) / 1000.) if t != '*' else asterisks_replace) + count_aa += 1 + elif len(line_values) == 10: + for j, t in enumerate(line_values): + gap_profile[count_gap, j] = ( + 2**(-float(t) / 1000.) if t != '*' else asterisks_replace) + count_gap += 1 + elif not line_values: + pass + else: + raise ValueError('Wrong length of line %s hhm file. Expected 0, 10 or 23' + 'got %d'%(line_values, len(line_values))) + hmm_profile = np.hstack([aa_profile, gap_profile]) + assert len(hmm_profile) == len(sequence) + return hmm_profile +``` + +#### `non_gapped_profile` + +A profile from amino acids only (discounting gaps). + +```python +def non_gapped_profile(amino_acids): + """Computes a profile from only amino acids and discounting gaps.""" + profile = np.zeros(21) + for aa in amino_acids: + if aa != 21: # Ignore gaps. + profile[aa] += 1. + return profile / np.sum(profile) +``` + +#### `num_alignments` + +The number of HHBlits multiple sequence alignments. Has to be repeated `NR` +times. For example, if there are 10 alignments for a sequence of length 8, then +`num_alignments = [[10], [10], [10], [10], [10], [10], [10], [10]]`. + +#### `pseudo_frob` +This feature collapses the 484 channels of pseudolikelihood into one by taking +the Frobenius norm of the 484 channels and then subtracting the Average Product +Correction of the computed Frobenius norm. The Frobenius norm does not take into +account the 22nd gap state. + +#### `pseudolikelihood` + +Parameters of a Potts Model coupling the amino acid types of particular residues +estimated by pseudolikelihood. See https://doi.org/10.1103/PhysRevE.87.012707 +for more details. + +#### `residue_index` + +Index of each residue giong from 0 to `NR - 1`. For example, the sequence `AACR` +has `residue_index = [[0], [1], [2], [3]]`. + +#### `reweighted_profile` + +Profile where sequences are reweighted to weight rarer sequences higher. The +sequence weights are calculated like this: + +```python +def sequence_weights(sequence_matrix): + """Compute sequence reweighting to weight rarer sequences higher.""" + num_rows, num_res = sequence_matrix.shape + cutoff = 0.62 * num_res + weights = np.ones(num_rows, dtype=np.float32) + for i in range(num_rows): + for j in range(i + 1, num_rows): + similarity = (sequence_matrix[i] == sequence_matrix[j]).sum() + if similarity > cutoff: + weights[i] += 1 + weights[j] += 1 + return 1.0 / weights +``` + +#### `seq_length` + +The length of the amino acid sequence. Has to be repeated `NR` times. For +example, the sequence `AACR` would have `seq_length = [[4], [4], [4], [4]]`. + +#### `sequence` + +The amino acid sequence (1-letter amino acid encoding). For example, a protein +with Alanine, Lysine, Arginine has `sequence = 'AKR'`. + + +# Disclaimer + +This is not an official Google product.