From 05dc76efa1b0af1387d6d6b234914aa10fea96e9 Mon Sep 17 00:00:00 2001
From: Augustin Zidek <augustinzidek@google.com>
Date: Fri, 31 Jan 2020 14:26:08 +0000
Subject: [PATCH] Add more details about AlphaFold features.

PiperOrigin-RevId: 292534089
---
 alphafold_casp13/README.md | 215 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 214 insertions(+), 1 deletion(-)

diff --git a/alphafold_casp13/README.md b/alphafold_casp13/README.md
index ee07c6d..64cece0 100644
--- a/alphafold_casp13/README.md
+++ b/alphafold_casp13/README.md
@@ -169,4 +169,217 @@ train/test split can be found in the `train_domains.txt` and `test_domains.txt`
 files in this repository. The split is based on the
 [CATH 2018-03-16](https://www.cathdb.info/) database.
 
-Disclaimer: This is not an official Google product.
+## Features
+
+There is currently no plan to open source the feature generation code as it is
+tightly coupled to our internal infrastructure as well as external tools which
+we cannot open source.
+
+Some features are needed only as placeholders to construct the model. These can
+be set to all zeros when running the inference. Such features are marked in the
+table below as not needed and you can just fill them with zeros when running
+inference.
+
+The table below provides an overview of the features we used to make it possible
+to reconstruct our feature generation code. Some features that require more
+thorough explanation are explained in the section below the table. Note that
+`NR` stands for number of residues, i.e. the length of the amino acid sequence:
+
+| Name                              | Needed | TF DType | Shape           | Description                                                                                                                  |
+|-----------------------------------|:------:|----------|-----------------|------------------------------------------------------------------------------------------------------------------------------|
+| `aatype`                          | ✔️      | float32  | `(NR, 21)`      | One hot encoding of amino acid types. The mapping is `ARNDCQEGHILKMFPSTWYVX -> range(21)`. See below.                        |
+| `alpha_mask`                      | ❌     | int64    | `(NR, 1)`       | Mask for `alpha_positions`.                                                                                                  |
+| `alpha_positions`                 | ❌     | float32  | `(NR, 3)`       | `(x, y, z)` Carbon Alpha coordinates.                                                                                        |
+| `beta_mask`                       | ❌     | int64    | `(NR, 1)`       | Mask for `beta_positions`.                                                                                                   |
+| `beta_positions`                  | ❌     | float32  | `(NR, 3)`       | `(x, y, z)` Carbon Beta coordinates.                                                                                         |
+| `between_segment_residues`        | ❌     | int64    | `(NR, 1)`       | The number of between segment residues (BSR) at the next position. E.g. `ABCXXD` (`XX` is BSR) would be `[0,0,2,0]`.         |
+| `chain_name`                      | ❌     | string   | `(1)`           | The chain name. E.g. 'A', 'B', ...                                                                                           |
+| `deletion_probability`            | ✔️      | float32  | `(NR, 1)`       | The fraction of sequences that had a deletion at this position. See below.                                                   |
+| `domain_name`                     | ❌     | string   | `(1)`           | The domain name.                                                                                                             |
+| `gap_matrix`                      | ✔️      | float32  | `(NR, NR, 1)`   | Covariation signal from the gapped states, this gives an indication of the variance induced due to gapped states. See below. |
+| `hhblits_profile`                 | ❌     | float32  | `(NR, 22)`      | A profile (probability distribution over amino acid types) computed using HHBlits MSA. Encoding: 20 amino acids + 'X' + '-'. |
+| `hmm_profile`                     | ✔️      | float32  | `(NR, 30)`      | The HHBlits HHM profile (from the `-ohhm` HHBlits output file). Asterisks in the output are replaced by 0.0. See below.      |
+| `key`                             | ❌     | string   | `(1)`           | The unique id of the protein.                                                                                                |
+| `mutual_information`              | ✔️      | float32  | `(NR, NR, 1)`   | The average product corrected mutual information. See https://doi.org/10.1093/bioinformatics/btm604.                         |
+| `non_gapped_profile`              | ✔️      | float32  | `(NR, 21)`      | A profile from amino acids only (discounting gaps). See below.                                                               |
+| `num_alignments`                  | ✔️      | int64    | `(NR, 1)`       | The number of HHBlits multiple sequence alignments. Has to be repeated `NR` times. See below.                                |
+| `num_effective_alignments`        | ❌     | float32  | `(1)`           | The number of effective alignments (neff at 62 % sequence similarity).                                                       |
+| `phi_angles`                      | ❌     | float32  | `(NR, 1)`       | The phi angles.                                                                                                              |
+| `phi_mask`                        | ❌     | int64    | `(NR, 1)`       | Mask for `phi_angles`.                                                                                                       |
+| `profile`                         | ❌     | float32  | `(NR, 21)`      | A profile (probability distribution over amino acid types) computed using PSI-BLAST. Equivalent to the output of ChkParse.   |
+| `profile_with_prior`              | ✔️      | float32  | `(NR, 22)`      | Profile which takes into account priors and Blosum matrix. See equation 5 in https://doi.org/10.1093/nar/25.17.3389.         |
+| `profile_with_prior_without_gaps` | ✔️      | float32  | `(NR, 21)`      | Same as `profile_with_prior` but without gaps included.                                                                      |
+| `pseudo_bias`                     | ✔️      | float32  | `(NR, 22)`      | The bias computed in the MSA pseudolikelihood computation.                                                                   |
+| `pseudo_frob`                     | ✔️      | float32  | `(NR, NR, 1)`   | Frobenius norm of `pseudolikelihood` (gaps not included). Similar to the output of CCMPred.                                  |
+| `pseudolikelihood`                | ✔️      | float32  | `(NR, NR, 484)` | The weights computed in the MSA pseudolikelihood computation.                                                                |
+| `psi_angles`                      | ❌     | float32  | `(NR, 1)`       | The psi angles.                                                                                                              |
+| `psi_mask`                        | ❌     | int64    | `(NR, 1)`       | Mask for `psi_angles`.                                                                                                       |
+| `residue_index`                   | ✔️      | int64    | `(NR, 1)`       | Index of each residue giong from 0 to `NR - 1`. See below.                                                                   |
+| `resolution`                      | ❌     | float32  | `(1)`           | The protein structure resolution.                                                                                            |
+| `reweighted_profile`              | ✔️      | float32  | `(NR, 22)`      | Profile where sequences are reweighted to weight rarer sequences higher. See below.                                          |
+| `sec_structure`                   | ❌     | int64    | `(NR, 8)`       | Secondary structure generated by DSSP and one-hot encoded by the mapping `-HETSGBI -> range(8)`.                             |
+| `sec_structure_mask`              | ❌     | int64    | `(NR, 1)`       | Mask for `sec_structure_mask`.                                                                                               |
+| `seq_length`                      | ✔️      | int64    | `(NR, 1)`       | The length of the amino acid sequence. Has to be repeated `NR` times. See below.                                             |
+| `sequence`                        | ✔️      | string   | `(1)`           | The amino acid sequence (1-letter amino acid encoding). See below.                                                           |
+| `solv_surf`                       | ❌     | float32  | `(NR, 1)`       | Relative solvent accessible area computed using DSSP and then normalized by amino acid maximum accessibility.                |
+| `solv_surf_mask`                  | ❌     | int64    | `(NR, 1)`       | Mask for `solv_surf`.                                                                                                        |
+| `superfamily`                     | ❌     | string   | `(1)`           | The superfamily CATH code.                                                                                                   |
+
+### More details on needed features
+
+#### `aatype`
+
+One hot encoding of amino acid types. The following code converts an amino acid
+string into the one-hot encoding:
+
+```python
+def sequence_to_onehot(sequence):
+  """Maps the given sequence into a one-hot encoded matrix."""
+  mapping = {aa: i for i, aa in enumerate('ARNDCQEGHILKMFPSTWYVX')}
+  num_entries = max(mapping.values()) + 1
+  one_hot_arr = np.zeros((len(sequence), num_entries), dtype=np.int32)
+
+  for aa_index, aa_type in enumerate(sequence):
+    aa_id = mapping[aa_type]
+    one_hot_arr[aa_index, aa_id] = 1
+
+  return one_hot_arr
+```
+
+#### `deletion_probability`
+
+The fraction of sequences that had a deletion (denoteby by a lowercase letter
+in the A3M format) at this position. Example:
+
+```
+MSA = A c r k
+      A C r k
+      A C R k
+deletion_probability = [[0], [1/3], [2/3], [1]]
+```
+
+#### `gap_matrix`
+
+Covariation signal from the gapped states, this gives an indication of the
+variance induced due to gapped states. Example:
+
+```
+MSA = A A C D B D F J G B M A
+      - - C D B D F J G B M A
+      A A C D B - - J G B M A
+gap_count = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+             [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+             [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]]
+gap_matrix = np.matmul(gap_count.T, gap_count)
+```
+
+#### `hmm_profile`
+
+The HHBlits HHM profile (from the `-ohhm` HHBlits output file). Asterisks in the
+output are replaced by 0.0. The following code parses the HHM file:
+
+```python
+def extract_hmm_profile(hhm_file, sequence, asterisks_replace=0.0):
+  """Extracts information from the hmm file and replaces asterisks."""
+  profile_part = hhm_file.split('#')[-1]
+  profile_part = profile_part.split('\n')
+  whole_profile = [i.split() for i in profile_part]
+  # This part strips away the header and the footer.
+  whole_profile = whole_profile[5:-2]
+  gap_profile = np.zeros((len(sequence), 10))
+  aa_profile = np.zeros((len(sequence), 20))
+  count_aa = 0
+  count_gap = 0
+  for line_values in whole_profile:
+    if len(line_values) == 23:
+      # The first and the last values in line_values are metadata, skip them.
+      for j, t in enumerate(line_values[2:-1]):
+        aa_profile[count_aa, j] = (
+            2**(-float(t) / 1000.) if t != '*' else asterisks_replace)
+      count_aa += 1
+    elif len(line_values) == 10:
+      for j, t in enumerate(line_values):
+        gap_profile[count_gap, j] = (
+            2**(-float(t) / 1000.) if t != '*' else asterisks_replace)
+      count_gap += 1
+    elif not line_values:
+      pass
+    else:
+      raise ValueError('Wrong length of line %s hhm file. Expected 0, 10 or 23'
+                       'got %d'%(line_values, len(line_values)))
+  hmm_profile = np.hstack([aa_profile, gap_profile])
+  assert len(hmm_profile) == len(sequence)
+  return hmm_profile
+```
+
+#### `non_gapped_profile`
+
+A profile from amino acids only (discounting gaps).
+
+```python
+def non_gapped_profile(amino_acids):
+  """Computes a profile from only amino acids and discounting gaps."""
+  profile = np.zeros(21)
+  for aa in amino_acids:
+    if aa != 21:  # Ignore gaps.
+      profile[aa] += 1.
+  return profile / np.sum(profile)
+```
+
+#### `num_alignments`
+
+The number of HHBlits multiple sequence alignments. Has to be repeated `NR`
+times. For example, if there are 10 alignments for a sequence of length 8, then
+`num_alignments = [[10], [10], [10], [10], [10], [10], [10], [10]]`.
+
+#### `pseudo_frob`
+This feature collapses the 484 channels of pseudolikelihood into one by taking
+the Frobenius norm of the 484 channels and then subtracting the Average Product
+Correction of the computed Frobenius norm. The Frobenius norm does not take into
+account the 22nd gap state.
+
+#### `pseudolikelihood`
+
+Parameters of a Potts Model coupling the amino acid types of particular residues
+estimated by pseudolikelihood. See https://doi.org/10.1103/PhysRevE.87.012707
+for more details.
+
+#### `residue_index`
+
+Index of each residue giong from 0 to `NR - 1`. For example, the sequence `AACR`
+has `residue_index = [[0], [1], [2], [3]]`.
+
+#### `reweighted_profile`
+
+Profile where sequences are reweighted to weight rarer sequences higher. The
+sequence weights are calculated like this:
+
+```python
+def sequence_weights(sequence_matrix):
+  """Compute sequence reweighting to weight rarer sequences higher."""
+  num_rows, num_res = sequence_matrix.shape
+  cutoff = 0.62 * num_res
+  weights = np.ones(num_rows, dtype=np.float32)
+  for i in range(num_rows):
+    for j in range(i + 1, num_rows):
+      similarity = (sequence_matrix[i] == sequence_matrix[j]).sum()
+      if similarity > cutoff:
+        weights[i] += 1
+        weights[j] += 1
+  return 1.0 / weights
+```
+
+#### `seq_length`
+
+The length of the amino acid sequence. Has to be repeated `NR` times. For
+example, the sequence `AACR` would have  `seq_length = [[4], [4], [4], [4]]`.
+
+#### `sequence`
+
+The amino acid sequence (1-letter amino acid encoding). For example, a protein
+with Alanine, Lysine, Arginine has `sequence = 'AKR'`.
+
+
+# Disclaimer
+
+This is not an official Google product.