Add more details about AlphaFold features.

PiperOrigin-RevId: 292534089
This commit is contained in:
Augustin Zidek
2020-01-31 14:26:08 +00:00
committed by Diego de Las Casas
parent 0fdd013edb
commit 05dc76efa1
+214 -1
View File
@@ -169,4 +169,217 @@ train/test split can be found in the `train_domains.txt` and `test_domains.txt`
files in this repository. The split is based on the
[CATH 2018-03-16](https://www.cathdb.info/) database.
Disclaimer: This is not an official Google product.
## Features
There is currently no plan to open source the feature generation code as it is
tightly coupled to our internal infrastructure as well as external tools which
we cannot open source.
Some features are needed only as placeholders to construct the model. These can
be set to all zeros when running the inference. Such features are marked in the
table below as not needed and you can just fill them with zeros when running
inference.
The table below provides an overview of the features we used to make it possible
to reconstruct our feature generation code. Some features that require more
thorough explanation are explained in the section below the table. Note that
`NR` stands for number of residues, i.e. the length of the amino acid sequence:
| Name | Needed | TF DType | Shape | Description |
|-----------------------------------|:------:|----------|-----------------|------------------------------------------------------------------------------------------------------------------------------|
| `aatype` | ✔️ | float32 | `(NR, 21)` | One hot encoding of amino acid types. The mapping is `ARNDCQEGHILKMFPSTWYVX -> range(21)`. See below. |
| `alpha_mask` | ❌ | int64 | `(NR, 1)` | Mask for `alpha_positions`. |
| `alpha_positions` | ❌ | float32 | `(NR, 3)` | `(x, y, z)` Carbon Alpha coordinates. |
| `beta_mask` | ❌ | int64 | `(NR, 1)` | Mask for `beta_positions`. |
| `beta_positions` | ❌ | float32 | `(NR, 3)` | `(x, y, z)` Carbon Beta coordinates. |
| `between_segment_residues` | ❌ | int64 | `(NR, 1)` | The number of between segment residues (BSR) at the next position. E.g. `ABCXXD` (`XX` is BSR) would be `[0,0,2,0]`. |
| `chain_name` | ❌ | string | `(1)` | The chain name. E.g. 'A', 'B', ... |
| `deletion_probability` | ✔️ | float32 | `(NR, 1)` | The fraction of sequences that had a deletion at this position. See below. |
| `domain_name` | ❌ | string | `(1)` | The domain name. |
| `gap_matrix` | ✔️ | float32 | `(NR, NR, 1)` | Covariation signal from the gapped states, this gives an indication of the variance induced due to gapped states. See below. |
| `hhblits_profile` | ❌ | float32 | `(NR, 22)` | A profile (probability distribution over amino acid types) computed using HHBlits MSA. Encoding: 20 amino acids + 'X' + '-'. |
| `hmm_profile` | ✔️ | float32 | `(NR, 30)` | The HHBlits HHM profile (from the `-ohhm` HHBlits output file). Asterisks in the output are replaced by 0.0. See below. |
| `key` | ❌ | string | `(1)` | The unique id of the protein. |
| `mutual_information` | ✔️ | float32 | `(NR, NR, 1)` | The average product corrected mutual information. See https://doi.org/10.1093/bioinformatics/btm604. |
| `non_gapped_profile` | ✔️ | float32 | `(NR, 21)` | A profile from amino acids only (discounting gaps). See below. |
| `num_alignments` | ✔️ | int64 | `(NR, 1)` | The number of HHBlits multiple sequence alignments. Has to be repeated `NR` times. See below. |
| `num_effective_alignments` | ❌ | float32 | `(1)` | The number of effective alignments (neff at 62 % sequence similarity). |
| `phi_angles` | ❌ | float32 | `(NR, 1)` | The phi angles. |
| `phi_mask` | ❌ | int64 | `(NR, 1)` | Mask for `phi_angles`. |
| `profile` | ❌ | float32 | `(NR, 21)` | A profile (probability distribution over amino acid types) computed using PSI-BLAST. Equivalent to the output of ChkParse. |
| `profile_with_prior` | ✔️ | float32 | `(NR, 22)` | Profile which takes into account priors and Blosum matrix. See equation 5 in https://doi.org/10.1093/nar/25.17.3389. |
| `profile_with_prior_without_gaps` | ✔️ | float32 | `(NR, 21)` | Same as `profile_with_prior` but without gaps included. |
| `pseudo_bias` | ✔️ | float32 | `(NR, 22)` | The bias computed in the MSA pseudolikelihood computation. |
| `pseudo_frob` | ✔️ | float32 | `(NR, NR, 1)` | Frobenius norm of `pseudolikelihood` (gaps not included). Similar to the output of CCMPred. |
| `pseudolikelihood` | ✔️ | float32 | `(NR, NR, 484)` | The weights computed in the MSA pseudolikelihood computation. |
| `psi_angles` | ❌ | float32 | `(NR, 1)` | The psi angles. |
| `psi_mask` | ❌ | int64 | `(NR, 1)` | Mask for `psi_angles`. |
| `residue_index` | ✔️ | int64 | `(NR, 1)` | Index of each residue giong from 0 to `NR - 1`. See below. |
| `resolution` | ❌ | float32 | `(1)` | The protein structure resolution. |
| `reweighted_profile` | ✔️ | float32 | `(NR, 22)` | Profile where sequences are reweighted to weight rarer sequences higher. See below. |
| `sec_structure` | ❌ | int64 | `(NR, 8)` | Secondary structure generated by DSSP and one-hot encoded by the mapping `-HETSGBI -> range(8)`. |
| `sec_structure_mask` | ❌ | int64 | `(NR, 1)` | Mask for `sec_structure_mask`. |
| `seq_length` | ✔️ | int64 | `(NR, 1)` | The length of the amino acid sequence. Has to be repeated `NR` times. See below. |
| `sequence` | ✔️ | string | `(1)` | The amino acid sequence (1-letter amino acid encoding). See below. |
| `solv_surf` | ❌ | float32 | `(NR, 1)` | Relative solvent accessible area computed using DSSP and then normalized by amino acid maximum accessibility. |
| `solv_surf_mask` | ❌ | int64 | `(NR, 1)` | Mask for `solv_surf`. |
| `superfamily` | ❌ | string | `(1)` | The superfamily CATH code. |
### More details on needed features
#### `aatype`
One hot encoding of amino acid types. The following code converts an amino acid
string into the one-hot encoding:
```python
def sequence_to_onehot(sequence):
"""Maps the given sequence into a one-hot encoded matrix."""
mapping = {aa: i for i, aa in enumerate('ARNDCQEGHILKMFPSTWYVX')}
num_entries = max(mapping.values()) + 1
one_hot_arr = np.zeros((len(sequence), num_entries), dtype=np.int32)
for aa_index, aa_type in enumerate(sequence):
aa_id = mapping[aa_type]
one_hot_arr[aa_index, aa_id] = 1
return one_hot_arr
```
#### `deletion_probability`
The fraction of sequences that had a deletion (denoteby by a lowercase letter
in the A3M format) at this position. Example:
```
MSA = A c r k
A C r k
A C R k
deletion_probability = [[0], [1/3], [2/3], [1]]
```
#### `gap_matrix`
Covariation signal from the gapped states, this gives an indication of the
variance induced due to gapped states. Example:
```
MSA = A A C D B D F J G B M A
- - C D B D F J G B M A
A A C D B - - J G B M A
gap_count = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]]
gap_matrix = np.matmul(gap_count.T, gap_count)
```
#### `hmm_profile`
The HHBlits HHM profile (from the `-ohhm` HHBlits output file). Asterisks in the
output are replaced by 0.0. The following code parses the HHM file:
```python
def extract_hmm_profile(hhm_file, sequence, asterisks_replace=0.0):
"""Extracts information from the hmm file and replaces asterisks."""
profile_part = hhm_file.split('#')[-1]
profile_part = profile_part.split('\n')
whole_profile = [i.split() for i in profile_part]
# This part strips away the header and the footer.
whole_profile = whole_profile[5:-2]
gap_profile = np.zeros((len(sequence), 10))
aa_profile = np.zeros((len(sequence), 20))
count_aa = 0
count_gap = 0
for line_values in whole_profile:
if len(line_values) == 23:
# The first and the last values in line_values are metadata, skip them.
for j, t in enumerate(line_values[2:-1]):
aa_profile[count_aa, j] = (
2**(-float(t) / 1000.) if t != '*' else asterisks_replace)
count_aa += 1
elif len(line_values) == 10:
for j, t in enumerate(line_values):
gap_profile[count_gap, j] = (
2**(-float(t) / 1000.) if t != '*' else asterisks_replace)
count_gap += 1
elif not line_values:
pass
else:
raise ValueError('Wrong length of line %s hhm file. Expected 0, 10 or 23'
'got %d'%(line_values, len(line_values)))
hmm_profile = np.hstack([aa_profile, gap_profile])
assert len(hmm_profile) == len(sequence)
return hmm_profile
```
#### `non_gapped_profile`
A profile from amino acids only (discounting gaps).
```python
def non_gapped_profile(amino_acids):
"""Computes a profile from only amino acids and discounting gaps."""
profile = np.zeros(21)
for aa in amino_acids:
if aa != 21: # Ignore gaps.
profile[aa] += 1.
return profile / np.sum(profile)
```
#### `num_alignments`
The number of HHBlits multiple sequence alignments. Has to be repeated `NR`
times. For example, if there are 10 alignments for a sequence of length 8, then
`num_alignments = [[10], [10], [10], [10], [10], [10], [10], [10]]`.
#### `pseudo_frob`
This feature collapses the 484 channels of pseudolikelihood into one by taking
the Frobenius norm of the 484 channels and then subtracting the Average Product
Correction of the computed Frobenius norm. The Frobenius norm does not take into
account the 22nd gap state.
#### `pseudolikelihood`
Parameters of a Potts Model coupling the amino acid types of particular residues
estimated by pseudolikelihood. See https://doi.org/10.1103/PhysRevE.87.012707
for more details.
#### `residue_index`
Index of each residue giong from 0 to `NR - 1`. For example, the sequence `AACR`
has `residue_index = [[0], [1], [2], [3]]`.
#### `reweighted_profile`
Profile where sequences are reweighted to weight rarer sequences higher. The
sequence weights are calculated like this:
```python
def sequence_weights(sequence_matrix):
"""Compute sequence reweighting to weight rarer sequences higher."""
num_rows, num_res = sequence_matrix.shape
cutoff = 0.62 * num_res
weights = np.ones(num_rows, dtype=np.float32)
for i in range(num_rows):
for j in range(i + 1, num_rows):
similarity = (sequence_matrix[i] == sequence_matrix[j]).sum()
if similarity > cutoff:
weights[i] += 1
weights[j] += 1
return 1.0 / weights
```
#### `seq_length`
The length of the amino acid sequence. Has to be repeated `NR` times. For
example, the sequence `AACR` would have `seq_length = [[4], [4], [4], [4]]`.
#### `sequence`
The amino acid sequence (1-letter amino acid encoding). For example, a protein
with Alanine, Lysine, Arginine has `sequence = 'AKR'`.
# Disclaimer
This is not an official Google product.