mirror of
https://github.com/google-deepmind/deepmind-research.git
synced 2026-05-09 21:07:49 +08:00
Add more details about AlphaFold features.
PiperOrigin-RevId: 292534089
This commit is contained in:
committed by
Diego de Las Casas
parent
0fdd013edb
commit
05dc76efa1
+214
-1
@@ -169,4 +169,217 @@ train/test split can be found in the `train_domains.txt` and `test_domains.txt`
|
||||
files in this repository. The split is based on the
|
||||
[CATH 2018-03-16](https://www.cathdb.info/) database.
|
||||
|
||||
Disclaimer: This is not an official Google product.
|
||||
## Features
|
||||
|
||||
There is currently no plan to open source the feature generation code as it is
|
||||
tightly coupled to our internal infrastructure as well as external tools which
|
||||
we cannot open source.
|
||||
|
||||
Some features are needed only as placeholders to construct the model. These can
|
||||
be set to all zeros when running the inference. Such features are marked in the
|
||||
table below as not needed and you can just fill them with zeros when running
|
||||
inference.
|
||||
|
||||
The table below provides an overview of the features we used to make it possible
|
||||
to reconstruct our feature generation code. Some features that require more
|
||||
thorough explanation are explained in the section below the table. Note that
|
||||
`NR` stands for number of residues, i.e. the length of the amino acid sequence:
|
||||
|
||||
| Name | Needed | TF DType | Shape | Description |
|
||||
|-----------------------------------|:------:|----------|-----------------|------------------------------------------------------------------------------------------------------------------------------|
|
||||
| `aatype` | ✔️ | float32 | `(NR, 21)` | One hot encoding of amino acid types. The mapping is `ARNDCQEGHILKMFPSTWYVX -> range(21)`. See below. |
|
||||
| `alpha_mask` | ❌ | int64 | `(NR, 1)` | Mask for `alpha_positions`. |
|
||||
| `alpha_positions` | ❌ | float32 | `(NR, 3)` | `(x, y, z)` Carbon Alpha coordinates. |
|
||||
| `beta_mask` | ❌ | int64 | `(NR, 1)` | Mask for `beta_positions`. |
|
||||
| `beta_positions` | ❌ | float32 | `(NR, 3)` | `(x, y, z)` Carbon Beta coordinates. |
|
||||
| `between_segment_residues` | ❌ | int64 | `(NR, 1)` | The number of between segment residues (BSR) at the next position. E.g. `ABCXXD` (`XX` is BSR) would be `[0,0,2,0]`. |
|
||||
| `chain_name` | ❌ | string | `(1)` | The chain name. E.g. 'A', 'B', ... |
|
||||
| `deletion_probability` | ✔️ | float32 | `(NR, 1)` | The fraction of sequences that had a deletion at this position. See below. |
|
||||
| `domain_name` | ❌ | string | `(1)` | The domain name. |
|
||||
| `gap_matrix` | ✔️ | float32 | `(NR, NR, 1)` | Covariation signal from the gapped states, this gives an indication of the variance induced due to gapped states. See below. |
|
||||
| `hhblits_profile` | ❌ | float32 | `(NR, 22)` | A profile (probability distribution over amino acid types) computed using HHBlits MSA. Encoding: 20 amino acids + 'X' + '-'. |
|
||||
| `hmm_profile` | ✔️ | float32 | `(NR, 30)` | The HHBlits HHM profile (from the `-ohhm` HHBlits output file). Asterisks in the output are replaced by 0.0. See below. |
|
||||
| `key` | ❌ | string | `(1)` | The unique id of the protein. |
|
||||
| `mutual_information` | ✔️ | float32 | `(NR, NR, 1)` | The average product corrected mutual information. See https://doi.org/10.1093/bioinformatics/btm604. |
|
||||
| `non_gapped_profile` | ✔️ | float32 | `(NR, 21)` | A profile from amino acids only (discounting gaps). See below. |
|
||||
| `num_alignments` | ✔️ | int64 | `(NR, 1)` | The number of HHBlits multiple sequence alignments. Has to be repeated `NR` times. See below. |
|
||||
| `num_effective_alignments` | ❌ | float32 | `(1)` | The number of effective alignments (neff at 62 % sequence similarity). |
|
||||
| `phi_angles` | ❌ | float32 | `(NR, 1)` | The phi angles. |
|
||||
| `phi_mask` | ❌ | int64 | `(NR, 1)` | Mask for `phi_angles`. |
|
||||
| `profile` | ❌ | float32 | `(NR, 21)` | A profile (probability distribution over amino acid types) computed using PSI-BLAST. Equivalent to the output of ChkParse. |
|
||||
| `profile_with_prior` | ✔️ | float32 | `(NR, 22)` | Profile which takes into account priors and Blosum matrix. See equation 5 in https://doi.org/10.1093/nar/25.17.3389. |
|
||||
| `profile_with_prior_without_gaps` | ✔️ | float32 | `(NR, 21)` | Same as `profile_with_prior` but without gaps included. |
|
||||
| `pseudo_bias` | ✔️ | float32 | `(NR, 22)` | The bias computed in the MSA pseudolikelihood computation. |
|
||||
| `pseudo_frob` | ✔️ | float32 | `(NR, NR, 1)` | Frobenius norm of `pseudolikelihood` (gaps not included). Similar to the output of CCMPred. |
|
||||
| `pseudolikelihood` | ✔️ | float32 | `(NR, NR, 484)` | The weights computed in the MSA pseudolikelihood computation. |
|
||||
| `psi_angles` | ❌ | float32 | `(NR, 1)` | The psi angles. |
|
||||
| `psi_mask` | ❌ | int64 | `(NR, 1)` | Mask for `psi_angles`. |
|
||||
| `residue_index` | ✔️ | int64 | `(NR, 1)` | Index of each residue giong from 0 to `NR - 1`. See below. |
|
||||
| `resolution` | ❌ | float32 | `(1)` | The protein structure resolution. |
|
||||
| `reweighted_profile` | ✔️ | float32 | `(NR, 22)` | Profile where sequences are reweighted to weight rarer sequences higher. See below. |
|
||||
| `sec_structure` | ❌ | int64 | `(NR, 8)` | Secondary structure generated by DSSP and one-hot encoded by the mapping `-HETSGBI -> range(8)`. |
|
||||
| `sec_structure_mask` | ❌ | int64 | `(NR, 1)` | Mask for `sec_structure_mask`. |
|
||||
| `seq_length` | ✔️ | int64 | `(NR, 1)` | The length of the amino acid sequence. Has to be repeated `NR` times. See below. |
|
||||
| `sequence` | ✔️ | string | `(1)` | The amino acid sequence (1-letter amino acid encoding). See below. |
|
||||
| `solv_surf` | ❌ | float32 | `(NR, 1)` | Relative solvent accessible area computed using DSSP and then normalized by amino acid maximum accessibility. |
|
||||
| `solv_surf_mask` | ❌ | int64 | `(NR, 1)` | Mask for `solv_surf`. |
|
||||
| `superfamily` | ❌ | string | `(1)` | The superfamily CATH code. |
|
||||
|
||||
### More details on needed features
|
||||
|
||||
#### `aatype`
|
||||
|
||||
One hot encoding of amino acid types. The following code converts an amino acid
|
||||
string into the one-hot encoding:
|
||||
|
||||
```python
|
||||
def sequence_to_onehot(sequence):
|
||||
"""Maps the given sequence into a one-hot encoded matrix."""
|
||||
mapping = {aa: i for i, aa in enumerate('ARNDCQEGHILKMFPSTWYVX')}
|
||||
num_entries = max(mapping.values()) + 1
|
||||
one_hot_arr = np.zeros((len(sequence), num_entries), dtype=np.int32)
|
||||
|
||||
for aa_index, aa_type in enumerate(sequence):
|
||||
aa_id = mapping[aa_type]
|
||||
one_hot_arr[aa_index, aa_id] = 1
|
||||
|
||||
return one_hot_arr
|
||||
```
|
||||
|
||||
#### `deletion_probability`
|
||||
|
||||
The fraction of sequences that had a deletion (denoteby by a lowercase letter
|
||||
in the A3M format) at this position. Example:
|
||||
|
||||
```
|
||||
MSA = A c r k
|
||||
A C r k
|
||||
A C R k
|
||||
deletion_probability = [[0], [1/3], [2/3], [1]]
|
||||
```
|
||||
|
||||
#### `gap_matrix`
|
||||
|
||||
Covariation signal from the gapped states, this gives an indication of the
|
||||
variance induced due to gapped states. Example:
|
||||
|
||||
```
|
||||
MSA = A A C D B D F J G B M A
|
||||
- - C D B D F J G B M A
|
||||
A A C D B - - J G B M A
|
||||
gap_count = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||
[0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]]
|
||||
gap_matrix = np.matmul(gap_count.T, gap_count)
|
||||
```
|
||||
|
||||
#### `hmm_profile`
|
||||
|
||||
The HHBlits HHM profile (from the `-ohhm` HHBlits output file). Asterisks in the
|
||||
output are replaced by 0.0. The following code parses the HHM file:
|
||||
|
||||
```python
|
||||
def extract_hmm_profile(hhm_file, sequence, asterisks_replace=0.0):
|
||||
"""Extracts information from the hmm file and replaces asterisks."""
|
||||
profile_part = hhm_file.split('#')[-1]
|
||||
profile_part = profile_part.split('\n')
|
||||
whole_profile = [i.split() for i in profile_part]
|
||||
# This part strips away the header and the footer.
|
||||
whole_profile = whole_profile[5:-2]
|
||||
gap_profile = np.zeros((len(sequence), 10))
|
||||
aa_profile = np.zeros((len(sequence), 20))
|
||||
count_aa = 0
|
||||
count_gap = 0
|
||||
for line_values in whole_profile:
|
||||
if len(line_values) == 23:
|
||||
# The first and the last values in line_values are metadata, skip them.
|
||||
for j, t in enumerate(line_values[2:-1]):
|
||||
aa_profile[count_aa, j] = (
|
||||
2**(-float(t) / 1000.) if t != '*' else asterisks_replace)
|
||||
count_aa += 1
|
||||
elif len(line_values) == 10:
|
||||
for j, t in enumerate(line_values):
|
||||
gap_profile[count_gap, j] = (
|
||||
2**(-float(t) / 1000.) if t != '*' else asterisks_replace)
|
||||
count_gap += 1
|
||||
elif not line_values:
|
||||
pass
|
||||
else:
|
||||
raise ValueError('Wrong length of line %s hhm file. Expected 0, 10 or 23'
|
||||
'got %d'%(line_values, len(line_values)))
|
||||
hmm_profile = np.hstack([aa_profile, gap_profile])
|
||||
assert len(hmm_profile) == len(sequence)
|
||||
return hmm_profile
|
||||
```
|
||||
|
||||
#### `non_gapped_profile`
|
||||
|
||||
A profile from amino acids only (discounting gaps).
|
||||
|
||||
```python
|
||||
def non_gapped_profile(amino_acids):
|
||||
"""Computes a profile from only amino acids and discounting gaps."""
|
||||
profile = np.zeros(21)
|
||||
for aa in amino_acids:
|
||||
if aa != 21: # Ignore gaps.
|
||||
profile[aa] += 1.
|
||||
return profile / np.sum(profile)
|
||||
```
|
||||
|
||||
#### `num_alignments`
|
||||
|
||||
The number of HHBlits multiple sequence alignments. Has to be repeated `NR`
|
||||
times. For example, if there are 10 alignments for a sequence of length 8, then
|
||||
`num_alignments = [[10], [10], [10], [10], [10], [10], [10], [10]]`.
|
||||
|
||||
#### `pseudo_frob`
|
||||
This feature collapses the 484 channels of pseudolikelihood into one by taking
|
||||
the Frobenius norm of the 484 channels and then subtracting the Average Product
|
||||
Correction of the computed Frobenius norm. The Frobenius norm does not take into
|
||||
account the 22nd gap state.
|
||||
|
||||
#### `pseudolikelihood`
|
||||
|
||||
Parameters of a Potts Model coupling the amino acid types of particular residues
|
||||
estimated by pseudolikelihood. See https://doi.org/10.1103/PhysRevE.87.012707
|
||||
for more details.
|
||||
|
||||
#### `residue_index`
|
||||
|
||||
Index of each residue giong from 0 to `NR - 1`. For example, the sequence `AACR`
|
||||
has `residue_index = [[0], [1], [2], [3]]`.
|
||||
|
||||
#### `reweighted_profile`
|
||||
|
||||
Profile where sequences are reweighted to weight rarer sequences higher. The
|
||||
sequence weights are calculated like this:
|
||||
|
||||
```python
|
||||
def sequence_weights(sequence_matrix):
|
||||
"""Compute sequence reweighting to weight rarer sequences higher."""
|
||||
num_rows, num_res = sequence_matrix.shape
|
||||
cutoff = 0.62 * num_res
|
||||
weights = np.ones(num_rows, dtype=np.float32)
|
||||
for i in range(num_rows):
|
||||
for j in range(i + 1, num_rows):
|
||||
similarity = (sequence_matrix[i] == sequence_matrix[j]).sum()
|
||||
if similarity > cutoff:
|
||||
weights[i] += 1
|
||||
weights[j] += 1
|
||||
return 1.0 / weights
|
||||
```
|
||||
|
||||
#### `seq_length`
|
||||
|
||||
The length of the amino acid sequence. Has to be repeated `NR` times. For
|
||||
example, the sequence `AACR` would have `seq_length = [[4], [4], [4], [4]]`.
|
||||
|
||||
#### `sequence`
|
||||
|
||||
The amino acid sequence (1-letter amino acid encoding). For example, a protein
|
||||
with Alanine, Lysine, Arginine has `sequence = 'AKR'`.
|
||||
|
||||
|
||||
# Disclaimer
|
||||
|
||||
This is not an official Google product.
|
||||
|
||||
Reference in New Issue
Block a user