PSP - AlphaFold2 根据 Species 进行 MSA Pairing 的源码解析

欢迎关注我的CSDN：https://spike.blog.csdn.net/
本文地址：https://blog.csdn.net/caroline_wendy/article/details/131399818

MSA Paring

AlphaFold2 Multimer 能够预测多肽链之间相互作用的方法，使用 MSA Pairing 的技术。MSA Pairing 是指通过比较 MSA 来寻找共进化的残基对，这些残基对可能反映了蛋白质之间的接触。AlphaFold2 Multimer 通过计算 MSA Pairing 的得分，来评估不同的多肽链组合的可能性，从而生成最优的三维结构。AlphaFold2 Multimer 提高蛋白质结构预测的准确性和效率，为生物学和医学研究提供了有价值的信息。

官方版本的 AlphaFold2 中 MSA Pairing 的物种信息 (Species) 只处理 uniprot_hits.sto 文件中的信息，其他 MSA 文件不作处理。

1. 解析 sto 文件

stockholm格式是msa的文件格式，解析 sto 文件，提取 MSA 序列和描述，源码如下：

def parse_stockholm(stockholm_string: str):
    """Parses sequences and deletion matrix from stockholm format alignment.

    Args:
    stockholm_string: The string contents of a stockholm file. The first
      sequence in the file should be the query sequence.

    Returns:
    A tuple of:
      * A list of sequences that have been aligned to the query. These
        might contain duplicates.
      * The deletion matrix for the alignment as a list of lists. The element
        at `deletion_matrix[i][j]` is the number of residues deleted from
        the aligned sequence i at residue position j.
      * The names of the targets matched, including the jackhmmer subsequence
        suffix.
    """
    name_to_sequence = collections.OrderedDict()
    for line in stockholm_string.splitlines():
        line = line.strip()
        if not line or line.startswith(('#', '//')):
            continue
        name, sequence = line.split()
        if name not in name_to_sequence:
            name_to_sequence[name] = ''
        name_to_sequence[name] += sequence

    msa = []
    deletion_matrix = []

    query = ''
    keep_columns = []
    for seq_index, sequence in enumerate(name_to_sequence.values()):
        if seq_index == 0:
            # Gather the columns with gaps from the query
            query = sequence
            keep_columns = [i for i, res in enumerate(query) if res != '-']

        # Remove the columns with gaps in the query from all sequences.
        aligned_sequence = ''.join([sequence[c] for c in keep_columns])

        msa.append(aligned_sequence)

        # Count the number of deletions w.r.t. query.
        deletion_vec = []
        deletion_count = 0
        for seq_res, query_res in zip(sequence, query):
            if seq_res != '-' or query_res != '-':
                if query_res == '-':
                    deletion_count += 1
                else:
                    deletion_vec.append(deletion_count)
                    deletion_count = 0
        deletion_matrix.append(deletion_vec)

    return msa, list(name_to_sequence.keys())

stockholm格式文件的描述信息，主要关注于 GS 至 DE 之间的部分，类似 a3m 中的 > 描述：

stockholm

解析出的信息如下：

[Info] sample 173 desc sp|P0C2N4|YSCX_YEREN/1-122, seq MSRIITAPHIGIEKLSAISLEELSCGLPDRYALPPDGHPVEPHLERLYPTAQSKRSLWDFASPGYTFHGLHRAQDYRRELDTLQSLLTTSQSSELQAAAALLKCQQDDDRLLQIILNLLHKV
[Info] species: YEREN
[Info] per_seq_similarity: 1.0
[Info] sample 173 desc tr|A0A0H3NZW9|A0A0H3NZW9_YERE1/1-122, seq MSRIITAPHIGIEKLSAISLEELSCGLPDRYALPPDGHPVEPHLERLYPTAQSKRSLWDFASPGYTFHGLHRAQDYRRELDTLQSLLTTSQSSELQAAAALLKCQQDDDRLLQIILNLLHKV
[Info] species: YERE1
[Info] per_seq_similarity: 1.0
[Info] sample 173 desc sp|A1JU78|YSCX_YERE8/1-122, seq MSRIITAPHIGIEKLSAISLEELSCGLPERYALPPDGHPVEPHLERLYPTAQSKRSLWDFASPGYTFHGLHRAQDYRRELDTLQSLLTTSQSSELQAAAALLKCQQDDDRLLQIILNLLHKV
[Info] species: YERE8
[Info] per_seq_similarity: 0.9918032786885246
[Info] sample 173 desc sp|P61416|YSCX_YERPE/1-122, seq MSRIITAPHIGIEKLSAISLEELSCGLPERYALPPDGHPVEPHLERLYPTAQSKRSLWDFASPGYTFHGLHRAQDYRRELDTLQSLLTTSQSSELQAAAALLKCQQDDDRLLQIILNLLHKV
[Info] species: YERPE
[Info] per_seq_similarity: 0.9918032786885246

2. 提取物种信息

使用序列描述提取物种信息，即|所分割的第3部分中的后半段，源码如下：

例如描述是sp|P0C2N4|YSCX_YEREN/1-122，物种是YEREN

# Sequences coming from UniProtKB database come in the
# `db|UniqueIdentifier|EntryName` format, e.g. `tr|A0A146SKV9|A0A146SKV9_FUNHE`
# or `sp|P0C2L1|A3X1_LOXLA` (for TREMBL/Swiss-Prot respectively).
_UNIPROT_PATTERN = re.compile(
    r"""
    ^
    # UniProtKB/TrEMBL or UniProtKB/Swiss-Prot
    (?:tr|sp)
    \|
    # A primary accession number of the UniProtKB entry.
    (?P<AccessionIdentifier>[A-Za-z0-9]{6,10})
    # Occasionally there is a _0 or _1 isoform suffix, which we ignore.
    (?:_\d)?
    \|
    # TREMBL repeats the accession ID here. Swiss-Prot has a mnemonic
    # protein ID code.
    (?:[A-Za-z0-9]+)
    _
    # A mnemonic species identification code.
    (?P<SpeciesIdentifier>([A-Za-z0-9]){1,5})
    # Small BFD uses a final value after an underscore, which we ignore.
    (?:_\d+)?
    $
    """,
    re.VERBOSE)


def _parse_sequence_identifier(msa_sequence_identifier: str):
    """Gets species from an msa sequence identifier.

    The sequence identifier has the format specified by
    _UNIPROT_TREMBL_ENTRY_NAME_PATTERN or _UNIPROT_SWISSPROT_ENTRY_NAME_PATTERN.
    An example of a sequence identifier: `tr|A0A146SKV9|A0A146SKV9_FUNHE`

    Args:
        msa_sequence_identifier: a sequence identifier.

    Returns:
        An `Identifiers` instance with species_id. These
        can be empty in the case where no identifier was found.
    """
    matches = re.search(_UNIPROT_PATTERN, msa_sequence_identifier.strip())
    if matches:
        return matches.group('SpeciesIdentifier')
    return ""


def _extract_sequence_identifier(description: str):
    """Extracts sequence identifier from description. Returns None if no match."""
    split_description = description.split()
    if split_description:
        return split_description[0].partition('/')[0]
    else:
        return None


def get_identifiers(description: str):
    """Computes extra MSA features from the description."""
    sequence_identifier = _extract_sequence_identifier(description)
    if sequence_identifier is None:
        return ""
    else:
        return _parse_sequence_identifier(sequence_identifier)

3. 计算序列相似度

按照氨基酸一致性百分比计算相似度，完全相同是1.0，源码如下：

def msa_seq_similarity(query_seq, msa_seq):
    """
    MSA 序列相似度
    """
    query_seq = np.array(list(query_seq))
    msa_seq = np.array(list(msa_seq))
    seq_similarity = np.sum(query_seq == msa_seq) / float(len(query_seq))
    return seq_similarity

测试脚本，如下：

class MultimerSpeciesParser(object):
    def __init__(self):
        pass

    def process(self, a_path, b_path):
        assert os.path.isfile(a_path)
        assert os.path.isfile(b_path)
        print(f"[Info] A 链路径: {
      
      a_path}")
        print(f"[Info] B 链路径: {
      
      b_path}")
        with open(a_path) as f:
            sto_string = f.read()
        a_msa, a_desc = parse_stockholm(sto_string)
        assert len(a_msa) == len(a_desc)
        for i in range(1, 5):
            print(f"[Info] sample {
      
      len(a_msa)} desc {
      
      a_desc[i]}, seq {
      
      a_msa[i]}")
            sid = get_identifiers(a_desc[i])
            print(f"[Info] species: {
      
      sid}")
            per_seq_similarity = msa_seq_similarity(a_msa[0], a_msa[i])
            print(f"[Info] per_seq_similarity: {
      
      per_seq_similarity}")