生信自学笔记（六）Needleman-Wunsch算法的python实现

Needleman-Wunsch 算法是一种全局联配算法，原理说起来不算难，就是首先把两条目标 DNA 序列的最前面加一个空格代表空位，然后画一个大小为（一条 DNA 序列长度 * 另一条 DNA 序列的长度）的矩阵，矩阵横纵坐标分别代表对应序列的碱基或是空位，这样就可以依据打分规则把这个矩阵填充起来。

而矩阵的初始化方法是这样的，首先填完第一行和第一列，也就是简单的其中一个是空位的配对情况，然后依次填第二行第二列、第三行第三列。每次填新的分数时，首先要选取目标位置上侧、左上、左侧的分数，然后计算从这几个方向趋近分别会再得多少分，分别和原来的分数相加得到一个估计分数，最后填上最大的估计分数或者是最少的惩罚值。

得分矩阵画好之后，就可以回溯联配了。打分我们是从左上到右下，回溯就是从右下到左上。每一次都要选择这样的位置，即，从这个位置走到当前位置，按照打分规则恰好可以获得当前位置现在的分数，这就是十分典型的深搜问题了。此外，由于回溯时可以有多个位置能够达到相同的分数，所以最终的联配结果也是有多个的，当然它们的得分也是一致的。

样例用的是教科书上的: CTGTATC vs CTATAATCCC
计分规则：碱基匹配不罚分，错配罚一分，引入空位罚三分。

计算出的得分矩阵如下

[[ 0.  3.  6.  9. 12. 15. 18. 21. 24. 27. 30.]
 [ 3.  0.  3.  6.  9. 12. 15. 18. 21. 24. 27.]
 [ 6.  3.  0.  3.  6.  9. 12. 15. 18. 21. 24.]
 [ 9.  6.  3.  1.  4.  7. 10. 13. 16. 19. 22.]
 [12.  9.  6.  4.  1.  4.  7. 10. 13. 16. 19.]
 [15. 12.  9.  6.  4.  1.  4.  7. 10. 13. 16.]
 [18. 15. 12.  9.  6.  4.  2.  4.  7. 10. 13.]
 [21. 18. 15. 12.  9.  7.  5.  3.  4.  7. 10.]]

DNA 比对结果
这里写图片描述

源码:

import numpy as np

# 对任意两个字符进行比较
def compare(m, n, match, gap, n_match):
    if m == n:
        return match
    elif " " == m or n == " ":
        return gap
    else:
        return n_match
#构建打分矩阵
def mark(seq1, seq2, match, n_match, gap):
    "用来生成最终的打分矩阵,传入形参分别为两个序列和匹配罚分、错配罚分、引入空位罚分"
    a = len(seq1)
    b = len(seq2)
    # 初始化打分矩阵
    S = np.zeros((a + 1, b + 1))
    # 为打分矩阵的第一行和第一列赋值
    S[0, :] = np.fromfunction(lambda x, y: gap * (x + y), (1, b + 1), dtype=int)
    S[:, 0] = np.fromfunction(lambda x, y: gap * (x + y), (1,a + 1), dtype=int)
    # 在字符串最前添加前导空格
    seq1 = " " + seq1[:]
    seq2 = " " + seq2[:]
    # 三目运算符要这样写
    for r in range(1, b+1 if a>b else a+1):
        # up = S[r-1,r-1:]
        # left = S[r-1:,r-1]
        for c in range(r, b+1):
            # 分别获得上方元素、对角元素、左侧元素的分数, 上方和左侧补了空格之后恒为3
            last_score = [S[r - 1, c], S[r - 1, c - 1], S[r, c - 1]]
            change_score = [gap,compare(seq1[r],seq2[c], match, gap, n_match),gap]
            #两个数组中的元素相加要这么写
            new_score = np.add(last_score, change_score)
            # print(last_score)
            # print(change_score)
            # print(new_score)
            # print("r = ",r," c = ", c)
            S[r,c] = min(new_score)
            # print(S[r,c])
        for c in range(r+1, a+1):
            # 分别获得上方元素、对角元素、左侧元素的分数
            last_score = [S[c-1,r], S[c-1, r-1], S[c, r - 1]]
            change_score = [gap,compare(seq1[c],seq2[r], match, gap, n_match),gap]
            new_score = np.add(last_score, change_score)
            S[c,r] = min(new_score)
    return S
#回溯
def traceback(seq1, seq2, match, n_match, gap, S, current_x, current_y, S1, S2, t1, t2):
    if current_x == 1 and current_y == 1:
        S1.append(seq1[1] + t1[:])
        S2.append(seq2[1] + t2[:])
        return
    if S[current_x,current_y]-S[current_x-1,current_y] == gap:
        traceback(seq1, seq2, match, n_match, gap, S, current_x-1, current_y, S1, S2, seq1[current_x]+t1[:], "-"+t2[:])
    if S[current_x,current_y]-S[current_x,current_y-1] == gap:
        traceback(seq1, seq2, match, n_match, gap, S, current_x, current_y-1, S1, S2, "-" + t1[:], seq2[current_y] + t2[:])
    if S[current_x,current_y]-S[current_x-1,current_y-1] == compare(seq1[current_x], seq2[current_y], match, gap, n_match):
        traceback(seq1, seq2, match, n_match, gap, S, current_x-1, current_y-1, S1, S2, seq1[current_x] + t1[:], seq2[current_y] + t2[:])

调用

F = mark("CTGTATC","CTATAATCCC",0,1,3)
S1 = []
S2 = []
t1 = ""
t2 = ""
traceback(" CTGTATC", " CTATAATCCC", 0, 1, 3, F, F.shape[0]-1, F.shape[1]-1, S1, S2, t1, t2)
print(S1)
print(S2)

生信自学笔记（六）Needleman-Wunsch算法的python实现

猜你喜欢