Problem

5000 ms / 32768 kB

Problem description

The twenty-first century is a century of developing biotechnology. We know that a gene is made of DNA. The nucleotide bases from which DNA is built are A (adenine), C (cytosine), G (guanine), and T (thymine). Finding the longest common subsequence between DNA/protein sequences is one of the basic problems in modern computational molecular biology. But this problem is a little different. Given several DNA sequences, you are asked to make a shortest sequence from them so that each of the given sequences is a subsequence of it.

For example, given "ACGT", "ATGC", "CGTT" and "CAGT", you can make a sequence in the following way. It is the shortest but may not be the only one.

Input

The first line is the number of test cases t. Then t test cases follow. In each case, the first line is an integer n ( 1<=n<=8 ), the number of DNA sequences. The following n lines contain the n sequences, one per line. The length of every sequence is between 1 and 5.

Output

For each test case, print a line containing the length of the shortest sequence that can be made from these sequences.

Sample input

1
4
ACGT
ATGC
CGTT
CAGT

Sample output

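Problem summary

[Figure: illustration of the sample]
Find a shortest string such that each of the N given strings is a subsequence of it (not necessarily a contiguous substring), and output its length.
The given strings (DNA) consist only of the 4 characters $A$ , $C$ , $G$ , $T$ .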

Analysis

Looking at this problem, you may have thought of a whole pile of string algorithms.
Sadly, it is a search problem = =

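Search strategy

Search for the answer directly.

Give each string to be matched, $S[i]$ , a pointer $pos$ that points at its first character not yet matched (1-based).
For example, in the sample, if the search has built $ACAG$ so far, then:

$S[1].pos = 4$ , i.e. $T$ ;
$S[2].pos = 2$ , i.e. $T$ ;
$S[3].pos = 3$ , i.e. $T$ ;
$S[4].pos = 4$ , i.e. $T$ .

Each time we recurse one level deeper, we pick one of the characters that some string's pos points at, and append it to the answer at this step.
In the example above there is only one choice at this point (recursion level 5): $T$ . Picking any other character is useless, because it would not match any of the strings.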
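The pointer bookkeeping above can be sketched in a few lines. This is only an illustration: `advancePointers` is a hypothetical helper name, and it uses `std::string` with 1-based `pos` values rather than the struct used in the full solution.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original code): feed the characters of
// `prefix` one at a time and return, for each pattern, the 1-based index of
// its first character that is still unmatched.
std::vector<int> advancePointers(const std::vector<std::string>& patterns,
                                 const std::string& prefix) {
    std::vector<int> pos(patterns.size(), 1);
    for (char c : prefix) {
        for (std::size_t i = 0; i < patterns.size(); ++i) {
            const std::string& s = patterns[i];
            // pos is 1-based, so the character it points at is s[pos - 1].
            if (pos[i] <= static_cast<int>(s.size()) && s[pos[i] - 1] == c)
                ++pos[i];
        }
    }
    return pos;
}
```

For the sample strings and the prefix `ACAG`, this returns the positions 4, 2, 3, 4 listed above.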

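Iterative deepening

Since we do not know how long the answer is, we use iterative deepening: fix a depth limit, run a depth-first search that never goes past it, and raise the limit by one whenever the search fails.
A general template for iterative deepening: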

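int MaxD;//depth limit of the current iteration
bool dfs(int i){//i is the current depth
    if(/*solution found*/)
        return true;
    if(i>=MaxD)//depth limit reached without a solution
        return false;
    for(/*each feasible move*/){
        /*apply the move*/
        if(dfs(i+1))
            return true;
        /*undo the move*/
    }
    return false;
}
int main(){
    MaxD=/*lower bound on the number of steps*/;
    while(!dfs(0))
        MaxD++;
}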

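Heuristic search

In practice, iterative deepening usually comes paired with an A*-style estimate, which together give IDA*.
So all we need is one extra check at the start of the depth-first search: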

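int MaxD;//depth limit of the current iteration
bool dfs(int i){//i is the current depth
    if(/*solution found*/)
        return true;
    if(i+/*steps still needed in the best case*/>MaxD)
        return false;
    for(/*each feasible move*/){
        /*apply the move*/
        if(dfs(i+1))
            return true;
        /*undo the move*/
    }
    return false;
}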

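Optimistic estimate

Before doing this problem, I could never work out whether the estimate should be "optimistic" or "pessimistic".
Since we use it for pruning, it has to be optimistic: if we cannot finish within $MaxD$ steps even in the best case, we certainly cannot in any worse case.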
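The same argument, stated a little more formally. The symbols $d$ , $h$ and $h^*$ are notation introduced here for illustration: $d$ is the current depth, $h$ is the estimate, and $h^*$ is the true minimum number of characters still needed.

$$h \le h^* \;\Longrightarrow\; \bigl(d + h > MaxD \;\Rightarrow\; d + h^* > MaxD\bigr)$$

So with an optimistic estimate, every pruned branch really cannot finish within $MaxD$ steps. If $h > h^*$ somewhere, the implication no longer holds and a branch containing a valid answer could be cut.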

Method 1

In a very ideal situation, we only need $\max_{1\leq i\leq N}(\text{number of unmatched characters of } S[i])$ more characters, that is:

int Wrong(){
    int ret=0;
    for(int i=1;i<=N;i++)
        ret=max(ret,A[i].len-A[i].pos+1);
    return ret;
}
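To see what this bound gives on concrete data, here is a self-contained sketch of the same idea. The name `longestRemaining` and the use of `std::string` plus a separate `pos` vector are assumptions made for this example. They are not the layout of the full solution.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Method 1 lower bound: the longest unmatched suffix among all sequences.
// pos[i] is the 1-based index of the first unmatched character of seqs[i]
// (hypothetical layout chosen for this sketch).
int longestRemaining(const std::vector<std::string>& seqs,
                     const std::vector<int>& pos) {
    int best = 0;
    for (std::size_t i = 0; i < seqs.size(); ++i)
        best = std::max(best, static_cast<int>(seqs[i].size()) - pos[i] + 1);
    return best;
}
```

On the sample, the bound is 4 at the start (every `pos` is 1) and 3 after the prefix `ACAG` (positions 4, 2, 3, 4, where `ATGC` still has `TGC` left).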

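Note, once more: this is an optimistic estimate.
This one is rather too optimistic, though, so it does not prune many branches. The result:
[Screenshot: submission result]
Thank heavens for the generous time limit...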

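Method 2

This method is a bit more sensible.
Over the unmatched parts of all $S[i](1\leq i\leq N)$ , find the largest number of $A$ in any one string, the largest number of $C$ , the largest number of $G$ and the largest number of $T$ , then add the four values up. Appending one character can reduce the count of that character by at most one in each string, so this sum is still a lower bound. That is: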

char dict[5]={0,'A','C','G','T'};//the 4 characters (index 0 unused)
int Wrong(){
    int ret=0;
    for(int i=1;i<=4;i++){//for each character
        int Max=0;
        for(int j=1;j<=N;j++){//for each string
            int tmp=0;
            for(int k=A[j].pos;k<=A[j].len;k++)
                tmp+=A[j].str[k]==dict[i];//count dict[i] in the unmatched part
            Max=max(Max,tmp);//keep the largest count
        }
        ret+=Max;//accumulate
    }
    return ret;
}
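A small self-contained sketch makes it easy to try this bound on the sample. The name `perLetterBound` and the `std::string` / `pos` vector layout are assumptions for this example only. It also assumes every `pos[i]` lies between 1 and the string length plus 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Method 2 lower bound: for each letter, take the largest number of
// occurrences in any single unmatched suffix, then sum over the 4 letters.
// pos[i] is the 1-based index of the first unmatched character of seqs[i]
// (hypothetical layout; requires 1 <= pos[i] <= seqs[i].size() + 1).
int perLetterBound(const std::vector<std::string>& seqs,
                   const std::vector<int>& pos) {
    const std::string letters = "ACGT";
    int total = 0;
    for (char c : letters) {
        int best = 0;
        for (std::size_t i = 0; i < seqs.size(); ++i) {
            // Count c in the part of seqs[i] that is still unmatched.
            int cnt = static_cast<int>(
                std::count(seqs[i].begin() + (pos[i] - 1), seqs[i].end(), c));
            best = std::max(best, cnt);
        }
        total += best;
    }
    return total;
}
```

On the sample this gives 5 at the start (1 + 1 + 1 + 2, the two `T`s coming from `CGTT`) and 4 after the prefix `ACAG`, where the longest-remaining-suffix bound of method 1 gives only 4 and 3 respectively.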

And so:
[Screenshot: submission result]

Code

This is the code for method 2. For method 1, only the Wrong function needs to change.

#include<vector>
#include<cstdio>
#include<cstring>
#include<algorithm>
using namespace std;

#define MAXN 8
#define MAXL 5
struct DNA{
    int len,pos;//length, and index of the first unmatched character
    char str[MAXL+5];//the string, stored from index 1
}A[MAXN+5];
int N;

int MaxD;
char dict[5]={0,'A','C','G','T'};
//heuristic function, explained above
int Wrong(){
    int ret=0;
    for(int i=1;i<=4;i++){
        int Max=0;
        for(int j=1;j<=N;j++){
            int tmp=0;
            for(int k=A[j].pos;k<=A[j].len;k++)
                tmp+=A[j].str[k]==dict[i];
            Max=max(Max,tmp);
        }
        ret+=Max;
    }
    return ret;
}
//IDA*
bool dfs(int i){
    int tmp=Wrong();
    if(tmp==0) return true;
    if(i+tmp>MaxD) return false;
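    for(int j=1;j<=4;j++){//try each of the 4 characters
        //i.e. append dict[j] to the answer
        vector<int> Updated;
        //Updated records which pos were advanced, so they can be restored
        for(int k=1;k<=N;k++)
            if(A[k].str[A[k].pos]==dict[j]){//advance only the DNA whose next unmatched character is dict[j]
                A[k].pos++;
                Updated.push_back(k);//remember that this one changed
            }
        if(Updated.size()){//if nothing advanced, skip the recursion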
            if(dfs(i+1))
                return true;
            for(int k=0;k<int(Updated.size());k++)//restore the state
                A[Updated[k]].pos--;
        }
    }
    return false;
}

int main(){
    int T;
    scanf("%d",&T);
    while(T--){
        scanf("%d",&N);
        for(int i=1;i<=N;i++){
            A[i].pos=1;
            scanf("%s",A[i].str+1);
            MaxD=max(MaxD,A[i].len=strlen(A[i].str+1));
            //MaxD must be at least the length of the longest DNA
        }
        while(!dfs(0))
            MaxD++;
        printf("%d\n",MaxD);
        MaxD=0;
        memset(A,0,sizeof A);
    }
    return 0;
}
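The program prints only a length, so when debugging it can help to check a candidate answer by hand. The sketch below is an add-on, not part of the solution above: `isSubsequence` and `isCommonSupersequence` are hypothetical helper names, and the candidate `ACAGTGCT` used in the note after the code is one string picked for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if s can be obtained from t by deleting zero or more characters.
bool isSubsequence(const std::string& s, const std::string& t) {
    std::size_t matched = 0;  // number of characters of s matched so far
    for (char c : t)
        if (matched < s.size() && s[matched] == c)
            ++matched;
    return matched == s.size();
}

// True if every string in seqs is a subsequence of candidate.
bool isCommonSupersequence(const std::vector<std::string>& seqs,
                           const std::string& candidate) {
    for (const std::string& s : seqs)
        if (!isSubsequence(s, candidate))
            return false;
    return true;
}
```

For the sample strings, `isCommonSupersequence` accepts `ACAGTGCT` (length 8) and rejects `ACGT`. The check only confirms that a candidate contains every input; it says nothing about whether the candidate is the shortest.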
