DNA Laboratory

Time Limit: 5000MS Memory Limit: 30000K
Total Submissions: 3242 Accepted: 615

Description

Background
Having started to build his own DNA lab just recently, the evil doctor Frankenstein is not quite up to date yet. He wants to extract his DNA, enhance it somewhat and clone himself. He has already figured out how to extract DNA from some of his blood cells, but unfortunately reading off the DNA sequence means breaking the DNA into a number of short pieces and analyzing those first. Frankenstein has not quite understood how to put the pieces together to recover the original sequence.
His pragmatic approach to the problem is to sneak into university and to kidnap a number of smart looking students. Not surprisingly, you are one of them, so you had better come up with a solution pretty fast.
Problem
You are given a list of strings over the alphabet A (for adenine), C (cytosine), G (guanine), and T (thymine), and your task is to find the shortest string (which is typically not listed) that contains all given strings as substrings.
If there are several such strings of shortest length, find the smallest in alphabetical/lexicographical order.

Input

The first line contains the number of scenarios.
For each scenario, the first line contains the number n of strings with 1 <= n <= 15. Then these strings with 1 <= length <= 100 follow, one on each line, and they consist of the letters “A”, “C”, “G”, and “T” only.

Output

The output for every scenario begins with a line containing “Scenario #i:”, where i is the number of the scenario starting at 1. Then print a single line containing the shortest (and smallest) string as described above. Terminate the output for the scenario with a blank line.

Sample Input

1
2
TGCACA
CAT

Sample Output

Scenario #1:
TGCACAT

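Problem summary:

Given $n$ ($1≤n≤15$) strings, each of length at most $100$, find the shortest string $X$ such that all $n$ strings are substrings of $X$. If several shortest strings exist, $X$ is the lexicographically smallest of them.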

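Analysis:

First, consider how to make the constructed $X$ as short as possible.

1. Once every string is already a substring of $X$, there is no need to add further characters to $X$; in other words, $X$ should contain exactly these $n$ strings and nothing extra.
2. If, for two strings $S_u$ and $S_v$, $S_u$ is a substring of $S_v$, then $X$ only needs to contain $S_v$.
3. If, for two strings $S_u$ and $S_v$, some suffix of $S_u$ matches some prefix of $S_v$, we can overlap the suffix of $S_u$ with the prefix of $S_v$ as much as possible to reduce the length of $X$.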

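Point $1$ is satisfied automatically while constructing $X$.

For point $2$, we can preprocess the input and delete every string that is a substring of another string:

The simple method enumerates every pair of strings in $O(n^2)$, tries each position of the shorter string inside the longer one in $O(L)$, and checks each position for a match in at most $O(L)$, for a total of $O(n^2L^2)$. With string hashing (a rolling hash), however, each position can be checked in $O(1)$, so deciding whether one string is a substring of another takes $O(L)$, and deleting all strings that are substrings of other strings takes $O(n^2L)$.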
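The hashed substring check described above can be sketched as follows. This is a minimal illustration, not the listing from this post: the name `isSubstring` and the base `131` are assumptions, and the extra `compare` call that confirms each hash hit is an addition to guard against collisions.

```cpp
#include <string>

typedef unsigned long long ull;

// Hypothetical helper: returns true if `pat` occurs somewhere inside `text`.
// Hashes are taken modulo 2^64 through unsigned overflow.
bool isSubstring(const std::string& pat, const std::string& text)
{
	const ull BASE = 131; // assumed hash base, chosen for illustration
	const size_t m = pat.size(), n = text.size();
	if (m > n) return false;

	ull hp = 0, ht = 0, top = 1; // top ends up as BASE^m
	for (size_t i = 0; i < m; ++i)
	{
		hp = hp * BASE + (unsigned char)pat[i];
		ht = ht * BASE + (unsigned char)text[i];
		top *= BASE;
	}
	for (size_t i = 0; ; ++i)
	{
		// ht is the hash of the window text[i, i + m).
		// A hash hit is confirmed by a real comparison.
		if (hp == ht && text.compare(i, m, pat) == 0) return true;
		if (i + m == n) return false; // that was the last window
		// Slide the window one character to the right in O(1).
		ht = ht * BASE + (unsigned char)text[i + m] - (unsigned char)text[i] * top;
	}
}
```

For instance, `isSubstring("CAT", "TGCACAT")` is true because the match sits in the very last window, which is why the loop tests the window before deciding whether to stop.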

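For point $3$, it is easy to prove that in the shortest $X$ the total length of the overlaps between consecutive strings is maximal, because the length of $X$ equals the sum of all string lengths minus the total overlap. How do we find this maximum total overlap?

First, we preprocess, for every ordered pair of strings, the maximum length at which a suffix of the first matches a prefix of the second:

The simple method enumerates every ordered pair of strings in $O(n^2)$, tries each candidate match length in $O(L)$, and checks whether the suffix and the prefix agree in at most $O(L)$, for a total of $O(n^2L^2)$. As before, string hashing lets us check each candidate length in $O(1)$, so preprocessing the match length of every ordered pair takes $O(n^2L)$.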
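As a reference point, the simple (non-hashed) version of this preprocessing step might look like the sketch below. The name `maxOverlap` is hypothetical; it uses direct character comparison, so it is the $O(L^2)$ per pair variant rather than the hashed one.

```cpp
#include <string>
#include <algorithm>

// Hypothetical helper: length of the longest suffix of `a`
// that is equal to a prefix of `b`, found by plain comparison.
int maxOverlap(const std::string& a, const std::string& b)
{
	int limit = (int)std::min(a.size(), b.size());
	for (int len = limit; len > 0; --len)
	{
		// Compare the last `len` characters of a with the first `len` of b.
		if (a.compare(a.size() - len, len, b, 0, len) == 0) return len;
	}
	return 0; // no non-empty suffix of a is a prefix of b
}
```

On the sample input, `maxOverlap("TGCACA", "CAT")` is 2 (the shared "CA"), which accounts for the answer length 6 + 3 - 2 = 7. Note that the function is not symmetric: `maxOverlap("CAT", "TGCACA")` is 1.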

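Next, consider how to find the $X$ with the maximum total overlap.

The naive method enumerates every possible ordering of the strings. There are $n!$ orderings, which is enormous: even though $n$ is small in this problem, $n!$ reaches about $1.3×10^{12}$ for $n=15$, far too many to enumerate.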

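This brings to mind the classic traveling salesman problem, which is NP-hard. There, too, the naive method enumerates all possible routes, which takes $O(n!)$ for $n$ vertices, but a bitmask (state-compression) dynamic programming algorithm reduces the time complexity to $O(2^n·n^2)$.

When solving the traveling salesman problem, let $dp[S][v]$ be the minimum distance needed to start from $v$ and visit every vertex in the set $S$. Because the salesman must finally return to the starting vertex $s$, we treat $s$ as unvisited at the beginning, so that it is the last vertex reached (every other $dp[\emptyset][v]$ is infinite). The recurrence is:
$\begin{aligned} dp[\emptyset][s] =&0\\ dp[S][v] =&\min\{dp[S-\{u\}][u]+dist(v,u)\mid u\in{S}\} \end{aligned}$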
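The traveling salesman recurrence can be sketched in code as below. This is an illustrative version under stated assumptions: the name `shortestTour` is hypothetical, vertex 0 is taken as the start, and `dist` is assumed to be a complete $n×n$ matrix of non-negative distances.

```cpp
#include <vector>
#include <algorithm>

// Hypothetical helper: length of the shortest tour that starts at vertex 0,
// visits every vertex once and returns to vertex 0.
int shortestTour(const std::vector<std::vector<int> >& dist)
{
	const int n = (int)dist.size();
	const int INF = 1 << 29;
	// dp[S][v]: cheapest way to start at v and visit every vertex in S,
	// where S is the set of vertices that still have to be visited.
	std::vector<std::vector<int> > dp(1 << n, std::vector<int>(n, INF));
	dp[0][0] = 0; // nothing left to visit and we are back at the start
	for (int S = 1; S < (1 << n); ++S)
		for (int v = 0; v < n; ++v)
			for (int u = 0; u < n; ++u)
			{
				if (!(S >> u & 1)) continue; // u must still be unvisited
				dp[S][v] = std::min(dp[S][v], dp[S & ~(1 << u)][u] + dist[v][u]);
			}
	// Vertex 0 counts as unvisited at the beginning, so the tour has to end there.
	return dp[(1 << n) - 1][0];
}
```

The three nested loops make the $O(2^n·n^2)$ bound visible: $2^n$ sets, $n$ choices of $v$, and $n$ choices of $u$.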

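For the present problem, if we treat every string as a vertex, the distance from vertex $A$ to vertex $B$ can be taken to be
$Len_B-Match(A,B)$
that is, the length of string $B$ minus the maximum match length between a suffix of string $A$ and a prefix of string $B$. We can then let $dp[S][v]$ be the shortest length of a string that starts with string $v$ and contains exactly the strings in the set $S$ (so $v∈S$). Placing $v$ in front of a string that starts with $u$ adds $Len_v-Match(v,u)$ characters; the total is the same as with the distance above, namely the sum of all lengths minus the overlaps between consecutive strings. In the same spirit as the traveling salesman problem, this gives the recurrence:

$\begin{aligned} dp&[\{v\}][v]=Len_v\\ dp&[S][v]=\min\{dp[S-\{v\}][u]+Len_v-Match(v,u)\mid u∈S-\{v\}\} \end{aligned}$

The only difference is that here we initialize $dp[\{v\}][v]$ to $Len_v$: when only string $v$ is present, the minimum length is the length of $v$ itself.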
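A minimal sketch of this DP, computing only the optimal length, might look like the following. The names `overlapLen` and `shortestLength` are hypothetical, overlaps are found by plain comparison rather than hashing, and the input is assumed to be non-empty with no string being a substring of another.

```cpp
#include <string>
#include <vector>
#include <algorithm>

// Longest suffix of a that is also a prefix of b, found by plain comparison.
static int overlapLen(const std::string& a, const std::string& b)
{
	for (int len = (int)std::min(a.size(), b.size()); len > 0; --len)
		if (a.compare(a.size() - len, len, b, 0, len) == 0) return len;
	return 0;
}

// Hypothetical helper: length of the shortest string containing every element
// of strs. Assumes strs is non-empty and no element is a substring of another.
int shortestLength(const std::vector<std::string>& strs)
{
	const int n = (int)strs.size();
	const int INF = 1 << 29;
	std::vector<std::vector<int> > match(n, std::vector<int>(n, 0));
	for (int i = 0; i < n; ++i)
		for (int j = 0; j < n; ++j)
			if (i != j) match[i][j] = overlapLen(strs[i], strs[j]);

	// dp[S][v]: shortest string that starts with strs[v] and covers the set S.
	std::vector<std::vector<int> > dp(1 << n, std::vector<int>(n, INF));
	for (int S = 1; S < (1 << n); ++S)
		for (int v = 0; v < n; ++v)
		{
			if (!(S >> v & 1)) continue;   // the head v must belong to S
			const int rest = S ^ (1 << v); // S without v
			const int lenV = (int)strs[v].size();
			if (rest == 0) { dp[S][v] = lenV; continue; } // base case dp[{v}][v]
			for (int u = 0; u < n; ++u)
				if (rest >> u & 1) // u is the head of the remaining part
					dp[S][v] = std::min(dp[S][v], dp[rest][u] + lenV - match[v][u]);
		}
	const std::vector<int>& last = dp[(1 << n) - 1];
	return *std::min_element(last.begin(), last.end());
}
```

Because `rest` is always numerically smaller than `S`, iterating the sets in increasing order guarantees that every state on the right-hand side is already final.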

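Finally, the problem asks for the lexicographically smallest $X$. Because the $dp$ array records both the length and the leading string, it is convenient for recovering a string that is both shortest and lexicographically smallest. We only need to follow the lexicographically smallest path through the $dp$ array to construct $X$. Concretely:

1. Scan $dp[S_{all}][i]$ over all $i$ and record the minimum length and the $i$ that attains it; on a tie, keep the $i$ whose $Str_i$ is lexicographically smaller. The string recorded at the end of the scan is the head of the answer $X$; use it as the start of the partial solution, called $X'$.
2. Let $x$ be the index of the last string appended to $X'$, and let $S$ consist of $x$ together with all strings not yet in $X'$. Scan $dp[S-\{x\}][i]$ over $i∈S-\{x\}$ and consider every $i$ that satisfies $dp[S-\{x\}][i]+Len_x-Match(x,i)=dp[S][x]$. Among these, choose the one whose remainder, that is, $Str_i$ with the part matched against the suffix of $Str_x$ removed, is lexicographically smallest, and append that remainder to the end of $X'$.
3. Set $S=S-\{x\}$ and $x=i$. If $S=\{x\}$, the solution is complete; otherwise return to step $2$.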
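The three reconstruction steps can be sketched end to end as below. This is an illustrative, self-contained version and not the listing from this post: the names `suffixPrefix` and `buildSuperstring` are hypothetical, overlaps use plain comparison, and the input is assumed to be non-empty with substring-contained strings already removed.

```cpp
#include <string>
#include <vector>
#include <algorithm>

// Longest suffix of a that is also a prefix of b, found by plain comparison.
static int suffixPrefix(const std::string& a, const std::string& b)
{
	for (int len = (int)std::min(a.size(), b.size()); len > 0; --len)
		if (a.compare(a.size() - len, len, b, 0, len) == 0) return len;
	return 0;
}

// Hypothetical helper: shortest, then lexicographically smallest, string that
// contains every element of strs (assumed non-empty, no element inside another).
std::string buildSuperstring(const std::vector<std::string>& strs)
{
	const int n = (int)strs.size(), full = (1 << n) - 1, INF = 1 << 29;
	std::vector<std::vector<int> > match(n, std::vector<int>(n, 0));
	for (int i = 0; i < n; ++i)
		for (int j = 0; j < n; ++j)
			if (i != j) match[i][j] = suffixPrefix(strs[i], strs[j]);

	// dp[S][v]: shortest string that starts with strs[v] and covers the set S.
	std::vector<std::vector<int> > dp(1 << n, std::vector<int>(n, INF));
	for (int S = 1; S <= full; ++S)
		for (int v = 0; v < n; ++v)
		{
			if (!(S >> v & 1)) continue;
			const int rest = S ^ (1 << v);
			if (rest == 0) { dp[S][v] = (int)strs[v].size(); continue; }
			for (int u = 0; u < n; ++u)
				if (rest >> u & 1)
					dp[S][v] = std::min(dp[S][v], dp[rest][u] + (int)strs[v].size() - match[v][u]);
		}

	// Step 1: choose the head, breaking length ties by the smaller string.
	int x = 0;
	for (int i = 1; i < n; ++i)
		if (dp[full][i] < dp[full][x] || (dp[full][i] == dp[full][x] && strs[i] < strs[x]))
			x = i;
	std::string res = strs[x];

	// Steps 2 and 3: repeatedly append the smallest remainder on an optimal path.
	for (int S = full; S != (1 << x); )
	{
		const int rest = S ^ (1 << x);
		int best = -1;
		std::string bestTail;
		for (int i = 0; i < n; ++i)
		{
			if (!(rest >> i & 1)) continue;
			if (dp[rest][i] + (int)strs[x].size() - match[x][i] != dp[S][x]) continue;
			std::string tail = strs[i].substr(match[x][i]); // part of strs[i] not covered by strs[x]
			if (best < 0 || tail < bestTail) { best = i; bestTail = tail; }
		}
		res += bestTail;
		S = rest;
		x = best;
	}
	return res;
}
```

On the sample input the head is "TGCACA" (length 7 versus 8 with "CAT" first), the only candidate successor is "CAT" with remainder "T", and the result is "TGCACAT".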

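The $dp$ array has $2^n×n$ states, and each state is computed from at most $n$ other states, so the dynamic programming takes $O(2^n·n^2)$. During reconstruction, a string is appended $n$ times; each time at most $n$ candidate strings are compared, and each comparison takes at most $O(L)$, so reconstruction takes $O(n^2L)$.

Including preprocessing, the total time complexity is $O(2^n·n^2+n^2L)$, which is comfortably within the time limit.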
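To make the gap between the two approaches concrete, the rough operation counts can be evaluated for the largest input. The helper names below are hypothetical, and the counts ignore constant factors.

```cpp
// Hypothetical helper: number of orderings the naive method would enumerate (n!).
long long permutationCount(int n)
{
	long long r = 1;
	for (int i = 2; i <= n; ++i) r *= i;
	return r;
}

// Hypothetical helper: 2^n * n^2 + n^2 * L, the bound derived in the analysis.
long long bitmaskDpOps(int n, int L)
{
	return (1LL << n) * n * n + 1LL * n * n * L;
}
```

With $n=15$ and $L=100$, `permutationCount(15)` is 1307674368000, while `bitmaskDpOps(15, 100)` is 7395300, a difference of more than five orders of magnitude.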

The code follows:

#include <cstdio>
#include <cstring>        // strlen, used by contain() and overlap()
#include <string>
#include <algorithm>
#include <vector>
using namespace std;

typedef unsigned long long ull;

ull B = 54788567;         // base used for string hashing (a prime, per the original comment)
const int INF = 1 << 16;
const int MAX_N = 15;
const int MAX_LEN = 102;

char ch[MAX_LEN];
vector<string> S;
int sublen[MAX_N][MAX_N]; // sublen[i][j] : max match length between a suffix of string i and a prefix of string j
int dp[1 << MAX_N][MAX_N];// dp[i][j]     : shortest length when the chosen set is i and the leftmost string is j

string solve(int n);
bool contain(const char* const& str1, const char* const& str2);
int overlap(const char* const& str1, const char* const& str2);

int main()
{
	int T, n;
	scanf("%d", &T);

	for (int casen = 1; casen <= T; ++casen)
	{
		S.clear();
		scanf("%d", &n);
		for (int i = 0; i < n; ++i)
		{
			scanf(" \n%s", ch);
			S.push_back(ch);
		}
		printf("Scenario #%d:\n%s\n\n", casen, solve(n).c_str());
	}

	return 0;
}

string solve(int n)
{
	for (int i = 0; i < n; ++i)
	{
		for (int j = 0; j < n; ++j)
		{
			if (i == j) continue;
			if (contain(S[i].c_str(), S[j].c_str()))// is string i a substring of string j?
			{                                       // if so, delete string i
				S.erase(S.begin() + i);
				--i;
				--n;
				break;
			}
		}
	}
	for (int i = 0; i < n; ++i)// precompute the suffix/prefix match length of every ordered pair
	{
		for (int j = 0; j < n; ++j)
		{
			if (i == j) continue;
			sublen[i][j] = overlap(S[i].c_str(), S[j].c_str());
		}
	}
	for (int i = 0; i < 1 << n; ++i)
	{
		fill(dp[i], dp[i] + n, INF);
	}

	for (int i = 0; i < n; ++i)
	{
		dp[1 << i][i] = S[i].length();
	}
	for (int i = 0; i < 1 << n; ++i)
	{
		for (int j = 0; j < n; ++j)
		{
			if (!(i >> j & 1)) continue;
			const int& curdp = dp[i][j];

			for (int k = 0; k < n; ++k)
			{
				if (i >> k & 1) continue;
				int& nxtdp = dp[i | (1 << k)][k];

				if (curdp + S[k].length() - sublen[k][j] < nxtdp)
				{
					nxtdp = curdp + S[k].length() - sublen[k][j];
				}
			}
		}
	}

	int minl = INF, minn = 0;  // minn: index of the head string of the answer
	int s1 = (1 << n) - 1;
	for (int i = 0; i < n; ++i)
	{
		if (dp[s1][i] < minl || (dp[s1][i] == minl && S[i] < S[minn]))
		{
			minl = dp[s1][i];
			minn = i;
		}
	}
	string res = S[minn];
	for (int s2 = s1 & (~(1 << minn)); s2; s1 = s2, s2 = s1 & (~(1 << minn)))
	{
		string tmp = "z";// start with a string larger than any string over A, C, G, T
		int nxt = -1;
		for (int k = 0; k < n; ++k)
		{
			if ((s2 >> k & 1) && (dp[s1][minn] == dp[s2][k] + S[minn].length() - sublen[minn][k]))// string k can follow minn on an optimal path
			{
				string m = S[k].substr(sublen[minn][k]);
				if (m < tmp)
				{
					tmp = m;
					nxt = k;
				}
			}
		}
		res += tmp;
		minn = nxt;
	}

	return res;
}

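// Substring test using a rolling string hash
// Returns true if str1 is a substring of str2, false otherwise
bool contain(const char* const& str1, const char* const& str2)
{
	int len1 = strlen(str1);
	int len2 = strlen(str2);
	if (len1 <= len2)
	{
		ull h1 = 0, h2 = 0, t = 1;
		for (int i = 0; i < len1; ++i)
		{
			h1 = h1 * B + str1[i];
			h2 = h2 * B + str2[i];
			t *= B;
		}
		// h2 is the hash of str2[i, i + len1); the last window (i + len1 == len2)
		// must be tested too, otherwise a suffix or an equal string is missed
		for (int i = 0; i + len1 <= len2; ++i)
		{
			if (h1 == h2)
			{
				return true;
			}
			if (i + len1 < len2)
			{
				h2 = h2 * B + str2[i + len1] - str2[i] * t;
			}
		}
	}
	return false;
}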

// Maximum suffix/prefix match length using string hashing
// Returns the maximum match length between a suffix of str1 and a prefix of str2
int overlap(const char* const& str1, const char* const& str2)
{
	int len1 = strlen(str1);
	int len2 = strlen(str2);

	int mab = min(len1, len2);
	int res = 0;
	ull h1 = 0, h2 = 0, t = 1;
	for (int i = 1; i <= mab; ++i)
	{
		h1 = h1 + str1[len1 - i] * t;
		h2 = h2 * B + str2[i - 1];
		if (h1 == h2)
		{
			res = i;
		}
		t *= B;
	}
	return res;
}

xhxhxhxhx


POJ1795 DNA Laboratory - bitmask (state compression) - dynamic programming (dp) - lexicographically smallest DP path - traveling salesman problem