Repeated DNA Sequences

在给定字符串中寻找重复出现的序列，每个序列长度为10

可以采用unordered_map记录每个序列出现的个数，将出现超过一次的添加到结果集中

代码如下

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        if(s.size() < 10)   return {};

        vector<string> res;
        unordered_map<string, int> hash;
        size_t first = 0;
        size_t last = 10;
        while(last <= s.size())
        {
            auto str = s.substr(first, last - first);
            if(hash[str] == 1)
                res.emplace_back(str);
            ++hash[str];
            ++first;
            ++last;
        }
        return res;
    }
};

但是这种方法每次都需要调用substr获取子串，容易造成性能瓶颈，有什么方法不用调用substr也能判断当前的这个子串出现过呢

由于规定了子串长度为10，而且子串中只能出现”AGCT“四个字符中的一个，那么可以考虑用20个bit来表示长度为10的子串，其中每个字符占两bit。随后采用滑动窗口的思想，新到的字符添加到20bit的低位，溢出的字符丢掉

代码如下

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        if(s.size() < 10)   return {};

        vector<string> res;
        unordered_map<int, int> hash;
        int val = 0;
        /* 掩码，用于将左溢出的两位清零 */
        int mask = (1 << 20) - 1;
        /* 每个字符占两位，toBit要保证能区分开四个字符 */
        for(int i = 0; i < 10; ++i)
            val = (val << 2) | toBit(s[i]);
        hash[val] = 1;
        for(int i = 10; i < s.size(); ++i)
        {
            val = ((val << 2) | toBit(s[i])) & mask;
            if(hash[val] == 1)
                res.emplace_back(s.substr(i - 10 + 1, 10));
            ++hash[val];
        }
        return res;
    }
private:
    int toBit(char ch)
    {
        switch(ch)
        {
            case 'A':
                return 0;
            case 'G':
                return 1;
            case 'C':
                return 2;
            case 'T':
                return 3;
        }
    }
};

每天一道LeetCode-----寻找给定字符串中重复出现的子串

Repeated DNA Sequences

猜你喜欢