PTA 05-树9 (Tree 9) Huffman Codes: Problem Analysis and a Full Walkthrough of the Optimal Tree-Building Solution in C

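Full PTA-mooc problem analyses and AC code repository: PTA (拼题A), Zhejiang University Data Structures course on China University MOOC, Spring 2020, AC code and problem analyses (C)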

In 1953, David A. Huffman published his paper “A Method for the Construction of Minimum-Redundancy Codes”, and hence printed his name in the history of computer science. As a professor who gives the final exam problem on Huffman codes, I am encountering a big problem: the Huffman codes are NOT unique. For example, given a string “aaaxuaxz”, we can observe that the frequencies of the characters ‘a’, ‘x’, ‘u’ and ‘z’ are 4, 2, 1 and 1, respectively. We may either encode the symbols as {‘a’=0, ‘x’=10, ‘u’=110, ‘z’=111}, or in another way as {‘a’=1, ‘x’=01, ‘u’=001, ‘z’=000}, both compress the string into 14 bits. Another set of code can be given as {‘a’=0, ‘x’=11, ‘u’=100, ‘z’=101}, but {‘a’=0, ‘x’=01, ‘u’=011, ‘z’=001} is NOT correct since “aaaxuaxz” and “aazuaxax” can both be decoded from the code 00001011001001. The students are submitting all kinds of codes, and I need a computer program to help me determine which ones are correct and which ones are not.

Input Specification:

Each input file contains one test case. For each case, the first line gives an integer N (2≤N≤63), then followed by a line that contains all the N distinct characters and their frequencies in the following format:

c[1] f[1] c[2] f[2] ... c[N] f[N]

where c[i] is a character chosen from {‘0’ - ‘9’, ‘a’ - ‘z’, ‘A’ - ‘Z’, ‘_’}, and f[i] is the frequency of c[i] and is an integer no more than 1000. The next line gives a positive integer M (≤1000), then followed by M student submissions. Each student submission consists of N lines, each in the format:

c[i] code[i]

where c[i] is the i-th character and code[i] is a non-empty string of no more than 63 '0's and '1's.

Output Specification:

For each test case, print in each line either “Yes” if the student’s submission is correct, or “No” if not.

Note: The optimal solution is not necessarily generated by Huffman algorithm. Any prefix code with code length being optimal is considered correct.

Sample Input:

7
A 1 B 1 C 1 D 3 E 3 F 6 G 6
4
A 00000
B 00001
C 0001
D 001
E 01
F 10
G 11
A 01010
B 01011
C 0100
D 011
E 10
F 11
G 00
A 000
B 001
C 010
D 011
E 100
F 101
G 110
A 00000
B 00001
C 0001
D 001
E 00
F 10
G 11

Sample Output:

Yes
Yes
No
No

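Problem Analysis

The same weighted character set can produce different Huffman code sets, in which a given character is encoded differently. In the problem's example, the characters 'a', 'x', 'u' and 'z' have frequencies 4, 2, 1 and 1, and the resulting code set can be {'a'=0, 'x'=10, 'u'=110, 'z'=111}, or {'a'=1, 'x'=01, 'u'=001, 'z'=000}, or yet another assignment.

The input gives a set of characters with their frequencies, followed by a number of submissions. Each submission assigns a code to every character, and the task is to decide whether each submission is a correct Huffman code. Per the note in the problem statement, this means any prefix code of optimal length, not only one produced by the Huffman algorithm.

Note: the codes in each submission appear in the same order as the characters in the input, so there is no need to store the characters themselves. Storing only the frequencies is enough, and each code can be matched to its frequency by index.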

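Approaches (tree-building solutions only)

A given character set with frequencies may admit several different Huffman codes, but they all have the same WPL (weighted path length). I did not think of this at first; I learned it from material I found online.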
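As a quick sanity check of that claim, here is a minimal sketch that computes WPL as the sum of frequency times code length. The helper name wpl_of_codes is my own and is not part of the solution further down.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: WPL = sum over all symbols of frequency * code length. */
int wpl_of_codes(const int *freq, const char *const *codes, int n)
{
    int i, wpl = 0;
    assert(n > 0);
    for (i = 0; i < n; ++i)
        wpl += freq[i] * (int)strlen(codes[i]);
    return wpl;
}
```

With the frequencies {4, 2, 1, 1} from the problem statement, the three valid code sets listed there ({0, 10, 110, 111}, {1, 01, 001, 000} and {0, 11, 100, 101}) all have code lengths 1, 2, 3, 3, so each gives 4*1 + 2*2 + 1*3 + 1*3 = 14.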

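The whole program therefore splits into two major steps:

1. Compute the WPL of the Huffman tree for the given characters and frequencies.
2. For each submission, use that WPL to decide whether the codes correspond to a structurally valid Huffman tree.

Step 2 has three conditions to check:

1. The WPL of the submitted codes equals the WPL of the given character set.
2. No code is a prefix of another code.
3. There is no node of degree 1 (a Huffman tree has no degree-1 nodes).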

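Solving Step 1

There are two ways to do this:

1. Sort the characters by frequency, then repeatedly take the two lowest-frequency entries, merge them into a new node whose frequency is their sum, and insert it back at the correct position. Time complexity is O(N^2).
2. Use a priority queue (a heap). Build a min-heap on the frequencies, then repeatedly pop the top element (the minimum-frequency node) twice, sum the two into a new node, and push it back into the heap. This takes more work to implement than the first method, but runs in O(N*logN).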
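The full solution further down uses the heap, so here is a rough sketch of the first method for comparison. The names wpl_sorted_array and cmp_int are my own, and the function works on a private copy of the frequencies. It relies on the fact that the WPL equals the sum of the weights of all merged (internal) nodes.

```c
#include <assert.h>
#include <stdlib.h>

/* Ascending comparison for qsort; plain subtraction is safe here because
   the frequencies are small non-negative integers (at most 1000). */
static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* Hypothetical helper: WPL via a sorted array, O(N^2) overall.
   Returns -1 if memory allocation fails. */
int wpl_sorted_array(const int *freq, int n)
{
    int i, j, merged, wpl = 0;
    int *w;
    assert(n >= 2);
    w = malloc(n * sizeof(int));
    if (!w) return -1;
    for (i = 0; i < n; ++i) w[i] = freq[i];
    qsort(w, n, sizeof(int), cmp_int);

    /* Invariant: w[i..n-1] is sorted, so w[i] and w[i+1] are the two smallest. */
    for (i = 0; i + 1 < n; ++i) {
        merged = w[i] + w[i + 1];
        wpl += merged;              /* each merge adds one internal node's weight */
        /* Shift smaller entries left and drop merged into its sorted position. */
        j = i + 2;
        while (j < n && w[j] < merged) {
            w[j - 1] = w[j];
            ++j;
        }
        w[j - 1] = merged;
    }
    free(w);
    return wpl;
}
```

For the sample input frequencies 1 1 1 3 3 6 6 the merges produce 2, 3, 6, 9, 12 and 21, so the function returns 53. For the frequencies 4 2 1 1 from the problem statement it returns 14.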

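Solving Step 2

Of the three conditions in Step 2, conditions 1 and 3 can both be computed along the way while checking condition 2, and the check for condition 2 dominates the cost of this step. The discussion below therefore focuses on condition 2.

Again there are two kinds of approach (with M submissions):

1. For each submission, use a double for loop to compare each code against every other code in the same submission and look for a prefix relation. Time complexity is O(M*N^2).
2. For each submission, use a tree to detect prefixes. If inserting a code has to grow the tree below an existing leaf, some earlier code is a prefix of this code. If the node where a code ends already has children, this code is a prefix of some earlier code. Overall time complexity is O(M*N).

Note: neither of the complexities above accounts for the length of each code.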
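The first approach can be sketched as below. This is only an illustration of the double loop, not the code used in the final solution, and the name is_prefix_free is my own. Two identical codes are also treated as a violation.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 if no code is a prefix of (or equal to)
   another code in the same submission, 0 otherwise. */
int is_prefix_free(const char *const *codes, int n)
{
    int i, j;
    size_t li, lj, k;
    assert(n > 0);
    for (i = 0; i < n; ++i) {
        li = strlen(codes[i]);
        for (j = i + 1; j < n; ++j) {
            lj = strlen(codes[j]);
            k = li < lj ? li : lj;      /* length of the shorter code */
            /* If both agree on the shorter length, one is a prefix of the other. */
            if (strncmp(codes[i], codes[j], k) == 0)
                return 0;
        }
    }
    return 1;
}
```

On the code set {0, 10, 110, 111} this returns 1. On the incorrect set {0, 01, 011, 001} from the problem statement it returns 0, because 0 is a prefix of 01.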

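Key Functions in the Optimal Tree-Building Solution

main: first calls readPair to read the characters and their frequencies into a frequency array, then calls HuffmanTreeWPL to compute the WPL, and finally runs check_code once for each of the m submissions.

HuffmanTreeWPL: builds a min-heap from the frequency array, then performs n-1 merges. Each merge pops the two smallest elements off the heap, combines them into a new tree node, and inserts that node back into the heap. The WPL can be accumulated during the merges by adding each new node's weight, because every leaf's frequency is counted once for each internal node above it. There is no need to recompute it after the tree is built; the running total at the end is the final WPL.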

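check_code: each of the three conditions has its own handling.

1. WPL equality is checked by accumulating each code's length times its frequency and comparing the sum.
2. The prefix check is done through node relationships in the tree; RecoverHFTreeByCode clears flag to signal a violation.
3. Degree-1 nodes are detected through the total node count of the rebuilt tree (a correct Huffman tree has 2*n-1 nodes, where n is the number of leaves, i.e. the number of characters).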
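To make the node-count condition concrete, here is a small sketch that counts the nodes of the code tree without building it, by counting distinct prefixes. The name count_trie_nodes is my own, and the brute-force comparison is for illustration only; the real solution counts nodes while it builds the tree.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: number of nodes in the binary tree spanned by the codes,
   i.e. the root plus one node per distinct non-empty prefix. */
int count_trie_nodes(const char *const *codes, int n)
{
    int i, j, seen, count = 1;          /* start at 1 for the root */
    size_t len, li;
    assert(n > 0);
    for (i = 0; i < n; ++i) {
        li = strlen(codes[i]);
        for (len = 1; len <= li; ++len) {
            /* Has this prefix already been produced by an earlier code? */
            seen = 0;
            for (j = 0; j < i && !seen; ++j)
                if (strlen(codes[j]) >= len &&
                    strncmp(codes[i], codes[j], len) == 0)
                    seen = 1;
            if (!seen) ++count;
        }
    }
    return count;
}
```

For {0, 10, 110, 111} the count is 7, which is 2*4-1. For {00, 01, 10, 110} the count is 8, because the node for prefix 11 has only one child, so the third condition fails.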

The overall time complexity is roughly O(N*logN+M*N).

C implementation:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MINH -1
#define MAXCODESIZE 63

typedef struct TreeNode *HuffmanTree;
struct TreeNode {
    int weight;
    HuffmanTree Left, Right;
};

typedef struct TreeNode *ElementType;
typedef struct HeapStruct *MinHeap;
struct HeapStruct {
    ElementType *Elements;
    int Size;
    int Capacity;
};

/* Heap operations: begin */
MinHeap CreateHeap( int MaxSize )
{
    MinHeap H = (MinHeap)malloc(sizeof(struct HeapStruct));
    H->Elements = malloc((MaxSize + 1) * sizeof(ElementType));
    H->Size = 0;
    H->Capacity = MaxSize;
    H->Elements[0] = (ElementType)malloc(sizeof(struct TreeNode));
    H->Elements[0]->weight = MINH;

    return H;
}

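void DestoryHeap(MinHeap H)
{
    int i;
    if (!H) return;
    if (H->Elements) {
        /* Elements[0] is the sentinel and Elements[1..Size] are the live entries,
           so the loop has to run up to and including Size */
        for (i = 0; i <= H->Size; ++i)
            free(H->Elements[i]);
        free(H->Elements);
    }
    free(H);
}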

int IsFull(MinHeap H)
{
    return (H->Size == H->Capacity);
}

int IsEmpty(MinHeap H)
{
    return (H->Size == 0);
}

void Insert(MinHeap H, ElementType X)
{
    int i;
    if (IsFull(H)) return;
    i = ++H->Size;
    for (; H->Elements[i / 2]->weight > X->weight; i /= 2)
        H->Elements[i] = H->Elements[i / 2];
    H->Elements[i] = X;
}

ElementType DeleteMin(MinHeap H)
{
    int Parent, Child;
    ElementType MinItem, X;
    if (IsEmpty(H)) return 0;

    MinItem = H->Elements[1];
    X = H->Elements[H->Size--];
    for (Parent = 1; Parent * 2 <= H->Size; Parent = Child) {
        Child = Parent * 2;
        if ((Child != H->Size) && (H->Elements[Child]->weight > H->Elements[Child + 1]->weight))
            Child++;
        if (X->weight <= H->Elements[Child]->weight) break;
        else
            H->Elements[Parent] = H->Elements[Child];
    }
    H->Elements[Parent] = X;
    return MinItem;
}

MinHeap BuildMinHeap(int *freq_arr,int n)
{
    int i;
    ElementType data;
    MinHeap H;
    H = CreateHeap(n);

    for (i = 0; i < H->Capacity; ++i) {
        data = (ElementType)malloc(sizeof(struct TreeNode));
        data->weight = freq_arr[i];
        data->Left = 0; data->Right = 0;
        Insert(H, data);
    }
    return H;
}
/* Heap operations: end */

void DestoryCharFreq(int *freq_arr)
{
    if (!freq_arr) return;
    free(freq_arr);
}

void DestoryHuffmanTree(HuffmanTree HT)
{
    if (!HT) return;
    DestoryHuffmanTree(HT->Left);
    DestoryHuffmanTree(HT->Right);
    free(HT);
}

int *readPair(int n)
{
    int i; char c;
    int *freq_arr;
    freq_arr = malloc(n * sizeof(int));
    for (i = 0; i < n; ++i) {
        scanf(" %c %d", &c, &freq_arr[i]);
    }
    return freq_arr;
}

int HuffmanTreeWPL(int *freq_arr, int n)
{
    int i, wpl = 0;
    MinHeap H;
    HuffmanTree T;
    H = BuildMinHeap(freq_arr, n);  // build a min-heap from the input frequencies
    for (i = 1; i < H->Capacity; ++i) {
        T = (HuffmanTree)malloc(sizeof(struct TreeNode));
        T->Left = DeleteMin(H);
        T->Right = DeleteMin(H);
        T->weight = T->Left->weight + T->Right->weight;
        Insert(H, T);
        wpl += T->weight;   // accumulate WPL as the sum of all merged node weights
    }
    T = DeleteMin(H);
    DestoryHeap(H);
    DestoryHuffmanTree(T);    // free the memory used by the Huffman tree
    return wpl;
}

void init_codeArr(char *code)   // reset the previously used part of code to '\0'
{
    int i;
    for (i = 0; i < MAXCODESIZE && code[i] != '\0'; ++i)
        code[i] = '\0';
}

HuffmanTree createTreeNode()
{
    HuffmanTree HT;
    HT = (HuffmanTree)malloc(sizeof(struct TreeNode));
    HT->weight = 1; HT->Left = 0; HT->Right = 0;
    return HT;
}

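// The weight field of struct TreeNode is reused as a marker here: 1 means the node was
// newly created by the current code, 0 means it already existed
HuffmanTree RecoverHFTreeByCode(HuffmanTree HT, char *code, int *flag, int *counter)
{
    int i;  HuffmanTree node;
    if (!HT) {
        HT = createTreeNode();
        ++(*counter);
    }
    for (i = 0, node = HT; code[i] != '\0';  ++i) {
        // Case 1: the node already existed and has no children, so it is the leaf of an
        // earlier code, which means that earlier code is a prefix of this one
        if (node->weight == 0 && !node->Left && !node->Right) {
            (*flag) = 0;
            break;
        }
        node->weight = 0;
        if (code[i] == '0') {   // on '0', move one step to the left child
            if (!node->Left) {
                node->Left = createTreeNode();
                ++(*counter);
            }
            node = node->Left;
        } else {                // on '1', move one step to the right child
            if (!node->Right) {
                node->Right = createTreeNode();
                ++(*counter);
            }
            node = node->Right;
        }
    }
    // Case 2: the node where this code ends has children (this code is a prefix of an
    // earlier code), or it already existed as a leaf (this code repeats an earlier code)
    if (node->Left || node->Right || node->weight == 0)
        (*flag) = 0;
    // Mark the end node as an existing leaf. Without this it keeps weight 1 and
    // Case 1 can never trigger when a later code passes through it.
    node->weight = 0;
    return HT;
}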

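void check_code(int *freq_arr, int n, int wpl)
{
    int i, sum_wpl, flag, counter; char c;
    HuffmanTree HT;
    char code[MAXCODESIZE + 1] = "\0";  // up to 63 characters plus the terminating '\0'
    sum_wpl = 0; flag = 1; counter = 0; HT = 0;
    for (i = 0; i < n; ++i) {
        init_codeArr(code);
        scanf("\n%c %63s", &c, code);   // width limit keeps scanf inside the buffer
        if (flag) { // keep processing only while the submission is still valid
            sum_wpl += (int)strlen(code) * freq_arr[i];
            HT = RecoverHFTreeByCode(HT, code, &flag, &counter);
        }
    }
    // Three conditions are checked here
    // 1. The submitted WPL equals the precomputed WPL
    // 2. No code is a prefix of another; this is tracked by flag
    // 3. No node has degree 1; counter is the number of nodes in the rebuilt tree, which
    //    must equal 2*n-1 for a correct tree, where n is the number of leaves (characters)
    if (sum_wpl == wpl && flag && counter == 2 * n - 1)
        printf("Yes\n");
    else
        printf("No\n");
    DestoryHuffmanTree(HT);
}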

int main()
{
    int n, m, i, wpl;
    int *freq_arr;
    scanf("%d\n", &n);

    freq_arr = readPair(n);     // read all characters and their frequencies
    wpl = HuffmanTreeWPL(freq_arr, n);    // build the Huffman tree with a min-heap and compute its WPL

    scanf("%d", &m);
    for (i = 0; i < m; ++i) {
        check_code(freq_arr, n, wpl);
    }

    DestoryCharFreq(freq_arr);

    return 0;
}

北顾.岛城
