Purpose

Extract the body text, tables, and images from a docx document in the order in which they appear.

Approach

Two approaches were tried:

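1. paragraphs.runs

Create a Document object, use a regular expression to find the relationship ID (rId) in each run's XML, and store the results in a list;
this relies on the runs attribute of each paragraph, so pictures in the body text come out in order, but tables cannot be handled this way.
Code: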

# Extracts body text and pictures in order; cannot handle pictures inside tables.
# Takes a file path and returns a string.
# Requires python-docx. pick_method is the author's own helper module (not shown).
import re

from docx import Document

import pick_method


def get_text_picture_not_table(file_name):
    doc = Document(file_name)
    paragraphs = []
    pattern = re.compile(r'rId\d+')

    for paragraph in doc.paragraphs:
        pieces = []
        for run in paragraph.runs:
            if run.text != '':
                pieces.append(run.text)
                continue

            # A run without text may hold a picture; look for its relationship ID.
            match = pattern.search(run.element.xml)
            if match is None:
                continue
            content_id = match.group(0)

            try:
                part = doc.part.related_parts[content_id]
            except KeyError as e:
                print(e)
                continue

            if part.content_type.startswith('image'):
                # Dump the image to a temporary file and convert it to text.
                with open('/result/temp', 'wb') as f:
                    f.write(part.blob)
                pieces.append(str(pick_method.get_picture_text('/result/temp')))

        paragraphs.append(pieces)

    all_text = ''.join(piece for pieces in paragraphs for piece in pieces)

    # Tables: only the first column of the first table is read.
    if doc.tables:
        table = doc.tables[0]
        for i in range(len(table.rows)):
            # cell(i, 0) is row i+1, column 1, and so on
            all_text += f'{table.cell(i, 0).text:<5}'

    return all_text
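
To illustrate the relationship-ID lookup used in the function above, here is a minimal standalone sketch that works on a plain XML string. The sample XML is a hypothetical, heavily shortened run, and find_embedded_ids is a name made up for this example; it matches the r:embed attribute specifically rather than any rId substring.

```python
import re

# Matches the relationship ID inside an r:embed attribute, e.g. r:embed="rId5".
EMBED_ID = re.compile(r'r:embed="(rId\d+)"')


def find_embedded_ids(run_xml):
    """Return all relationship IDs referenced by r:embed in a run's XML string."""
    return EMBED_ID.findall(run_xml)


# Hypothetical, heavily shortened XML of a run that holds one inline picture.
sample_run_xml = (
    '<w:r><w:drawing><wp:inline><a:graphic><a:graphicData>'
    '<pic:pic><pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill></pic:pic>'
    '</a:graphicData></a:graphic></wp:inline></w:drawing></w:r>'
)

print(find_embedded_ids(sample_run_xml))              # ['rId5']
print(find_embedded_ids('<w:r><w:t>hi</w:t></w:r>'))  # []
```

A run with empty text is not always a picture (for example, it may only hold a field marker), so the lookup can come back empty and the caller has to allow for that.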

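2. part._rels

A docx file is essentially a zip archive, so this approach looks up the relationships (rels) stored inside it and tries to parse each XML part;
pictures can be read and written as binary data, but the information for body text and tables could not be parsed.
Code: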

# Can tell body text, pictures, and tables apart by position, but cannot
# extract everything; only pictures are extracted.
# To be continued when there is time and a need.
# Requires python-docx. pick_method is the author's own helper module (not shown).
import os

import docx

import pick_method


def get_all_in_docx(word_path, result_path):
    r_name = pick_method.get_file_name(word_path)
    # Create the output directory before any file is opened inside it.
    os.makedirs(result_path, exist_ok=True)
    # Base name of the input file without its extension, used as a prefix.
    word_name = os.path.splitext(os.path.basename(word_path))[0]

    # os.path.join means result_path need not end with / or \
    with open(os.path.join(result_path, r_name + '.txt'), 'w', encoding='utf-8') as r_f:
        doc = docx.Document(word_path)
        for rel in doc.part._rels.values():
            print(rel)

            if rel.is_external:
                # External targets (e.g. hyperlinks) have no part to read.
                continue

            if 'image' in rel.target_ref:
                # target_ref looks like "media/image1.png"
                img_name = f'{word_name}_{os.path.basename(rel.target_ref)}'
                img_path = os.path.join(result_path, img_name)
                with open(img_path, 'wb') as f:
                    f.write(rel.target_part.blob)

                r_f.write(str(pick_method.get_picture_text(img_path)))

            elif 'notes' in rel.target_ref:
                print('t')
                print(rel)
                # r_f.write()
            else:
                print('p')
                print(rel.target_part.blob)
                # r_f.write()
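
Since a docx file is a zip archive, the same relationship table can also be read with the standard library alone. The following is a rough sketch under that assumption, not the python-docx way of doing it: read_relationships and build_demo_archive are names made up here, and the demo archive contains only a relationships file, so it is not a valid Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

RELS_PATH = "word/_rels/document.xml.rels"
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
TYPE_BASE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/"


def read_relationships(docx_file):
    """Map each relationship ID of the main document part to (target, kind)."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read(RELS_PATH))
    rels = {}
    for rel in root.iter(REL_NS + "Relationship"):
        # The last segment of the Type URI names the kind: image, styles, ...
        kind = rel.get("Type", "").rsplit("/", 1)[-1]
        rels[rel.get("Id")] = (rel.get("Target"), kind)
    return rels


def build_demo_archive():
    """Build an in-memory zip holding only a hypothetical relationships file."""
    rels_xml = (
        f'<Relationships xmlns="{REL_NS[1:-1]}">'
        f'<Relationship Id="rId1" Type="{TYPE_BASE}styles" Target="styles.xml"/>'
        f'<Relationship Id="rId5" Type="{TYPE_BASE}image" Target="media/image1.png"/>'
        '</Relationships>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(RELS_PATH, rels_xml)
    buffer.seek(0)
    return buffer


for rel_id, (target, kind) in read_relationships(build_demo_archive()).items():
    print(rel_id, kind, target)
```

The relationship table says which parts exist and what kind they are, but not where they appear in the text; the position comes from word/document.xml, which refers to these IDs.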

That is as far as this has gotten for now.
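
One possible way to continue, sketched here with the standard library only and not tested against real documents: walk the children of the body element in word/document.xml, which are stored in document order, and report text, tables, and picture references as they come. The names iter_body_items, paragraph_items, and build_demo_docx are made up for this sketch, the demo archive is a bare-bones stand-in rather than a valid Word file, and text boxes or nested tables get no special treatment.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"


def paragraph_items(p):
    """Yield ('text', str) and ('image', rId) items of one paragraph in order."""
    for node in p.iter():
        if node.tag == W + "t" and node.text:
            yield ("text", node.text)
        elif node.tag == A + "blip" and node.get(R + "embed"):
            yield ("image", node.get(R + "embed"))


def iter_body_items(docx_file):
    """Yield body content in document order, descending into table cells."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for child in root.find(W + "body"):
        if child.tag == W + "p":
            yield from paragraph_items(child)
        elif child.tag == W + "tbl":
            # Every paragraph inside the table, cell by cell.
            for cell_paragraph in child.iter(W + "p"):
                for kind, value in paragraph_items(cell_paragraph):
                    yield ("table-" + kind, value)


def build_demo_docx():
    """Build an in-memory zip with a minimal, hypothetical document.xml."""
    ns = f'xmlns:w="{W[1:-1]}" xmlns:a="{A[1:-1]}" xmlns:r="{R[1:-1]}"'
    document_xml = (
        f'<w:document {ns}><w:body>'
        '<w:p><w:r><w:t>intro</w:t></w:r></w:p>'
        '<w:tbl><w:tr><w:tc><w:p><w:r><w:t>cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>'
        '<w:p><w:r><w:drawing><a:blip r:embed="rId5"/></w:drawing></w:r>'
        '<w:r><w:t>outro</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    buffer.seek(0)
    return buffer


for item in iter_body_items(build_demo_docx()):
    print(item)
```

Each ('image', rId) item can then be resolved to a file inside the archive through the relationship table, and its bytes read from the zip in the same way.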
