Understanding the SciERC Corpus Format

1. Examining the corpus

{"clusters": [[[17, 20], [23, 23]]], "sentences": [["English", "is", "shown", "to", "be", "trans-context-free", "on", "the", "basis", "of", "coordinations", "of", "the", "respectively", "type", "that", "involve", "strictly", "syntactic", "cross-serial", "agreement", "."], ["The", "agreement", "in", "question", "involves", "number", "in", "nouns", "and", "reflexive", "pronouns", "and", "is", "syntactic", "rather", "than", "semantic", "in", "nature", "because", "grammatical", "number", "in", "English", ",", "like", "grammatical", "gender", "in", "languages", "such", "as", "French", ",", "is", "partly", "arbitrary", "."], ["The", "formal", "proof", ",", "which", "makes", "crucial", "use", "of", "the", "Interchange", "Lemma", "of", "Ogden", "et", "al.", ",", "is", "so", "constructed", "as", "to", "be", "valid", "even", "if", "English", "is", "presumed", "to", "contain", "grammatical", "sentences", "in", "which", "respectively", "operates", "across", "a", "pair", "of", "coordinate", "phrases", "one", "of", "whose", "members", "has", "fewer", "conjuncts", "than", "the", "other", ";", "it", "thus", "goes", "through", "whatever", "the", "facts", "may", "be", "regarding", "constructions", "with", "unequal", "numbers", "of", "conjuncts", "in", "the", "scope", "of", "respectively", ",", "whereas", "other", "arguments", "have", "foundered", "on", "this", "problem", "."]], "ner": [[[0, 0, "Material"], [10, 10, "OtherScientificTerm"], [17, 20, "OtherScientificTerm"]], [[23, 23, "Generic"], [29, 29, "OtherScientificTerm"], [31, 32, "OtherScientificTerm"], [42, 43, "OtherScientificTerm"], [45, 45, "Material"], [48, 49, "OtherScientificTerm"], [51, 51, "Material"], [54, 54, "Material"]], [[70, 71, "Method"], [86, 86, "Material"]]], "relations": [[], [[29, 29, 31, 32, "CONJUNCTION"], [48, 49, 51, 51, "FEATURE-OF"], [54, 54, 51, 51, "HYPONYM-OF"]], []], "doc_key": "J87-1003"}

2. Printing the corpus with the code below

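import json

# train.json stores one JSON document per line, so parse it line by line.
with open('scierc_data/processed_data/json/train.json', encoding='utf-8') as f:
    gold_docs = [json.loads(line) for line in f]
# print(gold_docs)

for i in gold_docs:
    print('Clusters:', i['clusters'])
    print("Sentences:", i['sentences'])
    print("Entities:", i['ner'])
    print("Relations:", i["relations"])
    print("Document key:", i["doc_key"])
    # Flatten the per-sentence token lists into one document-level token list;
    # the positions in ner, relations and clusters index into this list.
    all_sentences = []
    for j in i["sentences"]:
        all_sentences += j
    break  # only look at the first document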

Clusters: [[[17, 20], [23, 23]]]
Sentences: [['English', 'is', 'shown', 'to', 'be', 'trans-context-free', 'on', 'the', 'basis', 'of', 'coordinations', 'of', 'the', 'respectively', 'type', 'that', 'involve', 'strictly', 'syntactic', 'cross-serial', 'agreement', '.'], ['The', 'agreement', 'in', 'question', 'involves', 'number', 'in', 'nouns', 'and', 'reflexive', 'pronouns', 'and', 'is', 'syntactic', 'rather', 'than', 'semantic', 'in', 'nature', 'because', 'grammatical', 'number', 'in', 'English', ',', 'like', 'grammatical', 'gender', 'in', 'languages', 'such', 'as', 'French', ',', 'is', 'partly', 'arbitrary', '.'], ['The', 'formal', 'proof', ',', 'which', 'makes', 'crucial', 'use', 'of', 'the', 'Interchange', 'Lemma', 'of', 'Ogden', 'et', 'al.', ',', 'is', 'so', 'constructed', 'as', 'to', 'be', 'valid', 'even', 'if', 'English', 'is', 'presumed', 'to', 'contain', 'grammatical', 'sentences', 'in', 'which', 'respectively', 'operates', 'across', 'a', 'pair', 'of', 'coordinate', 'phrases', 'one', 'of', 'whose', 'members', 'has', 'fewer', 'conjuncts', 'than', 'the', 'other', ';', 'it', 'thus', 'goes', 'through', 'whatever', 'the', 'facts', 'may', 'be', 'regarding', 'constructions', 'with', 'unequal', 'numbers', 'of', 'conjuncts', 'in', 'the', 'scope', 'of', 'respectively', ',', 'whereas', 'other', 'arguments', 'have', 'foundered', 'on', 'this', 'problem', '.']]
Entities: [[[0, 0, 'Material'], [10, 10, 'OtherScientificTerm'], [17, 20, 'OtherScientificTerm']], [[23, 23, 'Generic'], [29, 29, 'OtherScientificTerm'], [31, 32, 'OtherScientificTerm'], [42, 43, 'OtherScientificTerm'], [45, 45, 'Material'], [48, 49, 'OtherScientificTerm'], [51, 51, 'Material'], [54, 54, 'Material']], [[70, 71, 'Method'], [86, 86, 'Material']]]
Relations: [[], [[29, 29, 31, 32, 'CONJUNCTION'], [48, 49, 51, 51, 'FEATURE-OF'], [54, 54, 51, 51, 'HYPONYM-OF']], []]
Document key: J87-1003

3. Interpretation

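From the printed output we can see that "sentences" holds the sentences of the document, each given as a list of tokens. "ner" holds the entities, grouped into one list per sentence; each entity is a triple (start position, end position, entity type). "relations" holds the relations, also grouped per sentence; each relation has five parts (subject start, subject end, object start, object end, relation type). All positions are token indices counted over the whole document rather than within a single sentence, and the end position is inclusive: [17, 20] covers tokens 17 through 20, that is, "strictly syntactic cross-serial agreement".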
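To make the index convention concrete, here is a minimal sketch that turns the numeric spans back into text. It uses only the first two sentences of the sample document above; the helper names (flatten_tokens, span_text, decode_entities, decode_relations) are made up for this illustration and are not part of any SciERC tooling.

```python
# First two sentences of the sample document J87-1003 shown above.
doc = {
    "sentences": [
        ("English is shown to be trans-context-free on the basis of "
         "coordinations of the respectively type that involve strictly "
         "syntactic cross-serial agreement .").split(),
        ("The agreement in question involves number in nouns and reflexive "
         "pronouns and is syntactic rather than semantic in nature because "
         "grammatical number in English , like grammatical gender in "
         "languages such as French , is partly arbitrary .").split(),
    ],
    "ner": [
        [[0, 0, "Material"], [10, 10, "OtherScientificTerm"],
         [17, 20, "OtherScientificTerm"]],
        [[23, 23, "Generic"], [29, 29, "OtherScientificTerm"],
         [31, 32, "OtherScientificTerm"], [42, 43, "OtherScientificTerm"],
         [45, 45, "Material"], [48, 49, "OtherScientificTerm"],
         [51, 51, "Material"], [54, 54, "Material"]],
    ],
    "relations": [
        [],
        [[29, 29, 31, 32, "CONJUNCTION"], [48, 49, 51, 51, "FEATURE-OF"],
         [54, 54, 51, 51, "HYPONYM-OF"]],
    ],
}


def flatten_tokens(doc):
    """Hypothetical helper: join all sentences into one document-level token list."""
    return [tok for sent in doc["sentences"] for tok in sent]


def span_text(tokens, start, end):
    """Hypothetical helper: text of an inclusive [start, end] token span."""
    return " ".join(tokens[start:end + 1])


def decode_entities(doc):
    """Return (entity text, entity type) pairs for every sentence."""
    tokens = flatten_tokens(doc)
    return [(span_text(tokens, s, e), label)
            for sent in doc["ner"] for s, e, label in sent]


def decode_relations(doc):
    """Return (subject text, relation type, object text) triples."""
    tokens = flatten_tokens(doc)
    return [(span_text(tokens, s1, e1), label, span_text(tokens, s2, e2))
            for sent in doc["relations"] for s1, e1, s2, e2, label in sent]


for text, label in decode_entities(doc):
    print(label, "|", text)
for subj, label, obj in decode_relations(doc):
    print(subj, "--", label, "->", obj)
```

Read this way, the three relations in the second sentence link "nouns" with "reflexive pronouns", "grammatical gender" with "languages", and "French" with "languages".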

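Finally, "clusters" lists the coreference clusters: each cluster collects the different positions in the document at which the same entity is mentioned, each given as (start position, end position). In the sample, [17, 20] and [23, 23] (the word "agreement" in the second sentence) refer to the same entity.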
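The same lookup works for clusters. The sketch below again uses the first two sentences of the sample document and assumes, as the sample suggests, that cluster spans are inclusive document-level token indices; decode_clusters is a hypothetical helper name.

```python
# First two sentences of the sample document J87-1003, plus its single cluster.
doc = {
    "sentences": [
        ("English is shown to be trans-context-free on the basis of "
         "coordinations of the respectively type that involve strictly "
         "syntactic cross-serial agreement .").split(),
        ("The agreement in question involves number in nouns and reflexive "
         "pronouns and is syntactic rather than semantic in nature because "
         "grammatical number in English , like grammatical gender in "
         "languages such as French , is partly arbitrary .").split(),
    ],
    "clusters": [[[17, 20], [23, 23]]],
}


def decode_clusters(doc):
    """Hypothetical helper: replace each (start, end) mention with its text."""
    tokens = [tok for sent in doc["sentences"] for tok in sent]
    return [[" ".join(tokens[start:end + 1]) for start, end in cluster]
            for cluster in doc["clusters"]]


for mentions in decode_clusters(doc):
    print(" = ".join(mentions))
```

For this document the only cluster pairs the full phrase in the first sentence with the shorter mention "agreement" in the second.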

Reposted from blog.csdn.net/qq_38901850/article/details/125075286