Using the Evaluation Metrics Tool

Evaluating a trained model requires evaluation metrics, such as accuracy, precision, recall, and the F1 score. Different task types naturally call for different metrics, and HuggingFace provides a unified evaluation metrics tool for them.

1. List the available metrics
List the available metrics with the list_metrics() function:

def list_metric_test():
    # Chapter 4 / List the available metrics
    from datasets import list_metrics
    metrics_list = list_metrics()
    print(len(metrics_list), metrics_list[:5])

The output is as follows:

157 ['accuracy', 'bertscore', 'bleu', 'bleurt', 'brier_score']

As the output shows, there are currently 157 metrics available, and the first 5 of them are printed.

2. Load a metric
Load a metric with load_metric(). Note that some metrics are designed to be used together with a specific dataset; here the mrpc subset of the glue dataset serves as an example:

def load_metric_test():
    # Chapter 4 / Load a metric
    from datasets import load_metric
    metric = load_metric(path="accuracy")  # load the accuracy metric
    print(metric)

    # Chapter 4 / Load a metric tied to a dataset
    from datasets import load_metric
    metric = load_metric(path='glue', config_name='mrpc')  # load the metric for the mrpc subset of glue
    print(metric)
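To make it concrete what the loaded accuracy metric computes, here is a minimal pure-Python sketch of the underlying formula (an illustration only, not the library's actual implementation): accuracy is simply the fraction of predictions that exactly match the references.

```python
def accuracy(predictions, references):
    # fraction of positions where prediction equals reference
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}

print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # {'accuracy': 0.75}
```

The library version behaves the same way on a list of label pairs, but also supports incremental accumulation over batches.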

3. Get a metric's usage description
A metric's inputs_description attribute describes how to use the metric, as shown below:

def load_metric_description_test():
    # Chapter 4 / Get a metric's usage description
    from datasets import load_metric
    glue_metric = load_metric('glue', 'mrpc')  # load the metric for the mrpc subset of glue
    print(glue_metric.inputs_description)

    references = [0, 1]
    predictions = [0, 1]
    results = glue_metric.compute(predictions=predictions, references=references)
    print(results)  # {'accuracy': 1.0, 'f1': 1.0}

The output is as follows:

Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0, 'f1': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'stsb')
    >>> references = [0., 1., 2., 3., 4., 5.]
    >>> predictions = [0., 1., 2., 3., 4., 5.]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
    {'pearson': 1.0, 'spearmanr': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'cola')
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'matthews_correlation': 1.0}

{'accuracy': 1.0, 'f1': 1.0}

The output first gives the metric's usage description, then the computed accuracy and f1 values.
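For reference, the two numbers the glue/mrpc metric returns can be reproduced by hand. The sketch below (an illustration under the assumption of binary labels with 1 as the positive class, not the library's code) computes accuracy as the match rate and F1 as the harmonic mean of precision and recall:

```python
def glue_mrpc_style_metrics(predictions, references):
    # accuracy: fraction of exact matches
    acc = sum(p == r for p, r in zip(predictions, references)) / len(references)
    # binary F1 with positive class 1: harmonic mean of precision and recall
    tp = sum(p == 1 and r == 1 for p, r in zip(predictions, references))
    fp = sum(p == 1 and r == 0 for p, r in zip(predictions, references))
    fn = sum(p == 0 and r == 1 for p, r in zip(predictions, references))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": acc, "f1": f1}

print(glue_mrpc_style_metrics([0, 1], [0, 1]))  # {'accuracy': 1.0, 'f1': 1.0}
```

With the perfect predictions used in the example above, both values are 1.0, matching the library output.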


Reposted from blog.csdn.net/shengshengwang/article/details/131427026