Moving to PyTorch (3): Keeping Pace with the State of the Art (Using Albert as an Example)

Preface

In the previous posts we covered the basic concepts and walked through a simple example to get familiar with PyTorch's building blocks and training workflow. In this post we will work directly with Transformers, the state-of-the-art NLP library, and use Albert to build a state-of-the-art model of our own.
We will read through the official transformers example MM_IMDB and then try to build our own text classification model. Why the transformers library? Because it puts a uniform interface on many state-of-the-art models, which makes them much easier to use.

1. First Look

The program breaks down into five parts: setting the random seed, the training loop, the evaluation loop, data loading, and the main routine. Below we examine each part in turn, focusing on the reasoning behind the code.

def set_seed(args)
def train(args, train_dataset, model, tokenizer, criterion)
def evaluate(args, model, tokenizer, criterion, prefix="")
def load_examples(args, tokenizer, evaluate=False)
def main()

1.1 Setting the Seed

def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)

The seeding code is simple; its purpose is to make our runs reproducible. It boils down to four steps:

  1. Seed Python's random module
  2. Seed NumPy
  3. Seed torch
  4. Seed CUDA (all devices)

1.2 The Training Loop

The training function is long, so let's start with its parameters and return values. The parameters are the experiment arguments args, the training set train_dataset, the model model, the tokenizer tokenizer, and the loss function criterion. It returns the global step count global_step and the average loss tr_loss / global_step.

def train(args, train_dataset, model, tokenizer, criterion):
    "省略主要代码"
    return global_step, tr_loss / global_step

Now for the details. The function is large, so we will only walk through the important pieces.

1.2.1 The Training Data Loader

train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
train_dataloader = DataLoader(
    train_dataset,
    sampler=train_sampler,
    batch_size=args.train_batch_size,
    collate_fn=collate_fn,
    num_workers=args.num_workers,
)

These few lines set up the training sampler train_sampler and the training data loader train_dataloader; together they turn the dataset into batches the model can consume. The key piece is the collate_fn discussed in the previous post: it lets you restructure how a batch is assembled, a bit like data formatting. The code below shows what the original example's collate_fn actually does, which is simply preparing the batch tensors.

def collate_fn(batch):
    lens = [len(row["sentence"]) for row in batch]
    bsz, max_seq_len = len(batch), max(lens)

    mask_tensor = torch.zeros(bsz, max_seq_len, dtype=torch.long)
    text_tensor = torch.zeros(bsz, max_seq_len, dtype=torch.long)

    for i_batch, (input_row, length) in enumerate(zip(batch, lens)):
        text_tensor[i_batch, :length] = input_row["sentence"]
        mask_tensor[i_batch, :length] = 1

    img_tensor = torch.stack([row["image"] for row in batch])
    tgt_tensor = torch.stack([row["label"] for row in batch])
    img_start_token = torch.stack([row["image_start_token"] for row in batch])
    img_end_token = torch.stack([row["image_end_token"] for row in batch])

    return text_tensor, mask_tensor, img_tensor, img_start_token, img_end_token, tgt_tensor
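
To make the padding behaviour concrete, here is a hypothetical toy batch run through the collate_fn above (the token ids, image shapes, and label layout are invented for illustration):

import torch  # collate_fn is the function defined above

batch = [
    {
        "sentence": torch.tensor([101, 7592, 102]),             # length 3
        "image": torch.zeros(3, 224, 224),
        "label": torch.tensor([1.0, 0.0]),
        "image_start_token": torch.tensor(101),
        "image_end_token": torch.tensor(102),
    },
    {
        "sentence": torch.tensor([101, 7592, 2088, 999, 102]),  # length 5
        "image": torch.zeros(3, 224, 224),
        "label": torch.tensor([0.0, 1.0]),
        "image_start_token": torch.tensor(101),
        "image_end_token": torch.tensor(102),
    },
]

text, mask, img, img_start, img_end, tgt = collate_fn(batch)
print(text.shape, mask.shape)  # torch.Size([2, 5]) torch.Size([2, 5]): padded to the longest sentence
print(mask[0])                 # tensor([1, 1, 1, 0, 0]): real tokens vs. padding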

1.2.2 Optimizer and Schedule

# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": args.weight_decay,
    },
    {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]

optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
)

This block sets up the optimizer and its schedule: which parameters are trained, their weight decay, the learning rate, and a linear warmup/decay schedule. Note that biases and LayerNorm weights are excluded from weight decay. Everything here prepares for the training loop that follows.
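
One name the excerpt leaves undefined is t_total, the total number of optimizer updates the scheduler needs in order to plan its decay. In the official script it is derived from the loader length, the gradient accumulation steps, and the epoch count, roughly as follows:

# Computing t_total, mirroring the official example script
if args.max_steps > 0:
    t_total = args.max_steps
    args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
else:
    t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs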

1.2.3 Multi-GPU and Distributed Training

# multi-gpu training (should be after apex fp16 initialization)
if args.n_gpu > 1:
    model = torch.nn.DataParallel(model)

# Distributed training (should be after apex fp16 initialization)
if args.local_rank != -1:
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
    )

This is the multi-GPU and distributed-training setup, which will come in handy later; it only requires hardware support. If your hardware doesn't support it, the default arguments fall back to single-device training.

1.2.4 The Training Loop Proper

Now for the main event: the actual training loop. The code is long, so we again take it piece by piece.

1.2.4.1 Logging

# Train!
logger.info("***** Running training *****")
logger.info("  Num examples = %d", len(train_dataset))
logger.info("  Num Epochs = %d", args.num_train_epochs)
logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
logger.info(
    "  Total train batch size (w. parallel, distributed & accumulation) = %d",
    args.train_batch_size
    * args.gradient_accumulation_steps
    * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
)
logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
logger.info("  Total optimization steps = %d", t_total)

This step could be skipped, but having it makes it much clearer what the run is actually doing.

1.2.4.2 Initializing Training State

global_step = 0
tr_loss, logging_loss = 0.0, 0.0
best_f1, n_no_improve = 0, 0
model.zero_grad()
train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
set_seed(args)  # Added here for reproducibility

There is not much to say here: it just initializes counters and state for the loop that follows.

1.2.4.3 The Iteration Loop

This is the heart of the training process: a double loop over epochs and steps.

for _ in train_iterator:
    epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
    for step, batch in enumerate(epoch_iterator):

Each batch is then moved to the device and fed to the model via outputs = model(**inputs). The ** denotes keyword-argument unpacking: inputs is really just a dict, so when we build our own model we can feed it inputs in dictionary form.

model.train()
batch = tuple(t.to(args.device) for t in batch)
labels = batch[5]
inputs = {
    "input_ids": batch[0],
    "input_modal": batch[2],
    "attention_mask": batch[1],
    "modal_start_tokens": batch[3],
    "modal_end_tokens": batch[4],
}
outputs = model(**inputs)
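
If keyword-argument unpacking is new to you, here is a minimal, self-contained illustration (function and values invented) of what model(**inputs) expands to:

def f(input_ids, attention_mask):
    return input_ids, attention_mask

inputs = {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1]}
print(f(**inputs))  # equivalent to f(input_ids=[1, 2, 3], attention_mask=[1, 1, 1])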

Next we consume the model's output and compute the loss. Outputs in transformers follow a convention: they are tuples, with contents as described in the documentation. Here, though, the model is custom, so while the output is still a tuple, its contents differ; we will see this in the model section.

logits = outputs[0]  # model outputs are always tuple in transformers (see doc)
loss = criterion(logits, labels)

if args.n_gpu > 1:
    loss = loss.mean()  # mean() to average on multi-gpu parallel training
if args.gradient_accumulation_steps > 1:
    loss = loss / args.gradient_accumulation_steps

if args.fp16:
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
else:
    loss.backward()

tr_loss += loss.item()
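
The excerpt stops at accumulating the loss. The actual weight update happens once every gradient_accumulation_steps batches; in the official script that part looks roughly like this (the fp16 gradient-clipping branch is simplified away):

if (step + 1) % args.gradient_accumulation_steps == 0:
    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
    optimizer.step()
    scheduler.step()  # advance the linear warmup/decay schedule
    model.zero_grad()
    global_step += 1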

Next comes logging, and here you can also see the evaluate-during-training path.

if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
    logs = {}
    if (
        args.local_rank == -1 and args.evaluate_during_training
    ):  # Only evaluate when single GPU otherwise metrics may not average well
        results = evaluate(args, model, tokenizer, criterion)
        for key, value in results.items():
            eval_key = "eval_{}".format(key)
            logs[eval_key] = value

    loss_scalar = (tr_loss - logging_loss) / args.logging_steps
    learning_rate_scalar = scheduler.get_lr()[0]
    logs["learning_rate"] = learning_rate_scalar
    logs["loss"] = loss_scalar
    logging_loss = tr_loss

    for key, value in logs.items():
        tb_writer.add_scalar(key, value, global_step)
    print(json.dumps({**logs, **{"step": global_step}}))

The last piece saves a model checkpoint; it also shows the canonical way to save a model.

if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
    # Save model checkpoint
    output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    model_to_save = (
        model.module if hasattr(model, "module") else model
    )  # Take care of distributed/parallel training
    torch.save(model_to_save.state_dict(), os.path.join(output_dir, WEIGHTS_NAME))
    torch.save(args, os.path.join(output_dir, "training_args.bin"))
    logger.info("Saving model checkpoint to %s", output_dir)

The loop ends with two early-stopping checks: if you set a maximum number of steps, training stops once it is reached; otherwise training stops once the evaluation metric fails to improve for more than args.patience epochs.

            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

        if args.local_rank == -1:
            results = evaluate(args, model, tokenizer, criterion)
            if results["micro_f1"] > best_f1:
                best_f1 = results["micro_f1"]
                n_no_improve = 0
            else:
                n_no_improve += 1

            if n_no_improve > args.patience:
                train_iterator.close()
                break

1.3 The Evaluation Loop

Evaluation mirrors training: it loads the data, evaluates it, prints the results, and saves them, four parts in all. Again we start with the parameters:

def evaluate(args, model, tokenizer, criterion, prefix="")

The main parameter is again args; model is the model, tokenizer the tokenizer, criterion the loss function, and prefix an identifier used to tag the output.

1.3.1 Loading the Data

# Loop to handle MNLI double evaluation (matched, mis-matched)
eval_output_dir = args.output_dir
eval_dataset = load_examples(args, tokenizer, evaluate=True)

if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
    os.makedirs(eval_output_dir)

args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
# Note that DistributedSampler samples randomly
eval_sampler = SequentialSampler(eval_dataset)
eval_dataloader = DataLoader(
    eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate_fn
)

# multi-gpu eval
if args.n_gpu > 1:
    model = torch.nn.DataParallel(model)

Loading the data is much the same as in training; it uses load_examples, which we will explain shortly. The evaluation loop below is also similar to the training loop, the main difference being the metric computation added at the end.

# Eval!
logger.info("***** Running evaluation {} *****".format(prefix))
logger.info("  Num examples = %d", len(eval_dataset))
logger.info("  Batch size = %d", args.eval_batch_size)
eval_loss = 0.0
nb_eval_steps = 0
preds = None
out_label_ids = None
for batch in tqdm(eval_dataloader, desc="Evaluating"):
    model.eval()
    batch = tuple(t.to(args.device) for t in batch)

    with torch.no_grad():
        labels = batch[5]
        inputs = {
            "input_ids": batch[0],
            "input_modal": batch[2],
            "attention_mask": batch[1],
            "modal_start_tokens": batch[3],
            "modal_end_tokens": batch[4],
        }
        outputs = model(**inputs)
        logits = outputs[0]  # model outputs are always tuple in transformers (see doc)
        tmp_eval_loss = criterion(logits, labels)
        eval_loss += tmp_eval_loss.mean().item()
    nb_eval_steps += 1
    if preds is None:
        preds = torch.sigmoid(logits).detach().cpu().numpy() > 0.5
        out_label_ids = labels.detach().cpu().numpy()
    else:
        preds = np.append(preds, torch.sigmoid(logits).detach().cpu().numpy() > 0.5, axis=0)
        out_label_ids = np.append(out_label_ids, labels.detach().cpu().numpy(), axis=0)

eval_loss = eval_loss / nb_eval_steps
result = {
    "loss": eval_loss,
    "macro_f1": f1_score(out_label_ids, preds, average="macro"),
    "micro_f1": f1_score(out_label_ids, preds, average="micro"),
}
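
A quick aside on the two averages: micro-F1 pools every (sample, label) decision into a single confusion matrix, while macro-F1 computes F1 per label and then averages. A toy multi-label example (values invented) shows the difference:

import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(f1_score(y_true, y_pred, average="micro"))  # 0.8: pooled over all label decisions
print(f1_score(y_true, y_pred, average="macro"))  # ~0.667: mean of per-label F1 (label 3 has no predicted positives, so its F1 counts as 0)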

Then the results are displayed and written out.

output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
with open(output_eval_file, "w") as writer:
    logger.info("***** Eval results {} *****".format(prefix))
    for key in sorted(result.keys()):
        logger.info("  %s = %s", key, str(result[key]))
        writer.write("%s = %s\n" % (key, str(result[key])))

return result

1.4 Loading the Data

def load_examples(args, tokenizer, evaluate=False):
    path = os.path.join(args.data_dir, "dev.jsonl" if evaluate else "train.jsonl")
    transforms = get_image_transforms()
    labels = get_mmimdb_labels()
    dataset = JsonlDataset(path, tokenizer, transforms, labels, args.max_seq_length - args.num_image_embeds - 2)
    return dataset

There is not much to this function itself, but JsonlDataset deserves a closer look, because it inherits from the all-important base class Dataset, the counterpart of DataLoader. All you need to provide are the initializer __init__, the length function __len__, and the item getter __getitem__. Here __getitem__ returns a dict containing all the features of one sample; the DataLoader then assembles individual samples into batches before they reach the model, and that assembly is exactly the collate_fn function we saw earlier.

class JsonlDataset(Dataset):
    def __init__(self, data_path, tokenizer, transforms, labels, max_seq_length):
        self.data = [json.loads(l) for l in open(data_path)]
        self.data_dir = os.path.dirname(data_path)
        self.tokenizer = tokenizer
        self.labels = labels
        self.n_classes = len(labels)
        self.max_seq_length = max_seq_length

        self.transforms = transforms

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sentence = torch.LongTensor(self.tokenizer.encode(self.data[index]["text"], add_special_tokens=True))
        start_token, sentence, end_token = sentence[0], sentence[1:-1], sentence[-1]
        sentence = sentence[: self.max_seq_length]

        label = torch.zeros(self.n_classes)
        label[[self.labels.index(tgt) for tgt in self.data[index]["label"]]] = 1

        image = Image.open(os.path.join(self.data_dir, self.data[index]["img"])).convert("RGB")
        image = self.transforms(image)

        return {
            "image_start_token": start_token,
            "image_end_token": end_token,
            "sentence": sentence,
            "image": image,
            "label": label,
        }

    def get_label_frequencies(self):
        label_freqs = Counter()
        for row in self.data:
            label_freqs.update(row["label"])
        return label_freqs
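
Putting the pieces together, a hypothetical end-to-end wiring of JsonlDataset, collate_fn, and DataLoader (the path and batch size are invented; it assumes the args namespace and helper functions from the example script) would look like:

transforms = get_image_transforms()
labels = get_mmimdb_labels()
max_text_len = args.max_seq_length - args.num_image_embeds - 2  # leave room for image embeds + special tokens
dataset = JsonlDataset("data/train.jsonl", tokenizer, transforms, labels, max_text_len)

loader = DataLoader(dataset, batch_size=4, collate_fn=collate_fn)
text, mask, img, img_start, img_end, tgt = next(iter(loader))  # one padded multimodal batch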

1.5 The Main Function

The main function opens with an argparse parser. It defines a great many arguments, so we skip it here and start from the body.

1.5.1 Loading the Model

Loading the model takes a few key steps: first load the original transformer_config, tokenizer, and transformer model, then build our own model and config on top of them. In section 2 we will look in detail at how this model and config are constructed.

# Setup model
labels = get_mmimdb_labels()
num_labels = len(labels)
args.model_type = args.model_type.lower()
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
transformer_config = config_class.from_pretrained(
    args.config_name if args.config_name else args.model_name_or_path
)
tokenizer = tokenizer_class.from_pretrained(
    args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
    do_lower_case=args.do_lower_case,
    cache_dir=args.cache_dir if args.cache_dir else None,
)
transformer = model_class.from_pretrained(
    args.model_name_or_path, config=transformer_config, cache_dir=args.cache_dir if args.cache_dir else None
)
img_encoder = ImageEncoder(args)
config = MMBTConfig(transformer_config, num_labels=num_labels)
model = MMBTForClassification(config, transformer, img_encoder)

if args.local_rank == 0:
    torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

model.to(args.device)

logger.info("Training/evaluation parameters %s", args)

1.5.2 Training

This part calls the training function defined earlier. It also shows how to save and then reload the trained model. Note the class weighting: each label's weight is the inverse of its relative frequency in the training set, passed to BCEWithLogitsLoss as pos_weight so that rare labels are not drowned out.

# Training
if args.do_train:
    train_dataset = load_examples(args, tokenizer, evaluate=False)
    label_frequences = train_dataset.get_label_frequencies()
    label_frequences = [label_frequences[l] for l in labels]
    label_weights = (
        torch.tensor(label_frequences, device=args.device, dtype=torch.float) / len(train_dataset)
    ) ** -1
    criterion = nn.BCEWithLogitsLoss(pos_weight=label_weights)
    global_step, tr_loss = train(args, train_dataset, model, tokenizer, criterion)
    logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

# Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
    # Create output directory if needed
    if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
        os.makedirs(args.output_dir)

    logger.info("Saving model checkpoint to %s", args.output_dir)
    # Save a trained model, configuration and tokenizer using `save_pretrained()`.
    # They can then be reloaded using `from_pretrained()`
    model_to_save = (
        model.module if hasattr(model, "module") else model
    )  # Take care of distributed/parallel training
    torch.save(model_to_save.state_dict(), os.path.join(args.output_dir, WEIGHTS_NAME))
    tokenizer.save_pretrained(args.output_dir)

    # Good practice: save your training arguments together with the trained model
    torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

    # Load a trained model and vocabulary that you have fine-tuned
    model = MMBTForClassification(config, transformer, img_encoder)
    model.load_state_dict(torch.load(os.path.join(args.output_dir, WEIGHTS_NAME)))
    tokenizer = tokenizer_class.from_pretrained(args.output_dir)
    model.to(args.device)

1.5.3 Evaluation

This should look familiar: it uses the evaluate function from before, except that here we train first and evaluate afterwards, whereas earlier we evaluated during training.

# Evaluation
results = {}
if args.do_eval and args.local_rank in [-1, 0]:
    tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
    checkpoints = [args.output_dir]
    if args.eval_all_checkpoints:
        checkpoints = list(
            os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
        )
        logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
    logger.info("Evaluate the following checkpoints: %s", checkpoints)
    for checkpoint in checkpoints:
        global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
        prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
        model = MMBTForClassification(config, transformer, img_encoder)
        model.load_state_dict(torch.load(checkpoint))
        model.to(args.device)
        result = evaluate(args, model, tokenizer, criterion, prefix=prefix)
        result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
        results.update(result)

2. Building the Model

Using an existing transformers model completely unchanged is rarely ideal; we usually adapt it to our own task. So how do you write a model for your own task? This section gives the answer.

2.1 Model Configuration

Alongside the model itself we write its configuration, which holds the model-specific hyperparameters.

class MMBTConfig(object):
    """Configuration class to store the configuration of a `MMBT Model`.

    Args:
        config (:obj:`~transformers.PreTrainedConfig`):
            Config of the underlying Transformer models. Its values are
            copied over to use a single config.
        num_labels (:obj:`int` or :obj:`None`, optional, defaults to `None`):
            Size of final Linear layer for classification.
        modal_hidden_size (:obj:`int`, optional, defaults to 2048):
            Embedding dimension of the non-text modality encoder.
    """

    def __init__(self, config, num_labels=None, modal_hidden_size=2048):
        self.__dict__ = config.__dict__
        self.modal_hidden_size = modal_hidden_size
        if num_labels:
            self.num_labels = num_labels
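
For instance, wrapping an existing transformers config might look like this (the model name and label count are illustrative):

from transformers import AlbertConfig

transformer_config = AlbertConfig.from_pretrained("albert-base-v2")
config = MMBTConfig(transformer_config, num_labels=15, modal_hidden_size=2048)
print(config.num_labels, config.modal_hidden_size)  # task-specific fields grafted onto the shared config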

2.2 Model Construction

This example has three model layers. From outermost to innermost they are MMBTForClassification, MMBTModel, and ModalEmbeddings. We start at the outside and work our way in.

2.2.1 The Outermost Layer

The outermost layer adapts the model for classification, so it mainly adds dropout and a classification head. The part to focus on is forward, which does four things:

  1. Run the MMBT model to get its outputs
  2. Pass the pooled output through dropout and the classifier to produce the final logits
  3. Compute the loss internally if labels are given
  4. Assemble the final output tuple and return it

class MMBTForClassification(nn.Module):

    def __init__(self, config, transformer, encoder):
        super().__init__()
        self.num_labels = config.num_labels

        self.mmbt = MMBTModel(config, transformer, encoder)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(
        self,
        input_modal,
        input_ids=None,
        modal_start_tokens=None,
        modal_end_tokens=None,
        attention_mask=None,
        token_type_ids=None,
        modal_token_type_ids=None,
        position_ids=None,
        modal_position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
    ):

        outputs = self.mmbt(
            input_modal=input_modal,
            input_ids=input_ids,
            modal_start_tokens=modal_start_tokens,
            modal_end_tokens=modal_end_tokens,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            modal_token_type_ids=modal_token_type_ids,
            position_ids=position_ids,
            modal_position_ids=modal_position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )

        pooled_output = outputs[1]

        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

        if labels is not None:
            if self.num_labels == 1:
                #  We are doing regression
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)

2.2.2 The Middle Layer

This layer is very long, so we won't reproduce the code here. It is where you create your own model, and it is the core part: given the inputs, it decides how the model processes them. Its role is to bridge the two sides: it is the core module the outer classification layer builds on, and at the same time it processes the inputs and feeds them into a concrete existing model such as Albert.
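
To convey the idea without the full code, here is a heavily simplified, self-contained sketch of what such a middle layer does. This is not the real MMBTModel, just the fusion pattern it implements:

import torch
import torch.nn as nn

class MiddleLayerSketch(nn.Module):
    """Schematic only: fuse non-text (modal) embeddings with text embeddings,
    then push the joint sequence through a shared encoder."""

    def __init__(self, text_embeddings, modal_encoder, encoder):
        super().__init__()
        self.text_embeddings = text_embeddings  # e.g. transformer.embeddings
        self.modal_encoder = modal_encoder      # e.g. a ModalEmbeddings module
        self.encoder = encoder                  # e.g. transformer.encoder

    def forward(self, input_modal, input_ids):
        modal_embeds = self.modal_encoder(input_modal)   # (bsz, n_modal, hidden)
        text_embeds = self.text_embeddings(input_ids)    # (bsz, seq_len, hidden)
        joint = torch.cat([modal_embeds, text_embeds], dim=1)
        return self.encoder(joint)                       # joint sequence output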

2.2.3 The Innermost Layer

The innermost layer is the concrete encoding layer; it cannot be subdivided further and is the most basic building block of the model. By analogy with a parcel delivery: this layer is the item you bought; the middle layer above it is the shipping box that wraps your particular item into a uniform shape; and the outermost layer is the courier who delivers the box where you want it. The complexity of your model is thus determined by how many such layers you stack.

In general, the innermost layer is the most basic module, the middle layer assembles the pieces, and the outermost layer wraps everything for the downstream task and returns the classification result. Master these three layers and you can write very well-structured code.

3. Doing It Yourself: A Simple Classification Task

With the detailed walkthrough above, we now have a clear picture of the overall experimental procedure. Next, let's put what we have learned to use and build a classification model of our own; here the task is classifying relation pairs.
Since the official transformers library had no Chinese Albert, I used another codebase. It closely mirrors the official one but was written by Chinese authors, so it provides Chinese Albert models, and some of its comments are in Chinese (with the occasional typo).

3.1 Data Processing

First we create our own data processor for reading the files.

class RelationProcessor(DataProcessor):
    """Processor for the Relation data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        """See base class."""
        return [
            "Joint",
            "Sequence",
            "Progression",
            "Contrast",
            "Supplement",
            "Cause-Result",
            "Result-Cause",
            "Background",
            "Behavior-Purpose",
            "Purpose-Behavior",
            "Elaboration",
            "Summary",
            "Evaluation",
            "Statement-Illustration",
            "Illustration-Statement",
        ]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[1]
            text_b = line[2]
            label = line[-1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples
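
A quick hypothetical use of the processor (the data directory is the one passed on the command line below):

processor = RelationProcessor()
train_examples = processor.get_train_examples("./dataset/relation/")
print(len(processor.get_labels()))                  # 15 relation labels
print(train_examples[0].guid, train_examples[0].label)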

3.2 The Test Loop

The earlier example had no test code, so this is a good opportunity to write it ourselves.

def test(args, model, tokenizer, prefix=""):
    # Loop to handle MNLI double evaluation (matched, mis-matched)
    test_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
    test_outputs_dirs = (args.output_dir, args.output_dir + '-MM') if args.task_name == "mnli" else (args.output_dir,)

    results = {}
    for test_task, test_output_dir in zip(test_task_names, test_outputs_dirs):
        test_dataset = load_and_cache_examples(args, test_task, tokenizer, data_type='test')
        if not os.path.exists(test_output_dir) and args.local_rank in [-1, 0]:
            os.makedirs(test_output_dir)

        args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
        # Note that DistributedSampler samples randomly
        test_sampler = SequentialSampler(test_dataset) if args.local_rank == -1 else DistributedSampler(test_dataset)
        test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=args.eval_batch_size,
                                     collate_fn=collate_fn)

        # Test!
        logger.info("***** Running test {} *****".format(prefix))
        logger.info("  Num examples = %d", len(test_dataset))
        logger.info("  Batch size = %d", args.eval_batch_size)
        eval_loss = 0.0
        nb_eval_steps = 0
        preds = None
        out_label_ids = None
        pbar = ProgressBar(n_total=len(test_dataloader), desc="Testing")
        for step, batch in enumerate(test_dataloader):
            model.eval()
            batch = tuple(t.to(args.device) for t in batch)
            with torch.no_grad():
                inputs = {'input_ids': batch[0], 'attention_mask': batch[1], 'labels': batch[3],
                          'token_type_ids': batch[2]}
                outputs = model(**inputs)
                tmp_eval_loss, logits = outputs[:2]
                eval_loss += tmp_eval_loss.mean().item()
            nb_eval_steps += 1
            if preds is None:
                preds = logits.detach().cpu().numpy()
                out_label_ids = inputs['labels'].detach().cpu().numpy()
            else:
                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
                out_label_ids = np.append(out_label_ids, inputs['labels'].detach().cpu().numpy(), axis=0)
            pbar(step)
        print(' ')
        if 'cuda' in str(args.device):
            torch.cuda.empty_cache()
        eval_loss = eval_loss / nb_eval_steps
        if args.output_mode == "classification":
            preds = np.argmax(preds, axis=1)
        elif args.output_mode == "regression":
            preds = np.squeeze(preds)
        result = compute_metrics(test_task, preds, out_label_ids)
        results.update(result)
        logger.info("***** Test results {} *****".format(prefix))
        for key in sorted(result.keys()):
            logger.info("  %s = %s", key, str(result[key]))
        classreport = ClassReport([
            "Joint",
            "Sequence",
            "Progression",
            "Contrast",
            "Supplement",
            "Cause-Result",
            "Result-Cause",
            "Background",
            "Behavior-Purpose",
            "Purpose-Behavior",
            "Elaboration",
            "Summary",
            "Evaluation",
            "Statement-Illustration",
            "Illustration-Statement",
        ])
        classreport(preds, out_label_ids)
        logger.info("%s : %s", classreport.name(), classreport.value())

    return results

3.3 The Main Routine

We only need to add the test code to the main routine.

# Test
results = []
if args.do_predict and args.local_rank in [-1, 0]:
    tokenizer = tokenization_albert.FullTokenizer(vocab_file=args.vocab_file,
                                                  do_lower_case=args.do_lower_case,
                                                  spm_model_file=args.spm_model_file)
    checkpoints = [(0, args.output_dir)]
    if args.predict_all_checkpoints:
        checkpoints = list(
            os.path.dirname(c) for c in
            sorted(glob.glob(args.output_dir + '/**/' + WEIGHTS_NAME, recursive=True)))
        checkpoints = [(int(checkpoint.split('-')[-1]), checkpoint) for checkpoint in checkpoints if
                       checkpoint.find('checkpoint') != -1]
        checkpoints = sorted(checkpoints, key=lambda x: x[0])
    logger.info("Test the following checkpoints: %s", checkpoints)
    for _, checkpoint in checkpoints:
        global_step = checkpoint.split('-')[-1] if len(checkpoints) > 1 else ""
        prefix = checkpoint.split('/')[-1] if checkpoint.find('checkpoint') != -1 else ""

        model = AlbertForSequenceClassification.from_pretrained(checkpoint)
        model.to(args.device)
        result = test(args, model, tokenizer, prefix=prefix)
        results.extend([(k + '_{}'.format(global_step), v) for k, v in result.items()])
    output_test_file = os.path.join(args.output_dir, "checkpoint_test_results.txt")
    with open(output_test_file, "w") as writer:
        for key, value in results:
            writer.write("%s = %s\n" % (key, str(value)))

Then run the command below and everything works end to end.

CUDA_VISIBLE_DEVICES=5 python3 run_classifier_relation.py \
  --model_type=albert \
  --model_name_or_path=./albert_base_zh/pytorch_model.bin \
  --vocab_file=./albert_base_zh/vocab.txt \
  --config_name=./albert_base_zh/config.json \
  --task_name=relation \
  --do_train \
  --do_eval \
  --do_predict \
  --predict_all_checkpoints \
  --do_lower_case \
  --data_dir=./dataset/relation/ \
  --max_seq_length=512 \
  --per_gpu_train_batch_size=2 \
  --per_gpu_eval_batch_size=2 \
  --learning_rate=1e-5 \
  --num_train_epochs=5.0 \
  --logging_steps=1192 \
  --save_steps=1192 \
  --output_dir=./outputs/relation_output/ \
  --overwrite_output_dir \
  --seed=42

4. Summary

With the walkthrough above, we are now familiar with the models in transformers and how to apply them, and we have built our own example by following the template. Next, we will move up a level and build more advanced models. As the saying goes, the road ahead is long.
