1 Introduction
2 Background
2.1 Language Representation Learning
2.2 Neural Contextual Encoders
2.3 Why Pre-training?
2.4 A Brief History of PTMs for NLP
2.4.1 Pre-trained word embeddings
2.4.2 Pre-trained contextual encoders
3 Overview of PTMs
3.1 Pre-training Tasks
3.1.1 Language Modeling (LM)
3.1.2 Masked Language Modeling (MLM)
3.1.3 Permuted Language Modeling (PLM)
3.1.4 Denoising Autoencoder (DAE)
3.1.5 Contrastive Learning (CTL)
3.1.6 Others
3.2 Taxonomy of PTMs
3.3 Model Analysis
3.3.1 Non-Contextual Embeddings
3.3.2 Contextual Embeddings
4 Extensions of PTMs
4.1 Knowledge-Enriched PTMs
4.2 Multi-Modal PTMs
4.3 Model Compression
4.3.1 Model Pruning
4.3.2 Quantization
4.3.3 Parameter Sharing
4.3.4 Knowledge Distillation
4.4 Domain-Specific PTMs
4.5 Multilingual and Language-Specific PTMs
5 Adapting PTMs to Downstream Tasks
5.1 Transfer Learning
5.2 How to Transfer?
5.2.1 Choosing appropriate pre-training task, model architecture and corpus
5.2.2 Choosing appropriate layers
5.2.3 To tune or not to tune?
5.3 Fine-Tuning Strategies
6 Resources of PTMs
6.1 Open-Source Implementations
6.2 Collections of Related Resources
7 Applications
7.1 General Evaluation Benchmark
7.2 Machine Translation
7.3 Question Answering
7.4 Sentiment Analysis
7.5 Summarization
7.6 Named Entity Recognition
8 Future Directions
9 Conclusion
Invited Review. arXiv:2003.08271v1 [cs.CL] 18 Mar 2020

Pre-trained Models for Natural Language Processing: A Survey

Xipeng Qiu*, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai & Xuanjing Huang
School of Computer Science, Fudan University, Shanghai 200433, China;
Shanghai Key Laboratory of Intelligent Information Processing, Shanghai 200433, China
* Corresponding author (email: xpqiu@fudan.edu.cn)

Abstract: Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy with four perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is intended to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.

Keywords: Deep Learning, Neural Network, Natural Language Processing, Pre-trained Model, Distributed Representation, Word Embedding, Self-Supervised Learning, Language Modelling

1 Introduction

With the development of deep learning, various neural networks have been widely used to solve Natural Language Processing (NLP) tasks, such as convolutional neural networks (CNNs) [75, 80, 45], recurrent neural networks (RNNs) [160, 100], graph-based neural networks (GNNs) [146, 161, 111], and attention mechanisms [6, 171]. One of the advantages of these neural models is their ability to alleviate the feature engineering problem. Non-neural NLP methods usually rely heavily on discrete handcrafted features, while neural methods usually use low-dimensional, dense vectors (aka distributed representations) to implicitly represent the syntactic or semantic features of the language. These representations are learned in specific NLP tasks. Therefore, neural methods make it easy to develop various NLP systems.

Despite the success of neural models for NLP tasks, the performance improvement may be less significant compared to the Computer Vision (CV) field. The main reason is that current datasets for most supervised NLP tasks are rather small (except machine translation). Deep neural networks usually have a large number of parameters, which makes them overfit on these small training sets and generalize poorly in practice. Therefore, the early neural models for many NLP tasks were relatively shallow and usually consisted of only 1-3 neural layers.

Recently, substantial work has shown that pre-trained models (PTMs) on large corpora can learn universal language representations, which are beneficial for downstream NLP tasks and can avoid training a new model from scratch. With the development of computational power, the emergence of deep models (i.e., the Transformer [171]), and the constant improvement of training skills, the architecture of PTMs has advanced from shallow to deep. The first-generation PTMs aim to learn good word embeddings. Since these models themselves are no longer needed by downstream tasks, they are usually very shallow for computational efficiency, such as Skip-Gram [116] and GloVe [120]. Although these pre-trained embeddings can capture the semantic meanings of words, they are context-free and fail to capture higher-level concepts of text such as syntactic structures, semantic roles, anaphora, etc.
The second-generation PTMs focus on learning contextual word embeddings, such as CoVe [113], ELMo [122], OpenAI GPT [130] and BERT [32]. These learned encoders are still needed by downstream tasks to represent words in context. Besides, various pre-training tasks have also been proposed to learn PTMs for different purposes.

The contributions of this survey can be summarized as follows:

1. Comprehensive review. We provide a comprehensive review of PTMs for NLP, including background knowledge, model architectures, pre-training tasks, various extensions, adaption approaches, and applications. We provide detailed descriptions of representative models, make the necessary comparisons, and summarize the corresponding algorithms.

2. New taxonomy. We propose a taxonomy of PTMs for NLP, which categorizes existing PTMs from four different perspectives: 1) type of word representation; 2) architecture of PTMs; 3) type of pre-training tasks; 4) extensions for specific types of scenarios or inputs.

3. Abundant resources. We collect abundant resources on PTMs, including open-source systems, paper lists, etc.

4. Future directions. We discuss and analyze the limitations of existing PTMs. Also, we suggest possible future research directions.

The rest of the survey is organized as follows. Section 2 outlines the background concepts and commonly used notations of PTMs. Section 3 gives a brief overview of PTMs and clarifies their categorization. Section 4 describes extensions of PTMs. Section 5 discusses how to transfer the knowledge of PTMs to downstream tasks. Section 6 gives the related resources on PTMs, including open-source systems, paper lists, etc. Section 7 presents a collection of applications across various NLP tasks. Section 8 discusses the current challenges and suggests future directions. Section 9 summarizes the paper.

2 Background

2.1 Language Representation Learning

As suggested by Bengio et al. [12], a good representation should express general-purpose priors that are not task-specific but would be likely to be useful for a learning machine to solve AI tasks. When it comes to language, a good representation should capture the implicit linguistic rules and common-sense knowledge hidden in text data, such as lexical meanings, syntactic structures, semantic roles, and even pragmatics.

The core idea of distributed representation is to describe the meaning of a piece of text by low-dimensional real-valued vectors. Each dimension of the vector has no corresponding sense, while the whole vector represents a concrete concept. Figure 1 illustrates the generic neural architecture for NLP. There are two kinds of word embeddings: non-contextual and contextual embeddings. The difference between them is whether the embedding for a word changes dynamically according to the context it appears in.

Figure 1: Generic Neural Architecture for NLP.

Non-contextual Embeddings. To represent language, the first step is to map discrete language symbols into a distributed embedding space. Formally, for each word (or sub-word) x in a vocabulary V, we map it to a vector e_x \in \mathbb{R}^{D_e} with a lookup table E \in \mathbb{R}^{D_e \times |V|}, where D_e is a hyper-parameter indicating the dimension of token embeddings. These embeddings are trained on task data along with other model parameters.

There are two main limitations to this kind of embeddings. The first issue is that the embeddings are static.
The embedding for a word is always the same regardless of its context. Therefore, these non-contextual embeddings fail to model polysemous words. The second issue is the out-of-vocabulary problem. To tackle this problem, character-level word representations or sub-word representations are widely used in many NLP tasks, such as CharCNN [81], FastText [13] and Byte-Pair Encoding (BPE) [141].

Contextual Embeddings. To address the issue of polysemy and the context-dependent nature of words, we need to distinguish the semantics of words in different contexts. Given a text x_1, x_2, \cdots, x_T where each token x_t \in V is a word or sub-word, the contextual representation of x_t depends on the whole text:

[h_1, h_2, \cdots, h_T] = f_{\text{enc}}(x_1, x_2, \cdots, x_T),    (1)

where f_{\text{enc}}(\cdot) is a neural encoder, described in Section 2.2. h_t is called the contextual embedding or dynamic embedding of token x_t because of the contextual information it encodes.
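To make the distinction concrete, the following is a minimal PyTorch sketch (not from the paper; all sizes and module choices are illustrative assumptions). It contrasts a static lookup table E with a contextual encoder f_enc as in Eq. (1), here standing in for any of the encoders of Section 2.2.

```python
import torch
import torch.nn as nn

# Hypothetical toy sizes, for illustration only.
vocab_size, d_embed = 30000, 128

# Non-contextual embeddings: a static lookup table E in R^{D_e x |V|}.
# The same token id always maps to the same vector, regardless of context.
embedding = nn.Embedding(vocab_size, d_embed)

# Contextual encoder f_enc(.): a small Transformer encoder is used here
# as one possible choice among the encoders discussed in Section 2.2.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_embed, nhead=4, batch_first=True)
f_enc = nn.TransformerEncoder(encoder_layer, num_layers=2)

token_ids = torch.randint(0, vocab_size, (1, 7))  # a toy sequence x_1..x_7
e_x = embedding(token_ids)                        # non-contextual embeddings e_x
h = f_enc(e_x)                                    # contextual embeddings h_1..h_7
print(e_x.shape, h.shape)                         # both (1, 7, 128)
```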
Figure 2: Neural Contextual Encoders. (a) Convolutional model; (b) Sequential model; (c) Fully-connected graph-based model.

2.2 Neural Contextual Encoders

Most neural contextual encoders can be classified into three categories: convolutional models, sequential models, and graph-based models. Figure 2 illustrates the architectures of these models.

(1) Convolutional models. Convolutional models take the embeddings of the words in the input sentence and capture the meaning of a word by aggregating local information from its neighbors with convolution operations [80]. Convolutional models are usually easy to train and capture local contextual information well.

(2) Sequential models. Sequential models usually adopt RNNs (such as the LSTM [60] and GRU [21]) to capture the contextual representation of a word. In practice, bi-directional RNNs are used to collect information from both sides of a word, but their performance is often affected by the long-term dependency problem.

(3) Graph-based models. Different from the above models, graph-based models treat words as nodes and learn the contextual representation with a pre-defined linguistic structure between words, such as the syntactic structure [146, 161] or semantic relations [111]. Although a linguistics-aware graph structure can provide useful inductive bias, how to build a good graph structure is itself a challenging problem. Besides, the structure depends heavily on expert knowledge or external NLP tools, such as a dependency parser.

In practice, a more straightforward way is to use a fully-connected graph to model the relation between every two words and let the model learn the structure by itself. Usually, the connection weights are dynamically computed by the self-attention mechanism, which implicitly indicates the connections between words (see the sketch after this subsection). A successful implementation of this idea is the Transformer [171], which adopts the fully-connected self-attention architecture along with other useful designs, such as positional embeddings, layer normalization, and residual connections.

Analysis. Both convolutional and sequential models learn the contextual representation of a word with a locality bias and find it hard to capture long-range interactions between words. In contrast, the Transformer can directly model the dependency between every two words in a sequence, which is more powerful and better suited to modeling language. However, due to its heavy structure and weaker model bias, the Transformer usually requires a large training corpus and easily overfits on small or modestly-sized datasets [130, 49].
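The following minimal sketch (an illustration, not the paper's implementation) shows how single-head scaled dot-product self-attention dynamically computes the "fully-connected graph" over words: the T x T softmax weight matrix plays the role of learned connection strengths between every pair of tokens.

```python
import torch
import torch.nn.functional as F

def self_attention(h, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence h (T x d).
    The returned T x T weight matrix is the dynamically computed,
    fully-connected 'graph' over the words."""
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # pairwise connection strengths
    weights = F.softmax(scores, dim=-1)       # each row sums to 1
    return weights @ v, weights               # contextualized vectors, learned graph

T, d = 5, 16                                  # toy sequence length and dimension
h = torch.randn(T, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out, graph = self_attention(h, w_q, w_k, w_v)
print(out.shape, graph.shape)                 # (5, 16) and (5, 5)
```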
2.3 Why Pre-training?

With the development of deep learning, the number of model parameters has increased rapidly, and much larger datasets are needed to fully train model parameters and prevent overfitting. However, building large-scale labeled datasets is a great challenge for most NLP tasks due to the extremely expensive annotation costs, especially for syntax- and semantics-related tasks.

In contrast, large-scale unlabeled corpora are relatively easy to construct. To leverage the huge amount of unlabeled text data, we can first learn a good representation from it and then use this representation for other tasks. Recent studies have demonstrated significant performance gains on many NLP tasks with the help of representations extracted from PTMs trained on large unannotated corpora.

The advantages of pre-training can be summarized as follows:

1. Pre-training on a huge text corpus can learn universal language representations and help with downstream tasks.

2. Pre-training provides a better model initialization, which usually leads to better generalization performance and speeds up convergence on the target task.

3. Pre-training can be regarded as a kind of regularization to avoid overfitting on small data [39].

2.4 A Brief History of PTMs for NLP

Pre-training has always been an effective strategy to learn the parameters of deep neural networks, which are then fine-tuned on downstream tasks. As early as 2006, the breakthrough of deep learning came with greedy layer-wise unsupervised pre-training followed by supervised fine-tuning [58]. In CV, it
has become common practice to pre-train models on the huge ImageNet dataset and then fine-tune them on smaller data for different tasks. This is much better than random initialization because the model learns general image features that can then be used in various vision tasks. In NLP, PTMs on large corpora have also proved beneficial for downstream NLP tasks, from shallow word embeddings to deep neural models.

2.4.1 Pre-trained word embeddings

Representing words as dense vectors has a long history [56]. The "modern" word embedding was introduced in the pioneering work on the neural network language model (NNLM) [11]. Collobert et al. [24] showed that a word embedding pre-trained on unlabelled data can significantly improve many NLP tasks. To address the computational complexity, they learned word embeddings with a pairwise ranking task instead of language modeling. Their work was the first attempt to obtain generic word embeddings useful for other tasks from unlabeled data. Mikolov et al. [116] showed that deep neural networks are not needed to build good word embeddings. They proposed two shallow architectures: the Continuous Bag-of-Words (CBOW) and Skip-Gram (SG) models. Although the proposed models are simple and shallow, they can still learn effective word embeddings that capture latent syntactic and semantic similarities. Word2vec is one of the most popular implementations of these models and makes pre-trained word embeddings accessible for different NLP tasks. Besides, GloVe [120] is also a widely-used model for obtaining pre-trained word embeddings, which are computed from global word-word co-occurrence statistics over a corpus.

Although pre-trained word embeddings have been shown to be effective in NLP tasks, they are context-independent and mostly trained by shallow models. When used in a downstream task, the rest of the model still needs to be learned from scratch. During the same period, many researchers also tried to learn embeddings of paragraphs, sentences or documents, such as paragraph vectors [89], Skip-thought vectors [82], Context2Vec [114] and so on. Different from their modern successors, these sentence embedding models try to encode input sentences into a fixed-dimensional vector representation, rather than a contextual representation for each token.

2.4.2 Pre-trained contextual encoders

Since most NLP tasks are beyond the word level, it is natural to pre-train neural encoders at the sentence level or higher. The output vectors of neural encoders are also called contextual word embeddings since they represent the word semantics depending on their context.

McCann et al. [113] pre-trained a deep LSTM encoder from an attentional sequence-to-sequence model with machine translation (MT). The context vectors (CoVe) output by the pre-trained encoder can improve the performance of a wide variety of common NLP tasks. Peters et al. [122] pre-trained a 2-layer LSTM encoder with a bidirectional language model (BiLM), consisting of a forward LM and a backward LM. The contextual representations output by the pre-trained BiLM, ELMo (Embeddings from Language Models), are shown to bring large improvements on a broad range of NLP tasks. Akbik et al. [1] captured word meaning with contextual string embeddings pre-trained with a character-level LM. However, these PTMs are usually used as feature extractors that produce contextual word embeddings, which are fed into the main model for the downstream task. Their parameters are fixed, and the remaining parameters of the main model are still trained from scratch.

Ramachandran et al. [134] found that seq2seq models can be significantly improved by unsupervised pre-training.
The weights of both encoder and decoder are initialized with pre-trained weights of two language models and then fine-tuned with labeled data. ULMFiT (Universal Language Model Fine-tuning) [62] attempted to fine-tune a pre-trained LM for text classification (TC) and achieved state-of-the-art results on six widely-used TC datasets. ULMFiT consists of 3 phases: 1) pre-training the LM on general-domain data; 2) fine-tuning the LM on target data; 3) fine-tuning on the target task. ULMFiT also investigates some effective fine-tuning strategies, including discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing. Since ULMFiT, fine-tuning has become the mainstream approach to adapt PTMs for downstream tasks.

More recently, very deep PTMs have shown their powerful ability in learning universal language representations, e.g., OpenAI GPT (Generative Pre-training) [130] and BERT (Bidirectional Encoder Representation from Transformer) [32]. Besides LM, an increasing number of self-supervised tasks (see Section 3.1) have been proposed to enable PTMs to capture more knowledge from large-scale text corpora.

3 Overview of PTMs

The major differences between PTMs are the choices of contextual encoders, pre-training tasks, and purposes. We have briefly introduced the architectures of contextual encoders in Section 2.2. In this section, we focus on the description of pre-training tasks and give a taxonomy of PTMs.
3.1 Pre-training Tasks

The pre-training tasks are crucial for learning a universal representation of language. Usually, these pre-training tasks should be challenging and have substantial training data. In this section, we summarize the pre-training tasks into three categories 1): supervised learning, unsupervised learning, and self-supervised learning.

1. Supervised learning is to learn a function that maps an input to an output based on training data consisting of input-output pairs.

2. Unsupervised learning is to find intrinsic knowledge, such as clusters, densities, or latent representations, from unlabeled data.

3. Self-supervised learning (SSL) is a blend of supervised learning and unsupervised learning. The key idea of SSL is to predict any part of the input from other parts in some form. For example, the masked language model (MLM) is a self-supervised task that attempts to predict the masked words in a sentence given the remaining words.

1) Indeed, it is hard to clearly distinguish unsupervised learning from self-supervised learning. For clarity, we use "unsupervised learning" to refer to learning without human-annotated supervision labels.

In CV, many PTMs are trained on large supervised training sets like ImageNet. However, in the NLP field, the datasets of most supervised tasks are not large enough to train a good PTM. The only exception is machine translation (MT). A large-scale MT dataset, WMT 2017, consists of more than 7 million sentence pairs. Besides, MT is one of the most challenging tasks in NLP, and an encoder pre-trained on MT can benefit a variety of downstream NLP tasks. As a successful PTM, CoVe [113] is an encoder pre-trained on the MT task that improves a wide variety of common NLP tasks: sentiment analysis (SST, IMDb), question classification (TREC), entailment (SNLI), and question answering (SQuAD).

The pre-training tasks widely used in existing PTMs are listed as follows.

3.1.1 Language Modeling (LM)

The most common unsupervised task in NLP is probabilistic language modeling (LM), which is a classic probabilistic density estimation problem. Although LM is a general concept, in practice, LM often refers in particular to auto-regressive LM or unidirectional LM. Given a text sequence x_{1:T} = [x_1, x_2, \cdots, x_T], its joint probability p(x_{1:T}) can be decomposed as

p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{0:t-1}),    (2)

where x_0 is a special token indicating the beginning of the sequence. The conditional probability p(x_t \mid x_{0:t-1}) can be modeled by a probability distribution over the vocabulary given the linguistic context x_{0:t-1}. The context x_{0:t-1} is modeled by a neural encoder f_{\text{enc}}(\cdot), and the conditional probability is

p(x_t \mid x_{0:t-1}) = g_{\text{LM}}\big(f_{\text{enc}}(x_{0:t-1})\big),    (3)

where g_{\text{LM}}(\cdot) is a prediction layer. Given a huge corpus, we can train the entire network with maximum likelihood estimation (MLE).

A drawback of unidirectional LM is that the representation of each token encodes only the leftward context tokens and itself. However, better contextual representations of text should encode contextual information from both directions. An improved solution is the bidirectional LM (BiLM), which consists of two unidirectional LMs: a forward left-to-right LM and a backward right-to-left LM. For BiLM, Baevski et al. [5] proposed a two-tower model in which the forward tower operates the left-to-right LM and the backward tower operates the right-to-left LM.
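A minimal sketch of the auto-regressive LM objective of Eqs. (2)-(3): MLE training amounts to minimizing the cross-entropy between the next-token predictions and the sequence shifted by one position. This is an illustration under assumed toy sizes, with an LSTM standing in for f_enc and a linear layer for g_LM; it is not the implementation of any particular PTM.

```python
import torch
import torch.nn as nn

vocab_size, d_model, T = 1000, 64, 8                  # hypothetical toy sizes
bos_id = 0                                            # x_0: begin-of-sequence token

embed = nn.Embedding(vocab_size, d_model)
f_enc = nn.LSTM(d_model, d_model, batch_first=True)   # stands in for f_enc(.)
g_lm = nn.Linear(d_model, vocab_size)                 # prediction layer g_LM(.)

x = torch.randint(1, vocab_size, (1, T))              # target tokens x_1..x_T
bos = torch.zeros(1, 1, dtype=torch.long)             # x_0
inp = torch.cat([bos, x[:, :-1]], dim=1)              # inputs x_0..x_{T-1}

hidden, _ = f_enc(embed(inp))                         # encodes only leftward context
logits = g_lm(hidden)                                 # models p(x_t | x_{0:t-1})
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), x.reshape(-1))
print(loss.item())                                    # negative log-likelihood (MLE)
```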
3.1.2 Masked Language Modeling (MLM)

Masked language modeling (MLM) was first proposed by Taylor [165], who referred to it as a Cloze task. Devlin et al. [32] adapted this task as a novel pre-training task to overcome the drawback of the standard unidirectional LM. Loosely speaking, MLM first masks out some tokens from the input sentences and then trains the model to predict the masked tokens from the remaining tokens. However, this pre-training method creates a mismatch between the pre-training phase and the fine-tuning phase, because the mask token does not appear during fine-tuning. Empirically, to deal with this issue, Devlin et al. [32] perform masking with a special [MASK] token 80% of the time, a random token 10% of the time, and the original token 10% of the time.

Sequence-to-Sequence MLM (Seq2Seq MLM). MLM is usually treated as a classification problem: the masked sequence is fed to a neural encoder, whose output vectors are passed to a softmax classifier to predict the masked tokens. Alternatively, we can use an encoder-decoder (aka sequence-to-sequence) architecture for MLM, in which the encoder is fed a masked sequence and the decoder sequentially produces the masked tokens in an auto-regressive fashion. We refer to this kind of MLM as sequence-to-sequence MLM (Seq2Seq MLM), which is used in MASS [147] and T5 [132]. Seq2Seq MLM can benefit Seq2Seq-style downstream tasks, such as question answering, summarization and machine translation.
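A minimal sketch of the 80%/10%/10% masking rule described above, operating on whitespace tokens and ignoring sub-word and special-token details; the function and vocabulary here are illustrative assumptions, not BERT's actual preprocessing code.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mlm_prob=0.15):
    """Select ~15% of positions for prediction; of those, replace 80% with
    [MASK], 10% with a random token, and keep the original token 10% of the time."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mlm_prob:
            labels[i] = tok                    # only selected positions are predicted
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token         # 80%: [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)  # 10%: random token
            # else 10%: keep the original token
    return inputs, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"], vocab))
```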
Enhanced Masked Language Modeling (E-MLM). Concurrently, multiple studies have proposed enhanced versions of MLM to further improve on BERT. Instead of static masking, RoBERTa [105] improves BERT with dynamic masking. UniLM [35, 7] extends the mask-prediction task to three types of language modeling: unidirectional, bidirectional, and sequence-to-sequence prediction. XLM [25] performs MLM on a concatenation of parallel bilingual sentence pairs, called translation language modeling (TLM). To integrate structural information into pre-training, SpanBERT [72] replaces MLM with random contiguous-span masking and a Span Boundary Objective (SBO), which requires the model to predict masked spans from the span boundaries. Besides, StructBERT [180] introduces a span order recovery task to further incorporate language structure. Another way to enrich MLM is to incorporate external knowledge. ERNIE(Baidu) [157, 158] proposes to mask phrases and entities instead of random spans. Knowledge is also introduced by incorporating entity embeddings as inputs, as in E-BERT [128] and ERNIE(THU) [199].

3.1.3 Permuted Language Modeling (PLM)

Despite the wide use of the MLM task in pre-training, Yang et al. [194] claimed that some special tokens used in MLM pre-training, like [MASK], are absent when the model is applied to downstream tasks, leading to a gap between pre-training and fine-tuning. To overcome this issue, Permuted Language Modeling (PLM) [194] is a pre-training objective that replaces MLM. In short, PLM is a language modeling task over a random permutation of the input sequence. Given a sequence, a permutation is randomly sampled from all possible permutations. Then some of the tokens in the permuted sequence are chosen as targets, and the model is trained to predict these targets, conditioned on the remaining tokens and the natural positions of the targets. Note that this permutation does not affect the natural positions of the sequence and only defines the order of token predictions. In practice, only the last few tokens in the permuted sequence are predicted, due to slow convergence. A special two-stream self-attention is introduced for target-aware representations.

3.1.4 Denoising Autoencoder (DAE)

A denoising autoencoder (DAE) takes a partially corrupted input and aims to recover the original undistorted input. Specific to language, a sequence-to-sequence model, such as the standard Transformer, is used to reconstruct the original text. There are several ways to corrupt text [93]:

(1) Token Masking: randomly sampling tokens from the input and replacing them with [MASK] elements.

(2) Token Deletion: randomly deleting tokens from the input. Unlike token masking, the model must decide which positions have missing inputs.

(3) Text Infilling: like SpanBERT, a number of text spans are sampled and each is replaced with a single [MASK] token. Each span length is drawn from a Poisson distribution (λ = 3). The model must predict how many tokens are missing from a span.

(4) Sentence Permutation: dividing a document into sentences based on full stops and shuffling these sentences in a random order.

(5) Document Rotation: selecting a token uniformly at random and rotating the document so that it begins with that token. The model must identify the real start position of the document.
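The five corruption strategies above are simple text transformations. The following sketch illustrates them on whitespace tokens; the function names and simplifications (single span for infilling, splitting on "." for sentences) are assumptions for illustration, not the reference implementation of BART [93].

```python
import random
import numpy as np

def token_masking(tokens, p=0.15, mask="[MASK]"):
    """(1) Token Masking: replace randomly sampled tokens with [MASK]."""
    return [mask if random.random() < p else t for t in tokens]

def token_deletion(tokens, p=0.15):
    """(2) Token Deletion: drop tokens; the model must infer where inputs are missing."""
    return [t for t in tokens if random.random() > p]

def text_infilling(tokens, mask="[MASK]"):
    """(3) Text Infilling: replace one span (length ~ Poisson(3)) with a single [MASK]."""
    span = int(np.random.poisson(lam=3.0))
    start = random.randint(0, max(0, len(tokens) - span))
    return tokens[:start] + [mask] + tokens[start + span:]

def sentence_permutation(text):
    """(4) Sentence Permutation: split on full stops and shuffle the sentences."""
    sents = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sents)
    return ". ".join(sents) + "."

def document_rotation(tokens):
    """(5) Document Rotation: rotate so the document starts at a random token."""
    i = random.randrange(len(tokens))
    return tokens[i:] + tokens[:i]

doc = "the cat sat on the mat . the dog ran away".split()
print(token_masking(doc), text_infilling(doc), document_rotation(doc), sep="\n")
```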
3.1.5 Contrastive Learning (CTL)

Contrastive learning [140] assumes that some observed pairs of text are more semantically similar than randomly sampled text. A score function s(x, y) for a text pair (x, y) is learned to minimize the objective

\mathbb{E}_{x, y^+, y^-}\left[-\log \frac{\exp\big(s(x, y^+)\big)}{\exp\big(s(x, y^+)\big) + \exp\big(s(x, y^-)\big)}\right],    (4)

where (x, y^+) is a similar pair and y^- is presumably dissimilar to x. y^+ and y^- are typically called the positive and negative samples. The score function s(x, y) is often computed by a learnable neural encoder in one of two ways: s(x, y) = f_{\text{enc}}(x)^\top f_{\text{enc}}(y) or s(x, y) = f_{\text{enc}}(x \oplus y).

The idea behind CTL is "learning by comparison". Compared to LM, CTL usually has lower computational complexity and is therefore a desirable alternative training criterion for PTMs.

Collobert et al. [24] proposed a pairwise ranking task to distinguish real and fake phrases. The model must predict a higher score for a legal phrase than for an incorrect phrase obtained by replacing its central word with a random word. Mnih and Kavukcuoglu [118] trained word embeddings efficiently with Noise-Contrastive Estimation (NCE) [51], which trains a binary classifier to distinguish real and fake samples. The idea of NCE is also used in the well-known word2vec embeddings [116]. We briefly describe some recently proposed CTL tasks in the following paragraphs.

Deep InfoMax (DIM). Deep InfoMax (DIM) [59] was originally proposed for images; it improves the quality of the representation by maximizing the mutual information between an image representation and local regions of the image.
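Before turning to the specific CTL tasks, here is a minimal sketch of the generic pairwise objective of Eq. (4), written as a 2-way softmax cross-entropy for numerical stability. The dot-product scoring s(x, y) = f_enc(x)^T f_enc(y) and the random features are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(s_pos, s_neg):
    """Eq. (4): -log exp(s(x,y+)) / (exp(s(x,y+)) + exp(s(x,y-))),
    averaged over a batch of (x, y+, y-) triples."""
    logits = torch.stack([s_pos, s_neg], dim=-1)              # (batch, 2)
    target = torch.zeros(logits.shape[0], dtype=torch.long)   # the positive is class 0
    return F.cross_entropy(logits, target)

# Toy encoder outputs for a batch of 4 triples; s(x, y) = f_enc(x)^T f_enc(y).
fx, fyp, fyn = torch.randn(4, 32), torch.randn(4, 32), torch.randn(4, 32)
loss = contrastive_loss((fx * fyp).sum(-1), (fx * fyn).sum(-1))
print(loss.item())
```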
Kong et al. [83] applied DIM to language representation learning. The global representation of a sequence x is defined as the hidden state of the first token (assumed to be a special start-of-sentence symbol) output by the contextual encoder f_{\text{enc}}(x). The objective of DIM is to assign a higher score to f_{\text{enc}}(x_{i:j})^\top f_{\text{enc}}(\hat{x}_{i:j}) than to f_{\text{enc}}(\tilde{x}_{i:j})^\top f_{\text{enc}}(\hat{x}_{i:j}), where x_{i:j} denotes an n-gram 2) span from i to j in x, \hat{x}_{i:j} denotes a sentence masked at positions i to j, and \tilde{x}_{i:j} denotes a randomly-sampled negative n-gram from the corpus.

2) n is drawn from a Gaussian distribution N(5, 1) clipped at 1 (minimum length) and 10 (maximum length).

Replaced Token Detection (RTD). Replaced Token Detection (RTD) is the same as NCE but predicts whether a token has been replaced given its surrounding context. CBOW with negative sampling (CBOW-NS) [116] can be viewed as a simple version of RTD, in which the negative samples are randomly drawn from the vocabulary with a simple proposal distribution. ELECTRA [22] improves RTD by utilizing a generator to replace some tokens of a sequence. A generator G and a discriminator D are trained following a two-stage procedure: (1) train only the generator with the MLM task for n1 steps; (2) initialize the weights of the discriminator with the weights of the generator, then train the discriminator on a discriminative task for n2 steps while keeping G frozen. Here the discriminative task is to judge whether an input token has been replaced by G or not. The generator is discarded after pre-training, and only the discriminator is fine-tuned on downstream tasks. RTD is also an alternative solution to the mismatch problem whereby the network sees [MASK] tokens during pre-training but not during fine-tuning on downstream tasks. Similarly, WKLM [188] replaces words at the entity level instead of the token level. Concretely, WKLM replaces entity mentions with names of other entities of the same type and trains the model to distinguish whether the entity has been replaced.

Next Sentence Prediction (NSP). Punctuation marks are natural separators of text data, so it is reasonable to construct pre-training tasks that exploit them. Next Sentence Prediction (NSP) [32] is a good example. As its name suggests, NSP trains the model to distinguish whether two input sentences are continuous segments from the training corpus. Specifically, when choosing the sentence pair for each pre-training example, 50% of the time the second sentence is the actual next sentence of the first one, and 50% of the time it is a random sentence from the corpus. This teaches the model to understand the relationship between two input sentences, which benefits downstream tasks that are sensitive to this information, such as Question Answering and Natural Language Inference.

However, the necessity of the NSP task has been questioned by subsequent work [72, 194, 105, 86]. Yang et al. [194] found the impact of the NSP task unreliable, while Joshi et al. [72] found that single-sentence training without the NSP loss is superior to sentence-pair training with the NSP loss. Moreover, Liu et al. [105] conducted further analysis of the NSP task, showing that when training with blocks of text from a single document, removing the NSP loss matches or slightly improves performance on downstream tasks.
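The 50%/50% pair construction described above is straightforward to implement. The following is a minimal sketch of building one NSP training example; the helper name and label convention are illustrative assumptions, not BERT's actual data pipeline.

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences):
    """Build one NSP pair: 50% of the time the true next sentence (label 1, IsNext),
    50% of the time a random sentence from the corpus (label 0, NotNext)."""
    i = random.randrange(len(doc_sentences) - 1)
    first = doc_sentences[i]
    if random.random() < 0.5:
        return first, doc_sentences[i + 1], 1       # actual next sentence
    return first, random.choice(corpus_sentences), 0  # random sentence

doc = ["PTMs learn universal representations.",
       "They are fine-tuned on downstream tasks.",
       "This avoids training from scratch."]
corpus = ["The weather is nice today.", "Convolution captures local context."]
print(make_nsp_example(doc, corpus))
```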
Sentence Order Prediction (SOP). To better model inter-sentence coherence, ALBERT [86] replaces the NSP loss with a sentence order prediction (SOP) loss. As conjectured by Lan et al. [86], NSP conflates topic prediction and coherence prediction in a single task. Topic prediction, however, is easier to learn than coherence prediction, which allows the model to make predictions by relying merely on topic learning. Different from NSP, SOP uses two consecutive segments from the same document as positive examples, and the same two segments with their order swapped as negative examples. As a result, ALBERT consistently outperforms BERT on various downstream tasks. StructBERT [180] and BERTje [29] also adopt SOP as their self-supervised learning task.

3.1.6 Others

Apart from the above tasks, there are many other pre-training tasks designed for specific purposes, such as sentiment label-aware MLM for sentiment analysis [78], gap sentence generation (GSG) for text summarization [197], disfluency detection [179] and so on. In addition, some auxiliary pre-training tasks are designed to incorporate factual knowledge, such as denoising entity auto-encoding (dEA) in ERNIE(THU) [199] and entity linking (EL) in KnowBERT [123]. Furthermore, several tasks are introduced to obtain multi-modal pre-trained models. Typically, tasks like visual-based MLM, masked visual-feature modeling and visual-linguistic matching are widely used in multi-modal pre-training, such as VideoBERT [152], VisualBERT [95], ViLBERT [107] and so on.

3.2 Taxonomy of PTMs

To clarify the relations among existing PTMs for NLP, we build a taxonomy of PTMs, which categorizes existing PTMs from four different perspectives: (1) the type of word representation used by PTMs, (2) the backbone network used by PTMs, (3) the type of pre-training tasks used by PTMs, and (4) the PTMs designed for specific types of scenarios or inputs. Figure 3 shows the taxonomy with representative examples.
Figure 3: Taxonomy of PTMs with Representative Examples.

Contextual?
  - Static: CBOW, Skip-Gram [116], GloVe [120]
  - Contextual: ELMo [122], GPT [130], BERT [32]

Architectures
  - LSTM: ELMo [122], CoVe [113]
  - Transformer Enc.: BERT [32], SpanBERT [72], XLNet [194], RoBERTa [105]
  - Transformer Dec.: GPT [130], GPT-2 [131]
  - Transformer: MASS [147], BART [93]

Task Types
  - Supervised: MT: CoVe [113]
  - Unsupervised/Self-Supervised:
      LM: ELMo [122], GPT [130], GPT-2 [131], UniLM [35]
      MLM: BERT [32], SpanBERT [72], XLM [25], MASS [147], T5 [132]
      PLM: XLNet [194]
      DAE: BART [93]
      CTL: RTD: CBOW-NS [116], ELECTRA [22]; NSP: BERT [32], UniLM [35]; SOP: ALBERT [86], StructBERT [180]

Extensions
  - Knowledge-Enriched: ERNIE(THU) [199], KnowBERT [123], K-BERT [101], KEPLER [182], WKLM [188]
  - Multi-Lingual: Multilingual BERT, MASS [147], Unicoder [63], XLM [25], MultiFit [38]
  - Language-Specific: BERT-wwm [26], ZEN [33], NEZHA [184], ERNIE(Baidu) [157], BERTje [29], CamemBERT [112], FlauBERT [88], RobBERT [31]
  - Multi-Modal: Image: ViLBERT [107], LXMERT [162], VisualBERT [95], B2T2 [2], VL-BERT [150]; Video: VideoBERT [152], CBT [151]; Speech: SpeechBERT [20]
  - Domain-Specific: BioBERT [91], SciBERT [10], PatentBERT [90]
  - Model Compression: Pruning: CompressingBERT [47]; Quantization: Q-BERT [143], Q8BERT [196]; Parameter Sharing: ALBERT [86]; Distillation: DistilBERT [139], TinyBERT [70], MiniLM [181]