A Survey on Contextual Embeddings

Qi Liu‡, Matt J. Kusner†∗, Phil Blunsom‡⋄
‡University of Oxford  ⋄DeepMind  †University College London  ∗The Alan Turing Institute
‡{firstname.lastname}@cs.ox.ac.uk  †m.kusner@ucl.ac.uk

Abstract

Contextual embeddings, such as ELMo and BERT, move beyond global word representations like Word2Vec and achieve ground-breaking performance on a wide range of natural language processing tasks. Contextual embeddings assign each word a representation based on its context, thereby capturing uses of words across varied contexts and encoding knowledge that transfers across languages. In this survey, we review existing contextual embedding models, cross-lingual polyglot pre-training, the application of contextual embeddings in downstream tasks, model compression, and model analyses.

1 Introduction

Distributional word representations (Turian et al., 2010; Mikolov et al., 2013; Pennington et al., 2014) trained in an unsupervised manner on large-scale corpora are widely used in modern natural language processing systems. However, these approaches obtain only a single global representation for each word, ignoring its context. Different from traditional word representations, contextual embeddings move beyond word-level semantics in that each token is associated with a representation that is a function of the entire input sequence. These context-dependent representations can capture many syntactic and semantic properties of words under diverse linguistic contexts. Previous work (Peters et al., 2018; Devlin et al., 2018; Yang et al., 2019; Raffel et al., 2019) has shown that contextual embeddings pre-trained on large-scale unlabelled corpora achieve state-of-the-art performance on a wide range of natural language processing tasks, such as text classification, question answering and text summarization. Further analyses (Liu et al., 2019a; Hewitt and Liang, 2019; Hewitt and Manning, 2019; Tenney et al., 2019a) demonstrate that contextual embeddings are capable of learning useful and transferable representations across languages.

The rest of the survey is organized as follows. In Section 2, we define the concept of contextual embeddings. In Section 3, we introduce existing methods for obtaining contextual embeddings. In Section 4, we present the pre-training methods of contextual embeddings on multi-lingual corpora. In Section 5, we describe methods for applying pre-trained contextual embeddings in downstream tasks. In Section 6, we detail model compression methods. In Section 7, we survey analyses that have aimed to identify the linguistic knowledge learned by contextual embeddings. We conclude the survey by highlighting some challenges for future research in Section 8.

2 Token Embeddings

Consider a text corpus that is represented as a sequence $S$ of tokens, $(t_1, t_2, \ldots, t_N)$. Distributed representations of words (Harris, 1954; Bengio et al., 2003) associate each token $t_i$ with a dense feature vector $h_{t_i}$. Traditional word embedding techniques aim to learn a global word embedding matrix $E \in \mathbb{R}^{V \times d}$, where $V$ is the vocabulary size and $d$ is the number of dimensions. Specifically, each row $e_i$ of $E$ corresponds to the global embedding of word type $i$ in the vocabulary $V$. Well-known models for learning word embeddings include Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014).
On the other hand, methods that learn contextual embeddings associate each token $t_i$ with a representation that is a function of the entire input sequence $S$, i.e. $h_{t_i} = f(e_{t_1}, e_{t_2}, \ldots, e_{t_N})$, where each input token $t_j$ is usually mapped to its non-contextualized representation $e_{t_j}$ first, before applying an aggregation function $f$. These context-dependent representations are better suited to capture sequence-level semantics (e.g. polysemy) than non-contextual word embeddings.
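To make the distinction concrete, the following minimal sketch (assuming PyTorch; the toy dimensions and the BiLSTM choice of $f$ are illustrative, not prescribed by the survey) contrasts a global embedding lookup with a contextual encoder:

```python
import torch
import torch.nn as nn

V, d = 10000, 128                        # toy vocabulary size and embedding dimension
tokens = torch.tensor([[3, 17, 52, 9]])  # a toy batch with one token-id sequence

# Global embeddings: one vector per word type, independent of context.
E = nn.Embedding(V, d)
e = E(tokens)                            # (1, 4, d); e[:, i] depends only on the token id

# Contextual embeddings: h_{t_i} = f(e_{t_1}, ..., e_{t_N}); here f is a BiLSTM.
f = nn.LSTM(input_size=d, hidden_size=d, bidirectional=True, batch_first=True)
h, _ = f(e)                              # (1, 4, 2d); h[:, i] depends on the whole sequence
```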
There are many model architectures for $f$, which we review here. We begin by describing pre-training methods for learning contextual embeddings that can be used in downstream tasks.

3 Pre-training Methods for Contextual Embeddings

In large part, pre-training methods for contextual embeddings can be divided into unsupervised methods (e.g. language modelling and its variants) and supervised methods (e.g. machine translation and natural language inference).

3.1 Unsupervised Pre-training via Language Modeling

The prototypical way to learn distributed token embeddings is via language modelling. A language model is a probability distribution over a sequence of tokens. Given a sequence of $N$ tokens, $(t_1, t_2, \ldots, t_N)$, a language model factorizes the probability of the sequence as:

$$p(t_1, t_2, \ldots, t_N) = \prod_{i=1}^{N} p(t_i \mid t_1, t_2, \ldots, t_{i-1}). \quad (1)$$

Language modelling uses maximum likelihood estimation (MLE), often penalized with regularization terms, to estimate model parameters. A left-to-right language model takes the left context, $t_1, t_2, \ldots, t_{i-1}$, of $t_i$ into account for estimating the conditional probability. Language models are usually trained on large-scale unlabelled corpora. The conditional probabilities are most commonly learned using neural networks (Bengio et al., 2003), and the learned representations have been proven to be transferable to downstream natural language understanding tasks (Dai and Le, 2015; Ramachandran et al., 2016).

Precursor Models. Dai and Le (2015) is the first work we are aware of that uses language modelling together with a sequence autoencoder to improve sequence learning with recurrent networks. It can therefore be thought of as a precursor to modern contextual embedding methods. Pre-trained on the datasets IMDB, Rotten Tomatoes, 20 Newsgroups, and DBpedia, the model is then fine-tuned on sentiment analysis and text classification tasks, achieving strong performance compared to randomly-initialized models.

Ramachandran et al. (2016) extend Dai and Le (2015) by proposing a pre-training method to improve the accuracy of sequence-to-sequence (seq2seq) models. The encoder and decoder of the seq2seq model are initialized with the pre-trained weights of two language models. These language models are separately trained on the News Crawl English or German corpora for machine translation, while both are initialized with the language model trained on the English Gigaword corpus for abstractive summarization. These pre-trained models are fine-tuned on the WMT English→German task and the CNN/Daily Mail corpus, respectively, achieving better results over baselines without pre-training.

The work in the following sections improves over Dai and Le (2015) and Ramachandran et al. (2016) with new architectures (e.g. the Transformer), larger datasets, and new pre-training objectives. A summary of the models and the pre-training objectives is shown in Tables 1 and 2.
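As a concrete illustration of the factorization in Equation (1), the sketch below (assuming PyTorch; the stand-in model and hyperparameters are placeholders, not from the survey) computes the left-to-right MLE objective as a per-token cross-entropy with a one-position shift between inputs and targets:

```python
import torch
import torch.nn as nn

V, d = 10000, 256
tokens = torch.randint(0, V, (8, 33))     # toy batch: 8 sequences of 33 token ids

# A stand-in autoregressive model: embedding -> unidirectional LSTM -> vocabulary logits.
embed = nn.Embedding(V, d)
lstm = nn.LSTM(d, d, batch_first=True)
head = nn.Linear(d, V)

inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict t_i from t_1, ..., t_{i-1}
hidden, _ = lstm(embed(inputs))
logits = head(hidden)                             # (8, 32, V)

# Negative log-likelihood of Equation (1), averaged over all positions.
loss = nn.functional.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))
```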
ELMo. The ELMo model (Peters et al., 2018) generalizes traditional word embeddings by extracting context-dependent representations from a bidirectional language model. A forward $L$-layer LSTM and a backward $L$-layer LSTM are applied to encode the left and right contexts, respectively. At each layer $j$, the contextualized representations are the concatenation of the left-to-right and right-to-left representations, obtaining $N$ hidden representations, $(h_{1,j}, h_{2,j}, \ldots, h_{N,j})$, for a sequence of length $N$.

To use ELMo in downstream tasks, the $(L+1)$-layer representations (including the global word embedding) for each token $k$ are aggregated as:

$$\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}, \quad (2)$$

where $s^{task}$ are softmax-normalized layer-wise weights used to linearly combine the $(L+1)$-layer representations of token $k$, and $\gamma^{task}$ is a task-specific constant.

Given a pre-trained ELMo, it is straightforward to incorporate it into a task-specific architecture for improving performance. As most supervised models use global word representations $x_k$ in their lowest layers, these representations can be concatenated with their corresponding context-dependent representations $\mathrm{ELMo}_k^{task}$, obtaining $[x_k; \mathrm{ELMo}_k^{task}]$, before feeding them to higher layers. The effectiveness of ELMo is evaluated on six NLP problems, including question answering, textual entailment and sentiment analysis.
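A minimal sketch of the aggregation in Equation (2) (assuming PyTorch; the layer count and dimensions are illustrative):

```python
import torch
import torch.nn as nn

L, N, d = 2, 7, 1024                     # toy layer count, sequence length, hidden size
layers = torch.randn(L + 1, N, d)        # h_{k,j}: word-embedding layer plus L BiLM layers

class ScalarMix(nn.Module):
    """Task-specific weighted combination of ELMo layers, as in Equation (2)."""
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))    # unnormalized layer weights
        self.gamma = nn.Parameter(torch.ones(1))          # task-specific scale

    def forward(self, layer_reps):
        weights = torch.softmax(self.s, dim=0)            # normalize the layer weights
        mixed = (weights[:, None, None] * layer_reps).sum(dim=0)
        return self.gamma * mixed                         # (N, d) per-token embeddings

elmo_task = ScalarMix(L + 1)(layers)
```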
| Method  | Architecture | Encoder | Decoder | Objective | Dataset |
|---------|--------------|---------|---------|-----------|---------|
| ELMo    | LSTM         | ✗ | ✓ | LM | 1B Word Benchmark |
| GPT     | Transformer  | ✗ | ✓ | LM | BookCorpus |
| GPT2    | Transformer  | ✗ | ✓ | LM | Web pages starting from Reddit |
| BERT    | Transformer  | ✓ | ✗ | MLM & NSP | BookCorpus & Wiki |
| RoBERTa | Transformer  | ✓ | ✗ | MLM | BookCorpus, Wiki, CC-News, OpenWebText, Stories |
| ALBERT  | Transformer  | ✓ | ✗ | MLM & SOP | Same as RoBERTa and XLNet |
| UniLM   | Transformer  | ✓ | ✗ | LM, MLM, seq2seq LM | Same as BERT |
| ELECTRA | Transformer  | ✓ | ✗ | Discriminator (o/r) | Same as XLNet |
| XLNet   | Transformer  | ✗ | ✓ | PLM | BookCorpus, Wiki, Giga5, ClueWeb, Common Crawl |
| XLM     | Transformer  | ✓ | ✓ | CLM, MLM, TLM | Wiki, parallel corpora (e.g. MultiUN) |
| MASS    | Transformer  | ✓ | ✓ | Span Mask | WMT News Crawl |
| T5      | Transformer  | ✓ | ✓ | Text Infilling | Colossal Clean Crawled Corpus |
| BART    | Transformer  | ✓ | ✓ | Text Infilling & Sent Shuffling | Same as RoBERTa |

Table 1: A comparison of popular pre-trained models.

| Objective | Inputs | Targets |
|-----------|--------|---------|
| LM | [START] | I am happy to join with you today |
| MLM | I am [MASK] to join with you [MASK] | happy today |
| NSP | Sent1 [SEP] Next Sent or Sent1 [SEP] Random Sent | Next Sent/Random Sent |
| SOP | Sent1 [SEP] Sent2 or Sent2 [SEP] Sent1 | in order/reversed |
| Discriminator (o/r) | I am thrilled to study with you today | o o r o r o o o |
| PLM | happy join with | today am I to you |
| seq2seq LM | I am happy to | join with you today |
| Span Mask | I am [MASK] [MASK] [MASK] with you today | happy to join |
| Text Infilling | I am [MASK] with you today | happy to join |
| Sent Shuffling | today you am I join with happy to | I am happy to join with you today |
| TLM | How [MASK] you [SEP] [MASK] vas-tu | are Comment |

Table 2: Pre-training objectives and their input-output formats.

GPT, GPT2, and Grover. GPT (Radford et al., 2018) adopts a two-stage learning paradigm: (a) unsupervised pre-training using a language modelling objective and (b) supervised fine-tuning. The goal is to learn universal representations transferable to a wide range of downstream tasks. To this end, GPT uses the BookCorpus dataset (Zhu et al., 2015), which contains more than 7,000 books from various genres, for training the language model. The Transformer architecture (Vaswani et al., 2017) is used to implement the language model, which has been shown to better capture global dependencies from the inputs compared to its alternatives, e.g. recurrent networks, and to perform strongly on a range of sequence learning tasks, such as machine translation (Vaswani et al., 2017) and document generation (Liu et al., 2018). To use GPT on inputs with multiple sequences during fine-tuning, GPT applies task-specific input adaptations motivated by traversal-style approaches (Rocktäschel et al., 2015). These approaches pre-process each text input as a single contiguous sequence of tokens through special tokens including [START] (the start of a sequence), [DELIM] (delimiting two sequences from the text input) and [EXTRACT] (the end of a sequence). GPT outperforms task-specific architectures in 9 out of 12 tasks studied with a pre-trained Transformer.
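A minimal sketch of this traversal-style input adaptation for a two-sequence task such as entailment (plain Python; the helper is illustrative, and the exact spacing around the special tokens is an assumption):

```python
def format_pair(premise: str, hypothesis: str) -> str:
    """Pack a sequence pair into one contiguous input using GPT-style special tokens."""
    return f"[START] {premise} [DELIM] {hypothesis} [EXTRACT]"

print(format_pair("A man is playing a guitar.", "A person is making music."))
# [START] A man is playing a guitar. [DELIM] A person is making music. [EXTRACT]
```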
GPT2 (Radford et al., 2019) mainly follows the architecture of GPT and trains a language model on a dataset as large and diverse as possible, in order to learn from varied domains and contexts. To do so, Radford et al. (2019) create a new dataset of millions of web pages named WebText, by scraping outbound links from Reddit. The authors argue that a language model trained on large-scale unlabelled corpora begins to learn some common supervised NLP tasks, such as question answering, machine translation and summarization, without any explicit supervision signal. To validate this, GPT2 is tested on ten datasets (e.g. Children's Book Test (Hill et al., 2015), LAMBADA (Paperno et al., 2016) and CoQA (Reddy et al., 2019)) in a zero-shot setting.
GPT2 performs strongly on some tasks. For instance, when conditioned on a document and questions, GPT2 reaches an F1-score of 55 on the CoQA dataset without using any labelled training data. This matches or outperforms the performance of 3 out of 4 baseline systems. As GPT2 divides texts into bytes and uses BPE (Sennrich et al., 2016) to build up its vocabulary (instead of using characters or words, as in previous work), it is unclear whether the improved performance comes from the model or the new input representation.

Grover (Zellers et al., 2019) creates a news dataset, RealNews, from Common Crawl and pre-trains a language model for generating realistic-looking fake news conditioned on metadata including domains, dates, authors and headlines. They further study discriminators that can be used to detect fake news. The best defense against Grover turns out to be Grover itself, which sheds light on the importance of releasing trained models for detecting fake news.

BERT. ELMo (Peters et al., 2018) concatenates representations from the forward and backward LSTMs without considering the interactions between the left and right contexts. GPT (Radford et al., 2018) and GPT2 (Radford et al., 2019) use a left-to-right decoder, where every token can only attend to its left context. These architectures are sub-optimal for sentence-level tasks, e.g. named entity recognition and sentiment analysis, as it is crucial to incorporate contexts from both directions.

BERT proposes a masked language modelling (MLM) objective, where some of the tokens of an input sequence are randomly masked, and the objective is to predict these masked positions taking the corrupted sequence as input. BERT applies a Transformer encoder to attend to bi-directional contexts during pre-training. In addition, BERT uses a next-sentence-prediction (NSP) objective. Given two input sentences, NSP predicts whether the second sentence is the actual next sentence of the first sentence. The NSP objective aims to improve performance on tasks, such as question answering and natural language inference, that require reasoning over sentence pairs.

Similar to GPT, BERT uses special tokens to obtain a single contiguous sequence for each input. Specifically, the first token is always a special classification token [CLS], and sentence pairs are separated using a special token [SEP]. BERT adopts a pre-training followed by fine-tuning scheme. The final hidden state of [CLS] is used for sentence-level tasks and the final hidden state of each token is used for token-level tasks. BERT obtains new state-of-the-art results on eleven natural language processing tasks, e.g. improving the GLUE (Wang et al., 2018) score to 80.5%.

Similar to GPT2, it is unclear exactly why BERT improves over prior work, as it uses different objectives, datasets (Wikipedia and BookCorpus) and architectures compared to previous methods. For partial insight on this, we refer the readers to Raffel et al. (2019) for a controlled comparison between unidirectional and bidirectional models, and between traditional language modelling and masked language modelling, using the same datasets.
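A minimal sketch of MLM-style input corruption (assuming PyTorch; the 15% masking rate and the single [MASK] replacement strategy are simplifications for illustration, not the exact BERT recipe):

```python
import torch

V, MASK_ID = 30000, 103                      # toy vocabulary size and [MASK] token id
tokens = torch.randint(1000, V, (4, 16))     # toy batch of token ids

mask = torch.rand(tokens.shape) < 0.15       # choose roughly 15% of positions to corrupt
inputs = tokens.clone()
inputs[mask] = MASK_ID                       # replace the chosen positions with [MASK]

labels = tokens.clone()
labels[~mask] = -100                         # ignore unmasked positions in the loss
# `inputs` is fed to a Transformer encoder; the MLM loss is a cross-entropy over the
# vocabulary, computed only at the masked positions (labels != -100).
```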
BERT variants. Recent work further studies and improves the objective and architecture of BERT. Instead of randomly masking tokens, ERNIE (Zhang et al., 2019) incorporates knowledge masking strategies, including entity-level masking and phrase-level masking. SpanBERT (Joshi et al., 2019) generalizes this idea to mask random spans, without referring to external knowledge. StructBERT (Wang et al., 2019b) proposes a word structural objective that randomly permutes the order of 3-grams for reconstruction, and a sentence structural objective that predicts the order of two consecutive segments.

RoBERTa (Liu et al., 2019c) makes a few changes to the released BERT model and achieves substantial improvements. The changes include: (1) training the model longer with larger batches and more data; (2) removing the NSP objective; (3) training on longer sequences; and (4) dynamically changing the masked positions during pre-training.

ALBERT (Lan et al., 2019) proposes two parameter-reduction techniques (factorized embedding parameterization and cross-layer parameter sharing) to lower memory consumption and speed up training. Furthermore, ALBERT argues that the NSP objective lacks difficulty: as the negative examples are created by pairing segments from different documents, NSP mixes topic prediction and coherence prediction into a single task. ALBERT instead uses a sentence-order prediction (SOP) objective. SOP obtains positive examples by taking two consecutive segments, and negative examples by reversing the order of two consecutive segments from the same document.
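A minimal sketch of how SOP training pairs can be constructed from consecutive segments of one document (plain Python; the helper is illustrative):

```python
import random

def sop_example(segments, i):
    """Build a sentence-order prediction example from segments i and i+1 of a document."""
    first, second = segments[i], segments[i + 1]
    if random.random() < 0.5:
        return (first, second), 1    # positive: original order
    return (second, first), 0        # negative: the two segments are swapped

doc = ["Alice opened the door.", "She stepped into the garden.", "The roses were in bloom."]
pair, label = sop_example(doc, 0)
```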
XLNet. The XLNet model (Yang et al., 2019) identifies two weaknesses of BERT:

1. BERT assumes conditional independence of the corrupted tokens. For instance, to model the probability $p(t_2 = \text{cat}, t_6 = \text{mat} \mid t_1 = \text{The}, t_2 = \text{[MASK]}, t_3 = \text{sat}, t_4 = \text{on}, t_5 = \text{the}, t_6 = \text{[MASK]})$, BERT factorizes it as $p(t_2 = \text{cat} \mid \ldots)\,p(t_6 = \text{mat} \mid \ldots)$, where $t_2$ and $t_6$ are assumed to be conditionally independent.

2. Symbols such as [MASK] are introduced by BERT during pre-training, yet they never occur in real data, resulting in a discrepancy between pre-training and fine-tuning.

XLNet proposes a new auto-regressive method based on permutation language modelling (PLM) (Uria et al., 2016) without introducing any new symbols. Its MLE objective is:

$$\max_{\theta} \; \mathbb{E}_{z \in \mathcal{Z}_N} \left[ \sum_{j=1}^{N} \log p_{\theta}(t_{z_j} \mid t_{z_1}, t_{z_2}, \ldots, t_{z_{j-1}}) \right]. \quad (3)$$

For each sequence, XLNet samples a permutation order $z = [z_1, z_2, \ldots, z_N]$ from the set of all permutations $\mathcal{Z}_N$, where $|\mathcal{Z}_N| = N!$. The probability of the sequence is factorized according to $z$, where the $z_j$-th token $t_{z_j}$ is conditioned on all the previous tokens $t_{z_1}, t_{z_2}, \ldots, t_{z_{j-1}}$ according to the permutation order $z$.

XLNet further adopts two-stream self-attention and Transformer-XL (Dai et al., 2019) to take into account the target positions $z_j$ and to learn long-range dependencies, respectively. As the cardinality of $\mathcal{Z}_N$ is factorial, naive optimization would be challenging. Thus, XLNet conditions on part of the input and generates only the rest of the input, reducing the scale of the search space:

$$\max_{\theta} \; \mathbb{E}_{z \in \mathcal{Z}_N} \left[ \sum_{j=c+1}^{N} \log p_{\theta}(t_{z_j} \mid t_{z_1}, t_{z_2}, \ldots, t_{z_{j-1}}) \right], \quad (4)$$

where $c$ is the cutting point of the sequence. However, it is tricky to compare XLNet directly with BERT due to the multiple changes in loss and architecture.¹

¹We note that RoBERTa, which makes much smaller changes to BERT, is able to outperform XLNet. Future study needs to be done to understand the precise advantages of XLNet's modifications to BERT.
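A minimal sketch of the permutation objective in Equations (3)-(4) (assuming PyTorch; the bag-of-context predictor is a purely illustrative stand-in for XLNet's two-stream self-attention):

```python
import torch
import torch.nn as nn

V, d, N = 1000, 64, 8
tokens = torch.randint(0, V, (N,))          # one toy sequence t_1, ..., t_N

embed = nn.Embedding(V, d)
head = nn.Linear(d, V)

z = torch.randperm(N)                       # sample a permutation order z
c = N // 2                                  # cutting point: predict only t_{z_{c+1}}..t_{z_N}

loss = 0.0
for j in range(c, N):
    context = tokens[z[:j]]                 # t_{z_1}, ..., t_{z_{j-1}}: the permuted context
    # Stand-in predictor: mean of context embeddings -> vocabulary logits.
    # (XLNet instead uses two-stream self-attention, which also encodes the target position.)
    logits = head(embed(context).mean(dim=0))
    loss = loss + nn.functional.cross_entropy(logits[None, :], tokens[z[j]][None])
loss = loss / (N - c)
```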
UniLM. UniLM (Dong et al., 2019) adopts three objectives for pre-training a Transformer network: (a) language modelling, (b) masked language modelling, and (c) sequence-to-sequence language modelling (seq2seq LM). To implement the three objectives in a single network, UniLM utilizes specific self-attention masks to control what context each prediction conditions on. For example, MLM can attend to its bidirectional contexts, while seq2seq LM can attend to bidirectional contexts for source sequences and left contexts only for target sequences.

ELECTRA. Compared to BERT, ELECTRA (Clark et al., 2019) proposes a more effective pre-training method. Instead of corrupting some positions of the inputs with [MASK], ELECTRA replaces some tokens of the inputs with plausible alternatives sampled from a small generator network. ELECTRA then trains a discriminator to predict whether each token in the corrupted input was replaced by the generator or not. The pre-trained discriminator can be used in downstream tasks for fine-tuning, improving upon the pre-trained representation learned by the generator.

MASS. Although BERT achieves state-of-the-art performance for many natural language understanding tasks, it cannot be easily used for natural language generation. MASS (Song et al., 2019) uses masked sequences to pre-train sequence-to-sequence models. More specifically, MASS adopts an encoder-decoder framework and extends the MLM objective. The encoder takes as input a sequence where consecutive tokens are masked, and the decoder predicts these masked consecutive tokens autoregressively. MASS achieves significant improvements over baselines without pre-training or with other pre-training methods on a variety of zero/low-resource language generation tasks, including neural machine translation, text summarization and conversational response generation.

T5. Raffel et al. (2019) propose T5 (Text-to-Text Transfer Transformer), unifying natural language understanding and generation by converting the data into a text-to-text format and applying an encoder-decoder framework.
T5 introduces a new pre-training dataset, the Colossal Clean Crawled Corpus, built by cleaning web pages from Common Crawl. T5 also systematically compares previous methods in terms of pre-training objectives, architectures, pre-training datasets, and transfer approaches. T5 adopts a text infilling objective (where spans of text are replaced with a single mask token), longer training, multi-task pre-training on GLUE or SuperGLUE, fine-tuning on each individual GLUE and SuperGLUE task, and beam search.

For fine-tuning, to convert the input data into a text-to-text framework, T5 utilizes the token vocabulary of the decoder as the prediction labels. For example, the tokens "entailment", "contradiction", and "neutral" are used as the labels for natural language inference tasks. For regression tasks (e.g. STS-B (Cer et al., 2017)), T5 simply rounds the scores to the nearest multiple of 0.2 and converts the results to literal string representations (e.g. 2.57 is converted to the string "2.6"). T5 also adds a task-specific prefix to each input sequence to specify its task. For instance, T5 adds the prefix "translate English to German" to each input sequence like "That is good." for English-to-German translation datasets.
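A minimal sketch of this text-to-text conversion (plain Python; the helper and the "stsb" prefix are hypothetical and not T5's actual preprocessing code, while the "translate English to German" prefix and the label-string convention follow the description above):

```python
def to_text_to_text(task: str, text: str, label) -> tuple:
    """Convert a labelled example into a (source, target) string pair, T5-style."""
    if task == "sts-b":                      # regression: round to the nearest multiple of 0.2
        target = f"{round(float(label) / 0.2) * 0.2:.1f}"
        return f"stsb sentence: {text}", target
    if task == "translate_en_de":
        return f"translate English to German: {text}", label
    return f"{task}: {text}", str(label)     # e.g. NLI labels such as "entailment"

print(to_text_to_text("sts-b", "A man plays guitar. [SEP] A person makes music.", 2.57))
# ('stsb sentence: A man plays guitar. [SEP] A person makes music.', '2.6')
```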
BART. The BART model (Lewis et al., 2019) introduces additional noising functions beyond MLM for pre-training sequence-to-sequence models. First, the input sequence is corrupted using an arbitrary noising function. Then, the corrupted input is reconstructed by a Transformer network trained using teacher forcing (Williams and Zipser, 1989). BART evaluates a wide variety of noising functions, including token masking, token deletion, text infilling, document rotation, and sentence shuffling (randomly shuffling the word order of a sentence). The best performance is achieved by using both sentence shuffling and text infilling. BART matches the performance of RoBERTa on GLUE and SQuAD and achieves state-of-the-art performance on a variety of text generation tasks.

3.2 Supervised Objectives

Pre-training on the ImageNet dataset (which has supervision about the objects in images) before fine-tuning on downstream tasks has become the de facto standard in the computer vision community. Motivated by the success of supervised pre-training in computer vision, some work (Conneau et al., 2017; McCann et al., 2017; Subramanian et al., 2018) utilizes data-rich tasks in NLP to learn transferable representations.

CoVe (McCann et al., 2017) shows that the representations learned from machine translation are transferable to downstream tasks. CoVe uses a deep LSTM encoder from a sequence-to-sequence model trained for machine translation to obtain contextual embeddings. Empirical results show that augmenting non-contextualized word representations (Mikolov et al., 2013; Pennington et al., 2014) with CoVe embeddings improves performance over a wide variety of common NLP tasks, such as sentiment analysis, question classification, entailment, and question answering. InferSent (Conneau et al., 2017) obtains contextualized representations from a natural language inference model pre-trained on SNLI. Subramanian et al. (2018) use multi-task learning to pre-train a sequence-to-sequence model for obtaining general representations, where the tasks include skip-thought (Kiros et al., 2015), machine translation, constituency parsing, and natural language inference.

4 Cross-lingual Polyglot Pre-training for Contextual Embeddings

Cross-lingual polyglot pre-training aims to learn joint multi-lingual representations, enabling knowledge transfer from data-rich languages like English to data-scarce languages like Romanian. Based on whether joint training and a shared vocabulary are used, we divide previous work into three categories.

Joint training & shared vocabulary. Artetxe and Schwenk (2019) use a BiLSTM encoder-decoder framework with a shared BPE vocabulary for 93 languages. The framework is pre-trained using parallel corpora, including Europarl and Tanzil. The contextual embeddings from the encoder are used to train classifiers using English corpora for downstream tasks. As the embedding space and the encoder are shared, the resulting classifiers can be transferred to any of the 93 languages without further modification. Experiments show that these classifiers achieve competitive performance on cross-lingual natural language inference, cross-lingual document classification, and parallel corpus mining.

Rosita (Mulcaire et al., 2019) pre-trains a language model using text from different languages, showing the benefits of polyglot learning on low-resource languages.
Recently, the authors of BERT developed a multi-lingual BERT² which is pre-trained using the Wikipedia dump for more than 100 languages.

XLM (Lample and Conneau, 2019) uses three pre-training methods for learning cross-lingual language models: (1) causal language modelling, where the model is trained to predict $p(t_i \mid t_1, t_2, \ldots, t_{i-1})$; (2) masked language modelling; and (3) translation language modelling (TLM). For TLM, parallel corpora are used, and tokens in both source and target sequences are masked for learning cross-lingual associations. XLM performs strongly on cross-lingual classification, unsupervised machine translation, and supervised machine translation. XLM-R (Conneau et al., 2019) scales up XLM by training a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. XLM-R shows that large-scale multi-lingual pre-training leads to significant performance gains for a wide range of cross-lingual transfer tasks.

Joint training & separate vocabularies. Wu et al. (2019) study the emergence of cross-lingual structures in pre-trained multi-lingual language models. They find that cross-lingual transfer is possible even when there is no shared vocabulary across the monolingual corpora, and that there are universal latent symmetries in the embedding spaces of different languages.

Separate training & separate vocabularies. Artetxe et al. (2019) use a four-step method for obtaining multi-lingual embeddings. Suppose we have the monolingual sequences of two languages L1 and L2: (1) pre-train BERT with the vocabulary of L1 using L1's monolingual data; (2) replace the vocabulary of L1 with the vocabulary of L2 and train new vocabulary embeddings, while freezing the other parameters, using L2's monolingual data; (3) fine-tune the BERT model for a downstream task using labeled data in L1, while freezing L1's vocabulary embeddings; (4) replace the fine-tuned BERT with L2's vocabulary embeddings for zero-shot transfer tasks.

²https://github.com/google-research/bert/blob/master/multilingual.md

5 Downstream Learning

Once learned, contextual embeddings have demonstrated impressive performance when used downstream on various learning problems. Here we describe the ways in which contextual embeddings are used downstream, the ways in which one can avoid forgetting information in the embeddings during downstream learning, and how they can be specialized to multiple learning tasks.

5.1 Ways to Use Contextual Embeddings Downstream

There are three main ways to use pre-trained contextual embeddings in downstream tasks: (1) feature-based methods, (2) fine-tuning methods, and (3) adapter methods.

Feature-based. One example of a feature-based approach is the method used by ELMo (Peters et al., 2018). Specifically, as shown in Equation (2), ELMo freezes the weights of the pre-trained contextual embedding model and forms a linear combination of its internal representations. The linearly-combined representations are then used as features for task-specific architectures. The benefit of feature-based models is that they can use state-of-the-art handcrafted architectures for specific tasks.

Fine-tuning. Fine-tuning works as follows: starting with the weights of the pre-trained contextual embedding model, fine-tuning makes small adjustments to them in order to specialize them to a specific downstream task.
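The contrast between the feature-based and fine-tuning strategies can be sketched as follows (assuming PyTorch; the LSTM encoder is a hypothetical stand-in for any pre-trained contextual embedding model):

```python
import torch.nn as nn

# Hypothetical stand-in for a pre-trained contextual encoder (weights come from pre-training).
encoder = nn.LSTM(input_size=128, hidden_size=128, num_layers=2, batch_first=True)
classifier = nn.Linear(128, 3)               # small task-specific head

# Feature-based: freeze the encoder and train only the task-specific architecture.
for p in encoder.parameters():
    p.requires_grad = False
feature_based_params = list(classifier.parameters())

# Fine-tuning: start from the pre-trained weights and update everything,
# making small adjustments that specialize the encoder to the downstream task.
for p in encoder.parameters():
    p.requires_grad = True
fine_tuning_params = list(encoder.parameters()) + list(classifier.parameters())
```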
One stream of work applies minimal changes to pre-trained models to take full advantage of their parameters. The most straightforward way is adding linear layers on top of the pre-trained models (Devlin et al., 2018; Lan et al., 2019). Another approach (Radford et al., 2019; Raffel et al., 2019) uses universal data formats without introducing new parameters for downstream tasks. To apply pre-trained models to structurally different tasks, where task-specific architectures are used, as much of the model as possible is initialized with pre-trained weights. For instance, XLM (Lample and Conneau, 2019) applies two pre-trained monolingual language models to initialize the encoder and the decoder for machine translation, respectively, leaving only the cross-attention weights randomly initialized.

Adapters. Adapters (Rebuffi et al., 2017; Stickland and Murray, 2019) are small modules added between layers of pre-trained models, to be trained in a multi-task learning setting. The parameters of the pre-trained model are fixed while tuning these adapter modules. Compared to previous work that fine-tunes a separate pre-trained model for each task, a model with shared adapters for all tasks often requires fewer parameters.
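A minimal sketch of an adapter module (assuming PyTorch; the bottleneck design with a residual connection follows common adapter implementations and is illustrative, as the exact layout varies across papers):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck module inserted between (frozen) layers of a pre-trained model."""
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)    # project down to a small dimension
        self.up = nn.Linear(bottleneck, d_model)      # project back up

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))  # residual: starts close to identity

# Only the adapter parameters are trained; the pre-trained layers stay frozen.
adapter = Adapter()
h = torch.randn(2, 16, 768)        # hidden states from a frozen pre-trained layer
out = adapter(h)
```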
5.2 Countering Catastrophic Forgetting

Learning on downstream tasks is prone to overwrite the information from pre-trained models, which is widely known as catastrophic forgetting (McCloskey and Cohen, 1989; d'Autume et al., 2019). Previous work combats this by (1) freezing layers, (2) using adaptive learning rates, and (3) regularization.

Freezing layers. Motivated by layer-wise training of neural networks (Hinton et al., 2006), training certain layers while freezing others can potentially reduce forgetting during fine-tuning. Different layer-wise tuning schedules have been studied. Long et al. (2015) freeze all layers except the top layer. Felbo et al. (2017) use "chain-thaw", which sequentially unfreezes and fine-tunes one layer at a time. Howard and Ruder (2018) gradually unfreeze all layers one by one from top to bottom. Chronopoulou et al. (2019) apply a three-stage fine-tuning schedule: (a) randomly-initialized parameters are updated for n epochs, (b) the pre-trained parameters (except word embeddings) are then fine-tuned, and (c) finally, all parameters are fine-tuned.

Adaptive learning rates. Another method to mitigate catastrophic forgetting is to use adaptive learning rates. As it is believed that the lower layers of pre-trained models tend to capture general language knowledge (Tenney et al., 2019a), Howard and Ruder (2018) use lower learning rates for lower layers when fine-tuning.

Regularization. Regularization limits the fine-tuned parameters to be close to the pre-trained parameters. Wiese et al. (2017) minimize the Euclidean distance between the fine-tuned and pre-trained parameters. Kirkpatrick et al. (2017) use the Fisher information matrix to protect the weights that are identified as essential for the pre-trained models.
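A minimal sketch of such a distance-based regularizer (assuming PyTorch; the penalty weight is a placeholder):

```python
import torch

def distance_penalty(model, pretrained_state, weight=0.01):
    """Penalize the squared Euclidean distance between fine-tuned and pre-trained parameters."""
    penalty = 0.0
    for name, param in model.named_parameters():
        penalty = penalty + ((param - pretrained_state[name]) ** 2).sum()
    return weight * penalty

# During fine-tuning: loss = task_loss + distance_penalty(model, pretrained_state),
# where pretrained_state = {k: v.clone().detach() for k, v in model.state_dict().items()}
# is captured before any fine-tuning updates.
```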
5.3 Multi-task Fine-tuning

Multi-task learning on downstream tasks (Liu et al., 2019b; Wang et al., 2019a; Jozefowicz et al., 2016) obtains general representations across tasks and achieves strong performance on each individual task. MT-DNN (Liu et al., 2019b) fine-tunes BERT on all the GLUE tasks, improving the GLUE benchmark score to 82.7%. MT-DNN also demonstrates that the representations from multi-task learning obtain better performance on domain adaptation compared to BERT. Wang et al. (2019a) investigate further, non-GLUE tasks, such as skip-thought and Reddit response generation, for multi-task learning. T5 (Raffel et al., 2019) studies various settings of multi-task learning and finds that using multi-task learning before fine-tuning on each task performs best.

6 Model Compression

As many pre-trained language models have a prohibitive memory footprint and latency, it is challenging to deploy them in resource-constrained environments. To address this, model compression (Cheng et al., 2017), which has gained popularity in recent years for shrinking large neural networks, has been investigated for compressing contextual embedding models. Work on compressing language models utilizes (1) low-rank approximation, (2) knowledge distillation, and (3) weight quantization, to make the models usable in embedded systems and edge devices.

Low-rank approximation. Methods that learn low-rank approximations seek to compress the full-rank model weight matrices into low-rank matrices, thereby reducing the effective number of model parameters. As the embedding matrices usually account for a large portion of model parameters (e.g. 21% for BERT-Base), ALBERT (Lan et al., 2019) approximates the embedding matrix $E \in \mathbb{R}^{V \times d}$ as the product of two smaller matrices, $E_1 \in \mathbb{R}^{V \times d'}$ and $E_2 \in \mathbb{R}^{d' \times d}$, where $d' \ll d$.

Knowledge distillation. A method called "knowledge distillation" was proposed by Hinton et al. (2015), where the "knowledge" encoded in a teacher network is transferred to a student network. Hinton et al. (2015) use the soft target probabilities, output by the teacher network, to