Fast and Accurate Entity Recognition with Iterated Dilated Convolutions

Emma Strubell, Patrick Verga, David Belanger, Andrew McCallum
College of Information and Computer Sciences
University of Massachusetts Amherst
{strubell, pat, belanger, mccallum}@cs.umass.edu

arXiv:1702.02098v3 [cs.CL] 22 Jul 2017

Abstract

Today when many practitioners run basic NLP on the entire web and large-volume traffic, faster methods are paramount to saving time and energy costs. Recent advances in GPU hardware have led to the emergence of bi-directional LSTMs as a standard method for obtaining per-token vector representations serving as input to labeling tasks such as NER (often followed by prediction in a linear-chain CRF). Though expressive and accurate, these models fail to fully exploit GPU parallelism, limiting their computational efficiency. This paper proposes a faster alternative to Bi-LSTMs for NER: Iterated Dilated Convolutional Neural Networks (ID-CNNs), which have better capacity than traditional CNNs for large context and structured prediction. Unlike LSTMs, whose sequential processing on sentences of length N requires O(N) time even in the face of parallelism, ID-CNNs permit fixed-depth convolutions to run in parallel across entire documents. We describe a distinct combination of network structure, parameter sharing and training procedures that enable dramatic 14-20x test-time speedups while retaining accuracy comparable to the Bi-LSTM-CRF. Moreover, ID-CNNs trained to aggregate context from the entire document are even more accurate while maintaining 8x faster test time speeds.

1 Introduction

In order to democratize large-scale NLP and information extraction while minimizing our environmental footprint, we require fast, resource-efficient methods for sequence tagging tasks such as part-of-speech tagging and named entity recognition (NER). Speed is not sufficient of course: they must also be expressive enough to tolerate the tremendous lexical variation in input data.

The massively parallel computation facilitated by GPU hardware has led to a surge of successful neural network architectures for sequence labeling (Ling et al., 2015; Ma and Hovy, 2016; Chiu and Nichols, 2016; Lample et al., 2016). While these models are expressive and accurate, they fail to fully exploit the parallelism opportunities of a GPU, and thus their speed is limited. Specifically, they employ either recurrent neural networks (RNNs) for feature extraction, or Viterbi inference in a structured output model, both of which require sequential computation across the length of the input.

Instead, parallelized runtime independent of the length of the sequence saves time and energy costs, maximizing GPU resource usage and minimizing the amount of time it takes to train and evaluate models. Convolutional neural networks (CNNs) provide exactly this property (Kim, 2014; Kalchbrenner et al., 2014). Rather than composing representations incrementally over each token in a sequence, they apply filters in parallel across the entire sequence at once. Their computational cost grows with the number of layers, but not the input size, up to the memory and threading limitations of the hardware. This provides, for example, audio generation models that can be trained in parallel (van den Oord et al., 2016). Despite the clear computational advantages of CNNs, RNNs have become the standard method for composing deep representations of text.
This is because a token encoded by a bidirectional RNN will incorporate evidence from the entire input sequence, but the CNN's representation is limited by
the effective input width[1] of the network: the size of the input context which is observed, directly or indirectly, by the representation of a token at a given layer in the network. Specifically, in a network composed of a series of stacked convolutional layers of convolution width w, the number r of context tokens incorporated into a token's representation at a given layer l is given by r = l(w − 1) + 1. The number of layers required to incorporate the entire input context grows linearly with the length of the sequence. To avoid this scaling, one could pool representations across the sequence, but this is not appropriate for sequence labeling, since it reduces the output resolution of the representation.

In response, this paper presents an application of dilated convolutions (Yu and Koltun, 2016) for sequence labeling (Figure 1). For dilated convolutions, the effective input width can grow exponentially with the depth, with no loss in resolution at each layer and with a modest number of parameters to estimate. Like typical CNN layers, dilated convolutions operate on a sliding window of context over the sequence, but unlike conventional convolutions, the context need not be consecutive; the dilated window skips over every dilation width d inputs. By stacking layers of dilated convolutions of exponentially increasing dilation width, we can expand the size of the effective input width to cover the entire length of most sequences using only a few layers: the size of the effective input width for a token at layer l is now 2^{l+1} − 1. More concretely, just four stacked dilated convolutions of width 3 produce token representations with an effective input width of 31 tokens – longer than the average sentence length (23) in the Penn TreeBank.

Figure 1: A dilated CNN block with maximum dilation width 4 and filter width 3. Neurons contributing to a single highlighted neuron in the last layer are also highlighted.

Our overall iterated dilated CNN architecture (ID-CNN) repeatedly applies the same block of dilated convolutions to token-wise representations. This parameter sharing prevents overfitting and also provides opportunities to inject supervision on intermediate activations of the network. Similar to models that use logits produced by an RNN, the ID-CNN provides two methods for performing prediction: we can predict each token's label independently, or by running Viterbi inference in a chain-structured graphical model.

In experiments on CoNLL 2003 and OntoNotes 5.0 English NER, we demonstrate significant speed gains of our ID-CNNs over various recurrent models, while maintaining similar F1 performance. When performing prediction using independent classification, the ID-CNN consistently outperforms a bidirectional LSTM (Bi-LSTM), and performs on par with inference in a CRF with logits from a Bi-LSTM (Bi-LSTM-CRF). As an extractor of per-token logits for a CRF, our model out-performs the Bi-LSTM-CRF. We also apply ID-CNNs to entire documents, where independent token classification is as accurate as the Bi-LSTM-CRF while decoding almost 8× faster. The clear accuracy gains resulting from incorporating broader context suggest that these models could similarly benefit many other context-sensitive NLP tasks which have until now been limited by the computational complexity of existing context-rich models.[2]

[1] What we call effective input width here is known as the receptive field in the vision literature, drawing an analogy to the visual receptive field of a neuron in the retina.
[2] Our implementation in TensorFlow (Abadi et al., 2015) is available at: https://github.com/iesl/dilated-cnn-ner
2 Background

2.1 Conditional Probability Models for Tagging

Let x = [x_1, ..., x_T] be our input text and y = [y_1, ..., y_T] be per-token output tags. Let D be the domain size of each y_i. We predict the most likely y, given a conditional model P(y|x). This paper considers two factorizations of the conditional distribution. First, we have

P(y|x) = \prod_{t=1}^{T} P(y_t \mid F(x)),    (1)

where the tags are conditionally independent given some features for x. Given these features, O(D) prediction is simple and parallelizable across the length of the sequence. However, feature extraction may not necessarily be parallelizable. For example, RNN-based features require iterative passes along the length of x.

We also consider a linear-chain CRF model that couples all of y together:

P(y|x) = \frac{1}{Z_x} \prod_{t=1}^{T} \psi_t(y_t \mid F(x)) \, \psi_p(y_t, y_{t-1}),    (2)

where ψ_t is a local factor, ψ_p is a pairwise factor that scores consecutive tags, and Z_x is the partition function (Lafferty et al., 2001). To avoid overfitting, ψ_p does not depend on the timestep t or the input x in our experiments. Prediction in this model requires global search using the O(D²T) Viterbi algorithm.

CRF prediction explicitly reasons about interactions among neighboring output tags, whereas prediction in the first model compiles this reasoning into the feature extraction step (Liang et al., 2008). The suitability of such compilation depends on the properties and quantity of the data. While CRF prediction requires non-trivial search in output space, it can guarantee that certain output constraints, such as for IOB tagging (Ramshaw and Marcus, 1999), will always be satisfied. It may also have better sample complexity, as it imposes more prior knowledge about the structure of the interactions among the tags (London et al., 2016). However, it has worse computational complexity than independent prediction.
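To make the contrast concrete, here is a small NumPy sketch (an illustration, not the authors' TensorFlow code) of both decoding strategies over toy per-token scores: independent per-token argmax for Eqn. (1), and Viterbi search with a shared transition matrix playing the role of ψ_p in Eqn. (2). The array shapes and random values are assumptions chosen only for demonstration.

```python
import numpy as np

def greedy_decode(logits):
    """Independent per-token prediction (Eqn. 1): O(D) per token, parallel over t."""
    return logits.argmax(axis=-1)

def viterbi_decode(logits, transitions):
    """Linear-chain Viterbi (Eqn. 2): an O(D^2 T) sequential dynamic program."""
    T, D = logits.shape
    score = logits[0].copy()                # best score of any path ending in each tag at t=0
    backpointers = np.zeros((T, D), dtype=int)
    for t in range(1, T):
        # trellis[i, j]: best path ending in tag i at t-1, then transitioning to tag j at t
        trellis = score[:, None] + transitions + logits[t][None, :]
        backpointers[t] = trellis.argmax(axis=0)
        score = trellis.max(axis=0)
    # follow back-pointers from the best final tag
    best = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        best.append(int(backpointers[t, best[-1]]))
    return np.array(best[::-1])

# toy example: 5 tokens, 3 tags
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3))            # per-token scores from some feature extractor F(x)
transitions = rng.normal(size=(3, 3))       # pairwise scores, shared across timesteps
print(greedy_decode(logits))
print(viterbi_decode(logits, transitions))
```

The position loop in viterbi_decode is exactly the sequential computation the paper seeks to avoid at feature-extraction time, whereas greedy_decode is a single argmax that parallelizes over the whole sequence.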
3 Dilated Convolutions

CNNs in NLP are typically one-dimensional, applied to a sequence of vectors representing tokens rather than to a two-dimensional grid of vectors representing pixels. In this setting, a convolutional neural network layer is equivalent to applying an affine transformation W_c to a sliding window of width r tokens on either side of each token in the sequence. Here, and throughout the paper, we do not explicitly write the bias terms in affine transformations. The convolutional operator applied to each token x_t with output c_t is defined as:

c_t = W_c \bigoplus_{k=0}^{r} x_{t \pm k},    (3)

where ⊕ is vector concatenation. Dilated convolutions perform the same operation, except rather than transforming adjacent inputs, the convolution is defined over a wider effective input width by skipping over δ inputs at a time, where δ is the dilation width. We define the dilated convolution operator:

c_t = W_c \bigoplus_{k=0}^{r} x_{t \pm k\delta}.    (4)

A dilated convolution of width 1 is equivalent to a simple convolution. Using the same number of parameters as a simple convolution with the same radius (i.e. W_c has the same dimensionality), the δ > 1 dilated convolution incorporates broader context into the representation of a token than a simple convolution.
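A minimal NumPy sketch of Eqns. (3) and (4); setting delta=1 recovers the simple convolution. The zero-padding at the sequence boundaries and the array shapes are illustrative assumptions, not details taken from the released implementation.

```python
import numpy as np

def dilated_conv1d(x, W, delta=1):
    """Apply c_t = W_c [concatenation of x_{t +/- k*delta}, k = 0..r] at every position t.

    x: (T, d_in) token vectors; W: (d_out, (2r+1)*d_in); delta: dilation width.
    Out-of-range positions are zero-padded; the bias is omitted, as in the paper.
    """
    T, d_in = x.shape
    window = W.shape[1] // d_in            # 2r + 1 taps
    r = (window - 1) // 2
    pad = r * delta                        # pad so every offset t + k*delta is in range
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((T, W.shape[0]))
    for t in range(T):
        # gather the (possibly non-consecutive) window around token t and concatenate it
        taps = [xp[pad + t + k * delta] for k in range(-r, r + 1)]
        out[t] = W @ np.concatenate(taps)
    return out

x = np.random.randn(10, 8)                 # 10 tokens, 8-dimensional embeddings
W = np.random.randn(16, 3 * 8)             # width-3 filter (r = 1), 16 output dimensions
same_as_simple = dilated_conv1d(x, W, delta=1)   # ordinary convolution
wider_context = dilated_conv1d(x, W, delta=2)    # skips every other neighbour
```

With the same W, delta=2 spans five positions while reading only three of them, which is the parameter saving the text describes.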
3.1 Multi-Scale Context Aggregation

We can leverage the ability of dilated convolutions to incorporate global context without losing important local information by stacking dilated convolutions of increasing width. First described for pixel classification in computer vision, Yu and Koltun (2016) achieve state-of-the-art results on image segmentation benchmarks by stacking dilated convolutions with exponentially increasing rates of dilation, a technique they refer to as multi-scale context aggregation. By feeding the outputs of each dilated convolution as the input to the next, increasingly non-local information is incorporated into each pixel's representation. Performing a dilation-1 convolution in the first layer ensures that no pixels within the effective input width of any pixel are excluded. By doubling the dilation width at each layer, the size of the effective input width grows exponentially while the number of parameters grows only linearly with the number of layers, so a pixel representation quickly incorporates rich global evidence from the entire image.

4 Iterated Dilated CNNs

Stacked dilated CNNs can easily incorporate global information from a whole sentence or document. For example, with a radius of 1 and 4 layers of dilated convolutions, the effective input width of each token is 31 tokens, which exceeds the average sentence length (23) in the Penn TreeBank corpus. With a radius of 2 and 8 layers of dilated convolutions, the effective input width exceeds 1,000 tokens, long enough to encode a full newswire document.
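These effective-input-width figures follow from a few lines of arithmetic; the helper below is a sketch of that bookkeeping only, assuming the dilation doubles at every layer as described above.

```python
def effective_width_stacked(layers, width=3):
    """Plain stacked convolutions of filter width w: r = l*(w - 1) + 1."""
    return layers * (width - 1) + 1

def effective_width_dilated(layers, radius=1):
    """Stacked dilated convolutions with dilations 1, 2, 4, ... doubling per layer."""
    span = 1
    for l in range(layers):
        span += 2 * radius * (2 ** l)
    return span

print(effective_width_stacked(4, width=3))    # 9 tokens for plain stacking
print(effective_width_dilated(4, radius=1))   # 31 tokens (= 2**(4+1) - 1), longer than an average sentence
print(effective_width_dilated(8, radius=2))   # 1021 tokens, enough for a full newswire document
```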
Unfortunately, simply increasing the depth of stacked dilated CNNs causes considerable overfitting in our experiments. In response, we present Iterated Dilated CNNs (ID-CNNs), which instead apply the same small stack of dilated convolutions multiple times, each iterate taking as input the result of the last application. Repeatedly employing the same parameters in a recurrent fashion provides both broad effective input width and desirable generalization capabilities. We also obtain significant accuracy gains with a training objective that strives for accurate labeling after each iterate, allowing follow-on iterations to observe and resolve dependency violations.

4.1 Model Architecture

The network takes as input a sequence of T vectors x_t, and outputs a sequence of per-class scores h_t, which serve either as the local conditional distributions of Eqn. (1) or the local factors ψ_t of Eqn. (2).

We denote the jth dilated convolutional layer of dilation width δ as D_δ^{(j)}. The first layer in the network is a dilation-1 convolution D_1^{(0)} that transforms the input to a representation i_t:

i_t = D_1^{(0)} x_t    (5)

Next, L_c layers of dilated convolutions of exponentially increasing dilation width are applied to i_t, folding in increasingly broader context into the embedded representation of x_t at each layer. Let r() denote the ReLU activation function (Glorot et al., 2011). Beginning with c_t^{(0)} = i_t, we define the stack of layers with the following recurrence:

c_t^{(j)} = r\left( D_{2^{L_c-1}}^{(j-1)} c_t^{(j-1)} \right)    (6)

and add a final dilation-1 layer to the stack:

c_t^{(L_c+1)} = r\left( D_1^{(L_c)} c_t^{(L_c)} \right)    (7)

We refer to this stack of dilated convolutions as a block B(·), which has output resolution equal to its input resolution. To incorporate even broader context without over-fitting, we avoid making B deeper, and instead iteratively apply B L_b times, introducing no extra parameters. Starting with b_t^{(1)} = B(i_t):

b_t^{(k)} = B\left( b_t^{(k-1)} \right)    (8)

We apply a simple affine transformation W_o to this final representation to obtain per-class scores for each token x_t:

h_t^{(L_b)} = W_o b_t^{(L_b)}    (9)
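The sketch below strings Eqns. (5)-(9) together in NumPy, reusing the dilated_conv1d helper from the Section 3 sketch. The hidden size, the doubling dilation schedule inside the block, the ReLU on the block output, and the random initialization are illustrative assumptions, not the hyperparameters of the released TensorFlow model.

```python
import numpy as np
# assumes dilated_conv1d from the earlier sketch is in scope

relu = lambda z: np.maximum(z, 0.0)

class IDCNN:
    """Iterated dilated CNN: one block of dilated convolutions applied L_b times
    with tied parameters, followed by a per-token affine output layer."""

    def __init__(self, d_in, d_hid, n_classes, L_c=4, L_b=3, width=3, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda shape: rng.normal(scale=0.1, size=shape)
        self.W_in = init((d_hid, width * d_in))        # D_1^(0), Eqn. (5)
        self.block = [init((d_hid, width * d_hid))     # dilated layers inside the block, Eqn. (6)
                      for _ in range(L_c)]
        self.W_last = init((d_hid, width * d_hid))     # final dilation-1 layer, Eqn. (7)
        self.W_o = init((n_classes, d_hid))            # output projection, Eqn. (9)
        self.L_b = L_b

    def _block(self, h):
        # dilations 1, 2, 4, ... inside the block, then a dilation-1 layer on top
        for j, W in enumerate(self.block):
            h = relu(dilated_conv1d(h, W, delta=2 ** j))
        return relu(dilated_conv1d(h, self.W_last, delta=1))

    def forward(self, x):
        """x: (T, d_in) token embeddings -> list of (T, n_classes) scores, one per block."""
        b = dilated_conv1d(x, self.W_in, delta=1)      # i_t
        scores = []
        for _ in range(self.L_b):                      # Eqn. (8): iterate the SAME block (tied weights)
            b = self._block(b)
            scores.append(b @ self.W_o.T)              # h_t^(k)
        return scores

model = IDCNN(d_in=8, d_hid=16, n_classes=5)
per_block_scores = model.forward(np.random.randn(12, 8))
```

Because self.block is reused on every iteration, depth of context grows with L_b while the parameter count stays fixed, which is the parameter-sharing property the paper relies on to avoid overfitting.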
4.2 Training

Our main focus is to apply the ID-CNN as an encoder to produce per-token logits for the first conditional model described in Sec. 2.1, where tags are conditionally independent given deep features, since this will enable prediction that is parallelizable across the length of the input sequence. Here, maximum likelihood training is straightforward because the likelihood decouples into the sum of the likelihoods of independent logistic regression problems for every tag, with natural parameters given by Eqn. (9):

\frac{1}{T} \sum_{t=1}^{T} \log P(y_t \mid h_t^{(L_b)})    (10)

We can also use the ID-CNN as logits for the CRF model (Eqn. (2)), where the partition function and its gradient are computed using the forward-backward algorithm.

We next present an alternative training method that helps bridge the gap between these two techniques. Sec. 2.1 identifies that the CRF has preferable sample complexity and accuracy since prediction directly reasons in the space of structured outputs. In response, we compile some of this reasoning in output space into ID-CNN feature extraction. Instead of explicit reasoning over output labels during inference, we train the network such that each block is predictive of output labels. Subsequent blocks learn to correct dependency violations of their predecessors, refining the final sequence prediction.

To do so, we first define predictions of the model after each of the L_b applications of the block. Let h_t^{(k)} be the result of applying the matrix W_o from (9) to b_t^{(k)}, the output of block k. We minimize the average of the losses for each application of the block:

\frac{1}{L_b} \sum_{k=1}^{L_b} \frac{1}{T} \sum_{t=1}^{T} \log P(y_t \mid h_t^{(k)}).    (11)

By rewarding accurate predictions after each application of the block, we learn a model where later blocks are used to refine initial predictions.
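Written as code, Eqn. (11) is just the block-averaged version of the per-token objective in Eqn. (10). The sketch below uses the negative log-likelihood form, assuming the per_block_scores list returned by the model sketch above and hypothetical gold tags supplied for demonstration.

```python
import numpy as np

def token_nll(scores, tags):
    """Average negative log-likelihood of Eqn. (10) for one block's scores.

    scores: (T, n_classes) unnormalized per-token logits; tags: (T,) gold tag ids."""
    z = scores - scores.max(axis=1, keepdims=True)               # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(tags)), tags].mean()

def iterated_loss(per_block_scores, tags):
    """Eqn. (11): average the per-token loss over every application of the block,
    so each iterate is rewarded for predicting the tags on its own."""
    return sum(token_nll(h, tags) for h in per_block_scores) / len(per_block_scores)

tags = np.random.randint(0, 5, size=12)                           # hypothetical gold tags
loss_final_only = token_nll(per_block_scores[-1], tags)           # supervise the last block only
loss_all_blocks = iterated_loss(per_block_scores, tags)           # the iterated ID-CNN objective
```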
The loss also helps reduce the vanishing gradient problem (Hochreiter, 1998) for deep architectures. Such an approach has been applied in a variety of contexts for training very deep networks in computer vision (Romero et al., 2014; Szegedy et al., 2015; Lee et al., 2015; Gülçehre and Bengio, 2016), but not to our knowledge in NLP.

We apply dropout (Srivastava et al., 2014) to the raw inputs x_t and to each block's output b_t^{(k)} to help prevent overfitting. The version of dropout typically used in practice has the undesirable property that the randomized predictor used at train time differs from the fixed one used at test time. Ma et al. (2017) present dropout with expectation-linear regularization, which explicitly regularizes these two predictors to behave similarly. All of our best reported results include such regularization. This is the first investigation of the technique's effectiveness for NLP, including for RNNs. We encourage its further application.

5 Related Work

The state-of-the-art models for sequence labeling include an inference step that searches the space of possible output sequences of a chain-structured graphical model, or approximates this search with a beam (Collobert et al., 2011; Weiss et al., 2015; Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016). These outperform similar systems that use the same features, but independent local predictions. On the other hand, the greedy sequential prediction (Daumé III et al., 2009) approach of Ratinov and Roth (2009), which employs lexicalized features, gazetteers, and word clusters, outperforms CRFs with similar features.

LSTMs (Hochreiter and Schmidhuber, 1997) were used for NER as early as the CoNLL shared task in 2003 (Hammerton, 2003; Tjong Kim Sang and De Meulder, 2003). More recently, a wide variety of neural network architectures for NER have been proposed. Collobert et al. (2011) employ a one-layer CNN with pre-trained word embeddings, capitalization and lexicon features, and CRF-based prediction. Huang et al. (2015) achieved state-of-the-art accuracy on part-of-speech, chunking and NER using a Bi-LSTM-CRF. Lample et al. (2016) proposed two models which incorporated Bi-LSTM-composed character embeddings alongside words: a Bi-LSTM-CRF, and a greedy stack LSTM which uses a simple shift-reduce grammar to compose words into labeled entities. Their Bi-LSTM-CRF obtained the state-of-the-art on four languages without word shape or lexicon features. Ma and Hovy (2016) use CNNs rather than LSTMs to compose characters in a Bi-LSTM-CRF, achieving state-of-the-art performance on part-of-speech tagging and CoNLL NER without lexicons. Chiu and Nichols (2016) evaluate a similar network but propose a novel method for encoding lexicon matches, presenting results on CoNLL and OntoNotes NER. Yang et al. (2016) use GRU-CRFs with GRU-composed character embeddings of words to train a single network on many tasks and languages.

In general, distributed representations for text can provide useful generalization capabilities for NER systems, since they can leverage unsupervised pre-training of distributed word representations (Turian et al., 2010; Collobert et al., 2011; Passos et al., 2014). Though our models would also likely benefit from additional features such as character representations and lexicons, we focus on simpler models which use word embeddings alone, leaving more elaborate input representations to future work.
In these NER approaches, CNNs were used for low-level feature extraction that feeds into alternative architectures. Overall, end-to-end CNNs have mainly been used in NLP for sentence classification, where the output representation is lower resolution than that of the input (Kim, 2014; Kalchbrenner et al., 2014; Zhang et al., 2015; Toutanova et al., 2015). Lei et al. (2015) present a CNN variant where convolutions adaptively skip neighboring words. While the flexibility of this model is powerful, its adaptive behavior is not well-suited to GPU acceleration.

Our work draws on the use of dilated convolutions for image segmentation in the computer vision community (Yu and Koltun, 2016; Chen et al., 2015). Similar to our block, Yu and Koltun (2016) employ a context module of stacked dilated convolutions of exponentially increasing dilation width. Dilated convolutions were recently applied to the task of speech generation (van den Oord et al., 2016), and concurrent with this work, Kalchbrenner et al. (2016) posted a pre-print describing the similar ByteNet network for machine translation that uses dilated convolutions in the encoder and decoder components. Our basic model architecture is similar to that of the ByteNet encoder, except that the inputs to our model are tokens and
not bytes. Additionally, we present a novel loss and parameter sharing scheme to facilitate training models on much smaller datasets than those used by Kalchbrenner et al. (2016). We are the first to use dilated convolutions for sequence labeling.

The broad effective input width of the ID-CNN helps aggregate document-level context. Ratinov and Roth (2009) incorporate document context in their greedy model by adding features based on tagged entities within a large, fixed window of tokens. Prior work has also posed a structured model that couples predictions across the whole document (Bunescu and Mooney, 2004; Sutton and McCallum, 2004; Finkel et al., 2005).

6 Experimental Results

We describe experiments on two benchmark English named entity recognition datasets. On CoNLL-2003 English NER, our ID-CNN performs on par with a Bi-LSTM not only when used to produce per-token logits for structured inference, but the ID-CNN with greedy decoding also performs on par with the Bi-LSTM-CRF while running at more than 14 times the speed. We also observe a performance boost in almost all models when broadening the context to incorporate entire documents, achieving an average F1 of 90.65 on CoNLL-2003, out-performing the sentence-level model while still decoding at nearly 8 times the speed of the Bi-LSTM-CRF.

6.1 Data and Evaluation

We evaluate using labeled data from the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003) and OntoNotes 5.0 (Hovy et al., 2006; Pradhan et al., 2006). Following previous work, we use the same OntoNotes data split used for co-reference resolution in the CoNLL-2012 shared task (Pradhan et al., 2012). For both datasets, we convert the IOB boundary encoding to BILOU, as previous work found this encoding to result in improved performance (Ratinov and Roth, 2009). As in previous work, we evaluate the performance of our models using segment-level micro-averaged F1 score. Hyperparameters that resulted in the best performance on the validation set were selected via grid search. A more detailed description of the data, evaluation, optimization and data pre-processing can be found in the Appendix.

6.2 Baselines

We compare our ID-CNN against strong LSTM and CNN baselines: a Bi-LSTM with local decoding, and one with CRF decoding (Bi-LSTM-CRF). We also compare against a non-dilated CNN architecture with the same number of convolutional layers as our dilated network (4-layer CNN) and one with enough layers to incorporate an effective input width of the same size as that of the dilated network (5-layer CNN), to demonstrate that dilated convolutions aggregate contextual information more effectively than simple convolutions (i.e. using fewer parameters). We also compare our document-level ID-CNNs to a baseline which does not share parameters between blocks (noshare) and one that computes loss only at the last block, rather than after every iterated block of dilated convolutions (1-loss).

We do not compare with deeper or more elaborate CNN architectures for a number of reasons: 1) fast train and test performance are highly desirable for NLP practitioners, and deeper models require more computation time; 2) more complicated models tend to over-fit on this relatively small dataset; and 3) the most accurate deep CNN architectures repeatedly up-sample and down-sample the inputs. We do not compare to stacked LSTMs for similar reasons — a single LSTM is already slower than a 4-layer CNN.
Since our task is sequence labeling, we desire a model that maintains the token-level resolution of the input, making dilated convolutions an elegant solution.

6.3 CoNLL-2003 English NER

6.3.1 Sentence-level prediction

Table 1 lists F1 scores of models predicting with sentence-level context on CoNLL-2003. For models that we trained, we report F1 and standard deviation obtained by averaging over 10 random restarts. The Viterbi-decoding Bi-LSTM-CRF and ID-CNN-CRF, and the greedy ID-CNN, obtain the highest average scores, with the ID-CNN-CRF outperforming the Bi-LSTM-CRF by 0.11 points of F1 on average, and the Bi-LSTM-CRF out-performing the greedy ID-CNN by 0.11 as well. Our greedy ID-CNN outperforms the Bi-LSTM and the 4-layer CNN, which uses the same number of parameters as the ID-CNN, and performs similarly to the 5-layer CNN, which uses more parameters but covers the same effective input width.
All CNN models outperform the Bi-LSTM when paired with greedy decoding, suggesting that CNNs are better token encoders than Bi-LSTMs for independent logistic regression. When paired with Viterbi decoding, our ID-CNN performs on par with the Bi-LSTM, showing that the ID-CNN is also an effective token encoder for structured inference.

Table 1: F1 score of models observing sentence-level context. No models use character embeddings or lexicons. Top models are greedy, bottom models use Viterbi inference.

  Model                      F1
  Ratinov and Roth (2009)    86.82
  Collobert et al. (2011)    86.96
  Lample et al. (2016)       90.33
  Bi-LSTM                    89.34 ± 0.28
  4-layer CNN                89.97 ± 0.20
  5-layer CNN                90.23 ± 0.16
  ID-CNN                     90.32 ± 0.26

  Collobert et al. (2011)    88.67
  Passos et al. (2014)       90.05
  Lample et al. (2016)       90.20
  Bi-LSTM-CRF (re-impl)      90.43 ± 0.12
  ID-CNN-CRF                 90.54 ± 0.18

Our ID-CNN is not only a better token encoder than the Bi-LSTM, but it is also faster. Table 2 lists relative decoding times on the CoNLL development set, compared to the Bi-LSTM-CRF. We report decoding times using the fastest batch size for each method.[3] The ID-CNN model decodes nearly 50% faster than the Bi-LSTM. With Viterbi decoding, the gap closes somewhat, but the ID-CNN-CRF still comes out ahead, about 30% faster than the Bi-LSTM-CRF. The most dramatic speed improvements come when comparing the greedy ID-CNN to the Bi-LSTM-CRF – our ID-CNN is more than 14 times faster than the Bi-LSTM-CRF at test time, with comparable accuracy. The 5-layer CNN, which observes the same effective input width as the ID-CNN but with more parameters, performs at about the same speed as the ID-CNN in our experiments. With a better implementation of dilated convolutions than currently included in TensorFlow, we would expect the ID-CNN to be notably faster than the 5-layer CNN.

[3] For each model, we tried batch sizes b = 2^i with i = 0...11. At scale, speed should increase with batch size, as we could compose each batch of as many sentences of the same length as would fit in GPU memory, requiring no padding and giving CNNs and ID-CNNs even more of a speed advantage.

Table 2: Relative test-time speed of sentence models, using the fastest batch size for each model.[5]

  Model          Speed
  Bi-LSTM-CRF     1×
  Bi-LSTM         9.92×
  ID-CNN-CRF      1.28×
  5-layer CNN    12.38×
  ID-CNN         14.10×

[5] Our ID-CNN could see up to 18× speed-up with a less naive implementation than is included in TensorFlow as of this writing.

We emphasize the importance of the dropout regularizer of Ma et al. (2017) in Table 3, where we observe increased F1 for every model trained with expectation-linear dropout regularization. Dropout is important for training neural network models that generalize well, especially on relatively small NLP datasets such as CoNLL-2003. We recommend this regularizer as a simple and helpful tool for practitioners training neural networks for NLP.

Table 3: Comparison of models trained with and without expectation-linear dropout regularization (DR). DR improves all models.

  Model            w/ DR          w/o DR
  Bi-LSTM          89.34 ± 0.28   88.89 ± 0.30
  4-layer CNN      89.97 ± 0.20   89.74 ± 0.23
  5-layer CNN      90.23 ± 0.16   89.93 ± 0.32
  Bi-LSTM-CRF      90.43 ± 0.12   90.01 ± 0.23
  4-layer ID-CNN   90.32 ± 0.26   89.65 ± 0.30

6.3.2 Document-level prediction

In Table 4 we show that adding document-level context improves every model on CoNLL-2003. Incorporating document-level context further improves our greedy ID-CNN model, attaining 90.65 average F1.
We believe this model sees greater improvement with the addition of document-level context than the Bi-LSTM-CRF due to the ID-CNN learning a feature function better suited for representing broad context, in contrast with the Bi-LSTM which, though better than a simple RNN at encoding long memories of sequences, may reach its limit when provided with sequences more than 1,000 tokens long, such as entire documents.
We also note that our combination of training objective (Eqn. 11) and tied parameters (Eqn. 8) more effectively learns to aggregate this broad context than a vanilla cross-entropy loss or a deep CNN back-propagated from the final neural network layer. Table 5 compares models trained to incorporate entire document context using the document baselines described in Section 6.2.

Table 4: F1 score of models trained to predict document-at-a-time. Our greedy ID-CNN model performs as well as the Bi-LSTM-CRF.

  Model                  F1
  4-layer ID-CNN (sent)  90.32 ± 0.26
  Bi-LSTM-CRF (sent)     90.43 ± 0.12
  4-layer CNN × 3        90.32 ± 0.32
  5-layer CNN × 3        90.45 ± 0.21
  Bi-LSTM                89.09 ± 0.19
  Bi-LSTM-CRF            90.60 ± 0.19
  ID-CNN                 90.65 ± 0.15

Table 5: Comparison of ID-CNNs with 1) loss back-propagated only from the final layer (1-loss) and 2) untied parameters across blocks (noshare).

  Model            F1
  ID-CNN noshare   89.81 ± 0.19
  ID-CNN 1-loss    90.06 ± 0.19
  ID-CNN           90.65 ± 0.15

In Table 6 we show that, in addition to being more accurate, our ID-CNN model is also much faster than the Bi-LSTM-CRF when incorporating context from entire documents, decoding at almost 8 times the speed. On these long sequences, it also tags at more than 4.5 times the speed of the greedy Bi-LSTM, demonstrating the benefit of our ID-CNN's context-aggregating computation that does not depend on the length of the sequence.

Table 6: Relative test-time speed of document models (fastest batch size for each model).

  Model         Speed
  Bi-LSTM-CRF    1×
  Bi-LSTM        4.60×
  ID-CNN         7.96×

6.4 OntoNotes 5.0 English NER

We observe similar patterns on OntoNotes as we do on CoNLL. Table 7 lists overall F1 scores of our models compared to those in the existing literature. The greedy Bi-LSTM out-performs the lexicalized greedy model of Ratinov and Roth (2009), and our ID-CNN out-performs the Bi-LSTM as well as the more complex model of Durrett and Klein (2014), which leverages the parallel co-reference annotation available in the OntoNotes corpus to predict named entities jointly with entity linking and co-reference. Our greedy model is out-performed by the Bi-LSTM-CRF reported in Chiu and Nichols (2016) as well as our own re-implementation, which appears to be the new state-of-the-art on this dataset.

Table 7: F1 score of sentence and document models on OntoNotes.

  Model                       F1             Speed
  Ratinov and Roth (2009)[6]  83.45          –
  Durrett and Klein (2014)    84.04          –
  Chiu and Nichols (2016)     86.19 ± 0.25   –
  Bi-LSTM-CRF                 86.99 ± 0.22    1×
  Bi-LSTM-CRF-Doc             86.81 ± 0.18    1.32×
  Bi-LSTM                     83.76 ± 0.10   24.44×
  ID-CNN-CRF (1 block)        86.84 ± 0.19    1.83×
  ID-CNN-Doc (3 blocks)       85.76 ± 0.13   21.19×
  ID-CNN (3 blocks)           85.27 ± 0.24   13.21×
  ID-CNN (1 block)            84.28 ± 0.10   26.01×

[6] Results as reported in Durrett and Klein (2014), as this data split did not exist at the time of publication.

The gap between our greedy model and those using Viterbi decoding is wider than on CoNLL. We believe this is due to the more diverse set of entities in OntoNotes, which also tend to be much longer – the average length of a multi-token named entity segment in CoNLL is about one token shorter than in OntoNotes. These long entities benefit more from explicit structured constraints enforced in Viterbi decoding. Still, our ID-CNN outperforms all other greedy methods, achieving our goal of learning a better token encoder for structured prediction.

Incorporating greater context significantly boosts the score of our greedy model on OntoNotes, whereas the Bi-LSTM-CRF performs more poorly. In Table 7, we also list the F1 of our ID-CNN model and the Bi-LSTM-CRF model trained on entire document context. For the first time, we see the score decrease when more context is added to the Bi-LSTM-CRF model, though the ID-CNN, whose sentence model attains a lower score than that of the Bi-LSTM-CRF, sees an increase.
We believe the decrease in the Bi-LSTM-CRF model occurs because of the