Contents

Abstract
1 Introduction
1.1 Text Classification Tasks
1.2 Paper Structure
2 Deep Learning Models for Text Classification
2.1 Feed-Forward Neural Networks
2.2 RNN-Based Models
2.3 CNN-Based Models
2.4 Capsule Neural Networks
2.5 Models with Attention Mechanism
2.6 Memory-Augmented Networks
2.7 Transformers
2.8 Graph Neural Networks
2.9 Siamese Neural Networks
2.10 Hybrid Models
2.11 Beyond Supervised Learning
3 Text Classification Datasets
3.1 Sentiment Analysis Datasets
3.2 News Classification Datasets
3.3 Topic Classification Datasets
3.4 QA Datasets
3.5 NLI Datasets
4 Experimental Performance Analysis
4.1 Popular Metrics for Text Classification
4.2 Quantitative Results
5 Challenges and Opportunities
6 Conclusion
Acknowledgments
References
A Deep Neural Network Overview
A.1 Neural Language Models and Word Embedding
A.2 Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM)
A.3 Convolutional Neural Networks (CNNs)
A.4 Encoder-Decoder Models
A.5 Attention Mechanism
A.6 Transformer
arXiv:2004.03705v1 [cs.CL] 6 Apr 2020

Deep Learning Based Text Classification: A Comprehensive Review

Shervin Minaee, Snapchat Inc
Nal Kalchbrenner, Google Brain, Amsterdam
Erik Cambria, Nanyang Technological University, Singapore
Narjes Nikzad, University of Tabriz
Meysam Chenaghlu, University of Tabriz
Jianfeng Gao, Microsoft Research, Redmond

Abstract. Deep learning based models have surpassed classical machine learning based approaches in various text classification tasks, including sentiment analysis, news categorization, question answering, and natural language inference. In this work, we provide a detailed review of more than 150 deep learning based models for text classification developed in recent years, and discuss their technical contributions, similarities, and strengths. We also provide a summary of more than 40 popular datasets widely used for text classification. Finally, we provide a quantitative analysis of the performance of different deep learning models on popular benchmarks, and discuss future research directions.

Additional Key Words and Phrases: Text Classification, Sentiment Analysis, Question Answering, News Categorization, Deep Learning, Natural Language Inference, Topic Classification.

ACM Reference Format: Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 2020. Deep Learning Based Text Classification: A Comprehensive Review. 1, 1 (April 2020), 42 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Text classification, also known as text categorization, is a classical problem in natural language processing (NLP), which aims to assign labels or tags to textual units such as sentences, queries, paragraphs, and documents. It has a wide range of applications including question answering, spam detection, sentiment analysis, news categorization, user intent classification, content moderation, and so on. Text data can come from different sources, for example web data, emails, chats, social media, tickets, insurance claims, user reviews, and questions and answers from customer services, among many others. Text is an extremely rich source of information, but extracting insights from it can be challenging and time-consuming due to its unstructured nature.

Text classification can be performed either through manual annotation or by automatic labeling. With the growing scale of text data in industrial applications, automatic text classification is becoming increasingly important. Approaches to automatic text classification can be grouped into three categories:

• Rule-based methods
• Machine learning (data-driven) based methods
• Hybrid methods
Rule-based methods classify text into different categories using a set of pre-defined rules. For example, any document containing the words “football,” “basketball,” or “baseball” is assigned the “sport” label. These methods require deep domain knowledge, and the resulting systems are difficult to maintain. Machine learning based approaches, on the other hand, learn to classify text based on past observations of the data. Using pre-labeled examples as training data, a machine learning algorithm can learn the inherent associations between pieces of text and their labels. Thus, machine learning based methods can detect hidden patterns in the data, are more scalable, and can be applied to various tasks. This is in contrast to rule-based methods, which need different sets of rules for different tasks. Hybrid methods, as the name suggests, combine rule-based and machine learning methods to make predictions.

Machine learning models have drawn a lot of attention in recent years. Most classical machine learning based models follow a two-step procedure: in the first step, some hand-crafted features are extracted from the documents (or any other textual unit), and in the second step, those features are fed to a classifier to make a prediction. Popular hand-crafted features include bag of words (BoW) and its extensions. Popular choices of classification algorithms include Naïve Bayes, support vector machines (SVM), hidden Markov models (HMM), gradient boosting trees, and random forests. These two-step approaches have several limitations. For example, reliance on hand-crafted features requires tedious feature engineering and analysis to obtain good performance. In addition, the strong dependence on domain knowledge for designing features makes the approach difficult to generalize to new tasks. Finally, these models cannot take full advantage of large amounts of training data because the features (or feature templates) are pre-defined.

A paradigm shift began in 2012, when a deep learning based model, AlexNet [1], won the ImageNet competition by a large margin. Since then, deep learning models have been applied to a wide range of tasks in computer vision and NLP, improving the state of the art [2–5]. These models learn feature representations and perform classification (or regression) in an end-to-end fashion. They not only have the ability to uncover hidden patterns in data, but are also much more transferable from one application to another. Not surprisingly, these models have become the mainstream framework for various text classification tasks in recent years.

In this survey, we review more than 150 deep learning models developed for various text classification tasks, including sentiment analysis, news categorization, topic classification, question answering (QA), and natural language inference (NLI), over the course of the past six years. We group these works into several categories based on their neural network architectures, such as models based on recurrent neural networks (RNNs), convolutional neural networks (CNNs), attention, Transformers, Capsule Nets, and more. The contributions of this paper can be summarized as follows:

• We present a detailed overview of more than 150 deep learning models proposed for text classification.
• We review more than 40 popular text classification datasets.
• We provide a quantitative analysis of the performance of a selected set of deep learning models on 16 popular benchmarks.
• We discuss remaining challenges and future directions.
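As a concrete illustration of the classical two-step pipeline described above (hand-crafted BoW features followed by a separate classifier), the following minimal sketch uses scikit-learn; the toy documents, labels, and parameter choices are hypothetical and not taken from any work surveyed here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled corpus (hypothetical examples).
docs = ["the team won the basketball game",
        "the senate passed the new budget bill",
        "a thrilling football match last night",
        "parliament debated the election reform"]
labels = ["sport", "politics", "sport", "politics"]

# Step 1: hand-crafted BoW features; Step 2: a separate linear classifier.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(docs, labels)
print(model.predict(["who won the football game"]))  # expected: ['sport']
```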
1.1 Text Classification Tasks

This section briefly introduces the text classification tasks discussed in this paper: sentiment analysis, news categorization, topic classification, question answering (QA), and natural language inference (NLI).

Sentiment Analysis. Sentiment analysis is a popular branch of text classification, which aims to analyze people’s opinions in textual data (such as product reviews, movie reviews, and tweets) and extract their polarity and viewpoint. Sentiment classification can be either a binary or a multi-class problem. Binary sentiment analysis classifies texts into positive and negative classes, while multi-class sentiment analysis classifies texts into fine-grained labels or multi-level intensities.
News Categorization. News content is one of the most important sources of information and has a strong influence on people. A news classification system can help users obtain information of interest in real time. Identifying emerging news topics and recommending relevant news based on user interests are two main applications of news classification.

Topic Analysis. Topic analysis tries to automatically derive meaning from texts by identifying their topics. Topic classification is one of the most important component technologies of topic analysis. The goal of topic classification is to assign one or more topics to each document to make it easier to analyze.

Question Answering (QA). There are two types of QA systems: extractive and generative. Extractive QA can be viewed as a special case of text classification. Given a question and a set of candidate answers (e.g., text spans in a given document in SQuAD [6]), the system needs to classify each candidate answer as correct or not. Generative QA learns to generate answers from scratch (for example, using a sequence-to-sequence model). The QA tasks discussed in this paper are extractive QA, unless otherwise stated.

Natural Language Inference (NLI). NLI, also known as recognizing textual entailment (RTE), predicts whether the meaning of one text can be inferred from another. In particular, a system needs to assign to each pair of text units a label such as entailment, contradiction, or neutral [7]. Paraphrasing is a generalized form of NLI, also known as text pair comparison. The task is to measure the semantic similarity of a sentence pair in order to determine whether one sentence is a paraphrase of the other.

1.2 Paper Structure

The rest of the paper is structured as follows: Section 2 presents a comprehensive overview of more than 150 deep learning based text classification models. Section 3 reviews some of the most popular text classification datasets. Section 4 presents a quantitative performance analysis of a selected set of deep learning models on 16 benchmarks. Section 5 discusses the main challenges and future directions for deep learning based text classification methods. Section 6 concludes the paper. Appendix A provides an overview of some popular neural network architectures that are commonly used for text classification.

2 DEEP LEARNING MODELS FOR TEXT CLASSIFICATION

In this section, we review more than 150 deep learning frameworks proposed for various text classification problems. To make them easier to follow, we group these models into the following categories based on their main architectural contributions:

• Models based on feed-forward networks, which view text as a bag of words (Section 2.1).
• Models based on RNNs, which view text as a sequence of words, and are intended to capture word dependencies and text structures (Section 2.2).
• CNN-based models, which are trained to recognize patterns in text, such as key phrases, for classification (Section 2.3).
• Capsule networks, which address the information loss problem suffered by the pooling operations of CNNs, and have recently been applied to text classification (Section 2.4).
• Models with an attention mechanism, which is effective for identifying correlated words in text and has become a useful tool in developing deep learning models (Section 2.5).
• Memory-augmented networks, which combine neural networks with a form of external memory that the models can read from and write to (Section 2.6).
• Transformers, which allow for much more parallelization than RNNs, making it possible to efficiently (pre-)train very large language models using GPU clusters (Section 2.7).
• Graph neural networks, which are designed to capture internal graph structures of natural language, such as syntactic and semantic parse trees (Section 2.8).
• Siamese neural networks, designed for text matching, a special case of text classification (Section 2.9).
• Hybrid models, which combine attention, RNNs, CNNs, etc. to capture local and global features of sentences and documents (Section 2.10).
• Finally, in Section 2.11, we review modeling technologies that go beyond supervised learning, including unsupervised learning using autoencoders and adversarial training, and reinforcement learning.

Readers are expected to be reasonably familiar with basic deep learning models to comprehend the content of this section. For more details on the basic deep learning architectures and models, we refer readers to the deep learning textbook by Goodfellow et al. [140], or to the appendix of this paper.

2.1 Feed-Forward Neural Networks

Feed-forward networks are among the simplest deep learning models for text representation. Yet, they have achieved high accuracy on many text classification benchmarks. These models view a text as a bag of words. For each word, they learn a vector representation using an embedding model such as word2vec [8] or GloVe [9], take the vector sum or average of the embeddings as the representation of the text, pass it through one or more feed-forward layers, known as Multi-Layer Perceptrons (MLPs), and then perform classification on the final layer’s representation using a classifier such as logistic regression, Naïve Bayes, or SVM [10]. An example of these models is the Deep Average Network (DAN) [10], whose architecture is shown in Fig. 1. Despite its simplicity, DAN outperforms other, more sophisticated models which are designed to explicitly learn the compositionality of texts. For example, DAN outperforms syntactic models on datasets with high syntactic variance. Joulin et al. [11] propose a simple and efficient text classifier called fastText. Like DAN, fastText views a text as a bag of words. Unlike DAN, fastText uses a bag of n-grams as additional features to capture local word order information. This turns out to be very efficient in practice while achieving results comparable to methods that explicitly use word order [12].

Fig. 1. The architecture of the Deep Average Network (DAN) [10].

Le and Mikolov [13] propose doc2vec, which uses an unsupervised algorithm to learn fixed-length feature representations of variable-length pieces of text, such as sentences, paragraphs, and documents. As shown in Fig. 2, the architecture of doc2vec is similar to that of the Continuous Bag of Words (CBOW) model [8, 14]. The only difference is the additional paragraph token that is mapped to a paragraph vector via matrix D. In doc2vec, the concatenation or average of this vector with a context of three words is used to predict the fourth word. The paragraph vector represents the missing information from the current context and can act as a memory of the topic of the paragraph. After being trained, the paragraph vector is used as features for the paragraph (e.g., in lieu of or in addition to BoW), and fed to a classifier for prediction. Doc2vec achieved new state-of-the-art results on several text classification and sentiment analysis tasks when it was published.

Fig. 2. The doc2vec model [13].
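To make the feed-forward recipe concrete, the following minimal PyTorch sketch averages word embeddings and passes them through an MLP in the spirit of DAN; the layer sizes, class count, and toy inputs are illustrative assumptions rather than the configuration used in [10].

```python
import torch
import torch.nn as nn

class DeepAveragingClassifier(nn.Module):
    """Bag-of-words classifier: average word embeddings, then an MLP (DAN-style sketch)."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=100, num_classes=2):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, token_ids, offsets):
        # token_ids: concatenated token indices of all texts in the batch;
        # offsets: starting position of each text within token_ids.
        avg_embeds = self.embedding(token_ids, offsets)  # one averaged vector per text
        return self.mlp(avg_embeds)                      # class logits

# Example: a batch of two "texts" (token ids are hypothetical).
model = DeepAveragingClassifier(vocab_size=10_000)
tokens = torch.tensor([3, 17, 256, 9, 42, 5])   # both texts concatenated
offsets = torch.tensor([0, 3])                  # text 1 = tokens[0:3], text 2 = tokens[3:]
logits = model(tokens, offsets)                 # shape: (2, num_classes)
```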
2.2 RNN-Based Models

RNN-based models view text as a sequence of words, and are intended to capture word dependencies and text structures for text classification. However, vanilla RNN models do not work well, and often underperform feed-forward neural networks. Among the many variants of RNNs, Long Short-Term Memory (LSTM) is the most popular architecture, designed to better capture long-term dependencies. LSTM addresses the gradient vanishing and exploding problems suffered by vanilla RNNs by introducing a memory cell to remember values over arbitrary time intervals, and three gates (input gate, output gate, forget gate) to regulate the flow of information into and out of the cell. There have been works on improving RNN and LSTM models for text classification by capturing richer information, such as tree structures of natural language, long-span word relations in text, document topics, and so on.

Tai et al. [15] developed the Tree-LSTM model, a generalization of LSTM to tree-structured network topologies, to learn rich semantic representations. The authors argue that Tree-LSTM is a better model than a chain-structured LSTM for NLP tasks because natural language exhibits syntactic properties that naturally combine words into phrases. They validate the effectiveness of Tree-LSTM on two tasks: sentiment classification and predicting the semantic relatedness of two sentences. The architectures of these models are shown in Fig. 3. Zhu et al. [16] also extend the chain-structured LSTM to tree structures, using a memory cell to store the history of multiple child cells or multiple descendant cells in a recursive process. They argue that the new model provides a principled way of considering long-distance interactions over hierarchies, e.g., language or image parse structures.

Fig. 3. (Left) A chain-structured LSTM network and (right) a tree-structured LSTM network with arbitrary branching factor [15].

To model long-span word relations for machine reading, Cheng et al. [17] augment the LSTM architecture with a memory network in place of a single memory cell. This enables adaptive memory usage during recurrence with neural attention, offering a way to weakly induce relations among tokens. The model achieves promising results on language modeling, sentiment analysis, and NLI.
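For reference, the following is a minimal PyTorch sketch of the kind of chain-structured LSTM classifier that these works extend; the layer sizes and the use of the final hidden state for classification are illustrative assumptions, not the setup of any particular paper cited above.

```python
import torch
import torch.nn as nn

class LSTMTextClassifier(nn.Module):
    """Embed a token sequence, run it through an LSTM, classify from the final hidden state."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor of word indices
        embeds = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embeds)      # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])              # class logits, (batch, num_classes)

# Example: a batch of 4 padded sequences of length 20 with hypothetical token ids.
model = LSTMTextClassifier(vocab_size=20_000)
batch = torch.randint(1, 20_000, (4, 20))
logits = model(batch)   # shape: (4, num_classes)
```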
The Multi-Timescale LSTM (MT-LSTM) neural network [18] is also designed to model long texts, such as sentences and documents, by capturing valuable information at different timescales. MT-LSTM partitions the hidden states of a standard LSTM model into several groups, and each group is activated and updated at a different time period. Thus, MT-LSTM can model very long documents. MT-LSTM has been reported to outperform a set of baselines, including LSTM- and RNN-based models, on text classification.

RNNs are good at capturing the local structure of a word sequence, but have difficulty remembering long-range dependencies. In contrast, latent topic models are able to capture the global semantic structure of a document but do not account for word ordering. Dieng et al. [19] propose the TopicRNN model to integrate the merits of RNNs and latent topic models. It captures local (syntactic) dependencies using RNNs and global (semantic) dependencies using latent topics. TopicRNN has been reported to outperform RNN baselines for sentiment analysis.

There are other interesting RNN-based models. Liu et al. [20] use multi-task learning to train RNNs that leverage labeled training data from multiple related tasks. Johnson and Zhang [21] explore a text region embedding method using LSTM. Zhou et al. [22] integrate a bidirectional LSTM (Bi-LSTM) model with two-dimensional max pooling to capture text features. Wang et al. [23] propose a bilateral multi-perspective matching model under the “matching-aggregation” framework. Wan et al. [24] explore semantic matching using multiple positional sentence representations generated by a bidirectional LSTM model.

2.3 CNN-Based Models

RNNs are trained to recognize patterns across time, whereas CNNs learn to recognize patterns across space [25]. RNNs work well for NLP tasks such as POS tagging or QA, where the comprehension of long-range semantics is required, while CNNs work well where detecting local and position-invariant patterns is important. These patterns could be key phrases that express a particular sentiment, such as “I like”, or a topic, such as “endangered species”. Thus, CNNs have become one of the most popular model architectures for text classification.

One of the first CNN-based models for text classification was proposed by Kalchbrenner et al. [26]. This model uses dynamic k-max pooling, and is called the Dynamic CNN (DCNN). As illustrated in Fig. 4, the first layer of DCNN constructs a sentence matrix using the embedding of each word in the sentence. Then a convolutional architecture that alternates wide convolutional layers with dynamic k-max pooling layers is used to generate a feature map over the sentence that is capable of explicitly capturing short- and long-range relations between words and phrases. The pooling parameter k can be chosen dynamically depending on the sentence size and the level in the convolution hierarchy.

Later, Kim [27] proposed a much simpler CNN-based model than DCNN for text classification. As shown in Fig. 5, Kim’s model uses only one layer of convolution on top of word vectors obtained from an unsupervised neural language model, i.e., word2vec.
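The k-max pooling operation that DCNN relies on keeps the k largest activations of each feature map while preserving their original order. Below is a small PyTorch sketch of this operation; the per-layer, length-dependent choice of k used in [26] is omitted, so this is only the static variant.

```python
import torch

def k_max_pooling(x: torch.Tensor, k: int, dim: int = -1) -> torch.Tensor:
    """Keep the k largest values along `dim`, preserving their original order."""
    top_idx = x.topk(k, dim=dim).indices        # positions of the k largest values
    ordered_idx = top_idx.sort(dim=dim).values  # restore left-to-right order
    return x.gather(dim, ordered_idx)

# Example: feature maps of shape (batch=1, channels=2, seq_len=6), keep k=3 per map.
feats = torch.tensor([[[0.1, 0.9, 0.3, 0.7, 0.2, 0.5],
                       [0.8, 0.1, 0.6, 0.2, 0.9, 0.4]]])
print(k_max_pooling(feats, k=3))
# tensor([[[0.9000, 0.7000, 0.5000],
#          [0.8000, 0.6000, 0.9000]]])
```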
Kim also compared four different approaches to learning word embeddings: (1) CNN-rand, where all word embeddings are randomly initialized and then modified during training; (2) CNN-static, where the pre-trained word2vec embeddings are used and stay fixed during model training; (3) CNN-non-static, where the word2vec embeddings are fine-tuned during training for each task; and (4) CNN-multi-channel, where two sets of word embedding vectors are used, both initialized with word2vec, with one updated during model training and the other kept fixed. These CNN-based models were reported to improve upon the state of the art on sentiment analysis and question classification.

There have been efforts to improve the architectures of the CNN-based models of [26, 27]. Liu et al. [28] propose a new CNN-based model that makes two modifications to the architecture of Kim-CNN [27]. First, a dynamic max-pooling scheme is adopted to capture more fine-grained features from different regions of the document. Second, a hidden bottleneck layer is inserted between the pooling and output layers to learn compact document representations, which reduces model size and boosts model performance.
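As a rough PyTorch sketch of a Kim-style one-convolution-layer classifier: several filter widths are applied over word embeddings, followed by max-over-time pooling and a linear layer. The filter widths, dimensions, and single-channel embedding here are illustrative assumptions rather than the exact configuration of [27].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KimStyleCNN(nn.Module):
    """One convolution layer with several filter widths, max-over-time pooling, linear classifier."""
    def __init__(self, vocab_size, embed_dim=300, num_filters=100,
                 filter_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=fs) for fs in filter_sizes])
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)               # (batch, embed_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]  # max over time
        return self.fc(torch.cat(pooled, dim=1))                    # class logits

model = KimStyleCNN(vocab_size=30_000)
logits = model(torch.randint(1, 30_000, (8, 50)))  # batch of 8 sequences of length 50
```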
Fig. 4. The architecture of the DCNN model [26].

Fig. 5. The architecture of a sample CNN model for text classification, courtesy of Yoon Kim [27].

In [29, 30], instead of using pre-trained low-dimensional word vectors as input to CNNs, the authors directly apply CNNs to high-dimensional text data to learn the embeddings of small text regions for classification.

Character-level CNNs have also been explored for text classification [31, 32]. One of the first such models was proposed by Zhang et al. [31]. As illustrated in Fig. 6, the model takes as input a fixed-size sequence of characters, encoded as one-hot vectors, and passes them through a deep CNN consisting of six convolutional layers with pooling operations and three fully connected layers. Prusa et al. [33] presented an approach to encoding text for CNNs that greatly reduces the memory consumption and training time required to learn character-level text representations. This approach scales well with alphabet size, allowing more information from the original text to be preserved and classification performance to be enhanced.

There are studies investigating the impact of word embeddings and CNN architectures on model performance. Inspired by VGG [34] and ResNets [35], Conneau et al. [36] presented a Very Deep CNN (VDCNN) model for text processing. It operates directly at the character level and uses only small convolutions and pooling operations. This study shows that the performance of VDCNN increases with depth. Duque et al. [37] modified the structure of VDCNN to fit the constraints of mobile platforms while maintaining performance, compressing the model size by 10x to 20x with an accuracy loss between 0.4% and 1.3%. Le et al. [38] showed that deep models indeed outperform shallow models when the text input is represented as a sequence of characters. However, a simple shallow-and-wide network outperforms deep models such as DenseNet [39] with word inputs.
Fig. 6. The architecture of a character-level CNN model [31].

Guo et al. [40] studied the impact of word embeddings and proposed to use weighted word embeddings via a multi-channel CNN model. Zhang et al. [41] examined the impact of different word embedding methods and pooling mechanisms, and found that using non-static word2vec and GloVe outperforms one-hot vectors, and that max-pooling consistently outperforms other pooling methods.

There are other interesting CNN-based models. Mou et al. [42] present a tree-based CNN to capture sentence-level semantics. Pang et al. [43] cast text matching as an image recognition task, and use multi-layer CNNs to identify salient n-gram patterns. Wang et al. [44] propose a CNN-based model that combines explicit and implicit representations of short text for classification. There is also a growing interest in applying CNNs to biomedical text classification [45–48].

2.4 Capsule Neural Networks

CNNs classify images or texts by using successive layers of convolutions and pooling. Although pooling operations identify salient features and reduce the computational complexity of convolution operations, they lose information about spatial relationships and are likely to misclassify entities based on their orientation or proportion. To address these problems of pooling, Hinton and colleagues proposed a new approach called capsule networks (CapsNets) [49, 50]. A capsule is a group of neurons whose activity vector represents different attributes of a specific type of entity, such as an object or an object part. The vector’s length represents the probability that the entity exists, and its orientation represents the attributes of the entity. Unlike the max-pooling of CNNs, which selects some information and discards the rest, capsule networks route each capsule in the lower layer to its best parent capsule in the upper layer, using all the information available in the network up to the final layer for classification. Routing can be implemented using different algorithms, such as dynamic routing-by-agreement [50] or the EM algorithm [51].

Recently, capsule networks have been applied to text classification, where capsules are adapted to represent a sentence or document as a vector. The works in [52–54] propose a text classification model based on a variant of CapsNets. The model consists of four layers: (1) an n-gram convolutional layer, (2) a capsule layer, (3) a convolutional capsule layer, and (4) a fully connected capsule layer. The authors experimented with three strategies to stabilize the dynamic routing process and to alleviate the disturbance of noisy capsules that contain background information, such as stop words or words unrelated to any document category. They also explored two capsule architectures, denoted Capsule-A and Capsule-B, as in Fig. 7. Capsule-A is similar to the CapsNet in [50]. Capsule-B uses three parallel networks with filters of different window sizes in the n-gram convolutional layer to learn a more comprehensive text representation; Capsule-B performs better in the experiments. The CapsNet-based model proposed by Kim et al. [55] uses a similar architecture.
The model consists of: (1) an input layer that takes a document as a sequence of word embeddings; (2) a convolutional layer that generates feature maps and uses a gated linear unit to retain spatial information; (3) a convolutional capsule layer that forms global features by aggregating local features detected by the convolutional layer; and (4) a text capsule layer to