A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios

Michael A. Hedderich*1, Lukas Lange*1,2, Heike Adel2, Jannik Strötgen2 & Dietrich Klakow1
1 Saarland University, Saarland Informatics Campus, Germany
2 Bosch Center for Artificial Intelligence, Germany
{mhedderich,dietrich.klakow}@lsv.uni-saarland.de
{lukas.lange,heike.adel,jannik.stroetgen}@de.bosch.com
* equal contribution

arXiv:2010.12309v1 [cs.CL] 23 Oct 2020

Abstract

Current developments in natural language processing offer challenges and opportunities for low-resource languages and domains. Deep neural networks are known for requiring large amounts of training data, which might not be available in resource-lean scenarios. However, there is also a growing body of work on improving performance in low-resource settings. Motivated by fundamental changes towards neural models and the currently popular pre-train and fine-tune paradigm, we give an overview of promising approaches for low-resource natural language processing. After a discussion of the definition of low-resource scenarios and the different dimensions of data availability, we examine methods that enable learning when training data is sparse. This includes mechanisms to create additional labeled data, such as data augmentation and distant supervision, as well as transfer learning settings that reduce the need for target supervision. The survey closes with a brief look into methods suggested in non-NLP machine learning communities, which might be beneficial for NLP in low-resource scenarios.

1 Introduction

Most of today's research in natural language processing (NLP) is concerned with the processing of around 10 to 20 high-resource languages with a special focus on English, and thus ignores thousands of languages with billions of speakers (Bender, 2019). The rise of data-hungry deep-learning systems increased the performance of NLP for high-resource languages, but the shortage of large-scale data in less-resourced languages makes their processing a challenging problem. Therefore, Ruder (2019) named NLP for low-resource scenarios one of the four biggest open problems in NLP nowadays. According to Ethnologue (Eberhard et al., 2019), more than 310 languages exist with at least one million L1-speakers each. Similarly, Wikipedia exists for 300 languages, an indicator for the use of digital infrastructure in these languages.1 But also for languages with fewer speakers, the availability of NLP tools could help to expand language usage and could possibly prevent the extinction of endangered languages. The Treaty on the Functioning of the European Union (TFEU) emphasizes keeping and disseminating the cultural and linguistic diversity of the Member States (Article 165(1) TFEU). Supporting technological developments for low-resource languages can help to increase the participation of the speakers' communities in a digital world. Moreover, low-resource settings do not only concern low-resource languages but also other scenarios, such as non-standard domains and tasks, for which only little training data is available.

The importance of low-resource scenarios and the significant changes in natural language processing in recent years have led to active research on resource-lean settings, and a wide variety of techniques have been proposed.
In this survey, we give a structured overview of the literature on low-resource NLP with a focus on recent works, and we also inspect open issues that are still encountered. There is a zoo of methods to tackle the challenges of low-resource NLP. However, which method is applicable and promising highly depends on the characteristics of the low-resource scenario. Thus, one key goal of this survey is to structure the techniques based on the underlying assumptions regarding the low-resource setup.

The paper starts with a discussion of the definition of the term "low-resource" along the three dimensions of labeled, unlabeled and further auxiliary data. An overview of the techniques discussed in the following sections is given in Table 1.

1 https://en.wikipedia.org/wiki/List_of_Wikipedias
Many methods can be used for multiple tasks and domains. Thus, when referring to language in this paper, this can also include the specific language used in a domain.

2 Related Work

In the past, surveys on specific methods or certain low-resource language families have been published. These are listed in Table 3 in the Appendix. As recent surveys on low-resource machine translation (Liu et al., 2019) and unsupervised domain adaptation (Ramponi and Plank, 2020) are already available, we do not investigate these topics further in this paper. Instead, our focus lies on general methods for low-resource, supervised natural language processing, including data augmentation, distant supervision and transfer learning. This is also in contrast to the task-specific survey by Magueresse et al. (2020), who review highly influential work for several extraction tasks but provide only little overview of recent approaches.

3 Defining "Low-Resource"

The umbrella term low-resource covers a spectrum of scenarios with varying resource conditions. It includes work on threatened languages, such as Yongning Na, a Sino-Tibetan language with 40k speakers and only 3k written, unlabeled sentences (Adams et al., 2017). But it also covers work on specialized domains or tasks in English, which is often treated as the most high-resource language.

Figure 1 exemplarily shows which NLP tasks have been addressed in six different languages, from basic tasks to higher-level tasks. While it is possible to build English NLP systems for many higher-level applications, low-resource languages lack the data foundation for this. Additionally, even if it is possible to create basic systems for tasks such as tokenization and named entity recognition for all tested low-resource languages, the training data is typically of lower quality compared to the English datasets, or very limited in size. The figure also shows that the four American and African languages with between 1.5 and 80 million speakers have been addressed less than Estonian, a language with 1 million speakers. This indicates the unused potential to reach millions of speakers who currently have no access to higher-level NLP applications.

Figure 1: Supported NLP tasks in different languages. Note that the figure does not incorporate data quality or system performance. More details on the selection of tasks and languages are given in Appendix Section A.

3.1 Dimensions of Resource Availability

Many techniques presented in the literature depend on certain assumptions about the low-resource scenario. These have to be adequately defined to evaluate their applicability for a specific setting and to avoid confusion when comparing different approaches. We propose to categorize low-resource settings along the following three dimensions: availability of (i) task-specific labels, (ii) unlabeled language text, and (iii) auxiliary data.

(i) The availability of task-specific labels in the target language (or target domain) is the most prominent dimension, as such labels are necessary for supervised learning. These labels are usually created through manual annotation, which can be both time- and cost-intensive. Not having access to adequate experts to perform the annotation can also be an issue for some languages and domains.

(ii) The availability of unlabeled language- or domain-specific text is another factor to take into consideration, especially as most modern NLP approaches are based on some form of input embeddings trained on unlabeled texts.
(iii) Most of the ideas surveyed in the next sections make use of auxiliary data, which can have many forms. Transfer learning might leverage task-specific labels in a different language or domain. Distant supervision utilizes external sources of information, such as knowledge bases or gazetteers. And some approaches require other NLP tools in the target language, like machine translation, to generate training data. It is essential to consider this, as results from one low-resource scenario might not be transferable to another one if the assumptions on the auxiliary data are broken.

3.2 How Low is Low-Resource?

On the dimension of task-specific labels, different thresholds are used to define low-resource, and some works use a certain guideline or principle.
Method | Requirements | Outcome
Data Augmentation (§ 4.1) | labeled data, heuristics* | additional labeled data
Distant Supervision (§ 4.2) | unlabeled data, heuristics* | additional labeled data
Cross-lingual projections (§ 4.3) | unlabeled data, labeled high-resource data, cross-lingual alignment | additional labeled data
Embeddings & Pretrained LMs (§ 5.1) | unlabeled data | better language representation
LM domain adaptation (§ 5.4) | existing LM, unlabeled domain data | domain-specific language representation
Multilingual LMs (§ 5.3) | unlabeled multilingual data | multilingual feature representation
Adversarial Discriminator (§ 6) | additional datasets | independent representations
Meta-Learning (§ 6) | multiple auxiliary tasks | better target task performance

Table 1: Overview of low-resource methods surveyed in this paper. * Heuristics are typically gathered manually.

Rotman and Reichart (2019) and Kann et al. (2020) study languages that have less than 10k labeled tokens in the Universal Dependencies project (Nivre et al., 2020), while Garrette and Baldridge (2013) limit the time of the annotators to 2 hours, resulting in up to 1-2k tokens. Loubser and Puttkammer (2020) report that most available datasets for South African languages have 40-60k labeled tokens.

The amount of necessary resources is also task-dependent. While the aforementioned corpus sizes might be sufficient to train a reasonable POS tagger, other tasks might increase the resource requirements. For text generation, Yang et al. (2019) frame their work as low-resource with 350k labeled training instances. Similar to the task, the resource requirements can also depend on the language. Plank et al. (2016) find that language families perform differently given the same amount of limited training data. Last but not least, the available resources also influence which approaches perform well. As shown for POS tagging (Plank et al., 2016) and text classification (Melamud et al., 2019), in very low-resource settings, non-neural methods outperform more modern approaches, while the latter need several hundred labeled instances. This makes evaluations that vary the availability of a resource, such as the amount of labeled data, particularly interesting (see also, e.g., Lauscher et al. (2020); Hedderich et al. (2020); Yan et al. (2020)). In this survey, we will not focus on a specific low-resource scenario but rather specify which kind of resources the authors assume.

4 Generating Additional Labeled Data

Faced with the lack of task-specific labels, a variety of approaches have been developed that find alternative forms of labeled data as replacements for gold-standard supervision. This is usually done through some form of expert insight in combination with automation. These labels tend to be of lower quality than their manually annotated counterparts and contain more errors or label noise. They are, however, easier to obtain, as the manual intervention is focused and limited to setting up the technique. This makes them suited for low-resource scenarios. We group these ideas into two main categories: data augmentation (which uses task-specific instances to create more of them) and distant supervision (which labels unlabeled data). Connected to this, there is active research on learning with such noisily or weakly labeled data, which tries to handle errors in the automatically created data and to better leverage the additional supervision.
4.1 Data Augmentation

In data augmentation, new labeled instances are created by modifying the features of existing instances with transformations that do not change the label of an instance. This is a popular approach in the computer vision community where, e.g., the classification of an image's content is invariant to rotations of the image. For text, this can be done by replacing words with equivalents from a collection of synonyms (Wei and Zou, 2019), with entities of the same type (Raiman and Miller, 2017; Dai and Adel, 2020) or with words that share the same morphology (Gulordava et al., 2018; Vania et al., 2019). This replacement can also be guided by a language model that takes context into consideration (Kobayashi, 2018; Fadaee et al., 2017).
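To make the word-replacement idea concrete, the following minimal Python sketch creates new labeled instances by swapping words with synonyms from a small hand-crafted lexicon. The lexicon, sentence and replacement probability are illustrative assumptions, not resources from the surveyed works.

```python
import random

# Hypothetical synonym lexicon; in practice this could come from a thesaurus
# such as WordNet or from embedding nearest neighbours.
SYNONYMS = {
    "movie": ["film", "picture"],
    "great": ["excellent", "wonderful"],
    "bad": ["poor", "terrible"],
}

def augment(tokens, label, n_new=2, p_replace=0.3, seed=0):
    """Create new labeled instances by replacing words with synonyms.

    The transformation is assumed to be label-preserving, which is the core
    assumption behind this kind of data augmentation.
    """
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_new):
        new_tokens = [
            rng.choice(SYNONYMS[tok]) if tok in SYNONYMS and rng.random() < p_replace else tok
            for tok in tokens
        ]
        augmented.append((new_tokens, label))  # the label is kept unchanged
    return augmented

# Example: one labeled sentence yields several (noisy) paraphrases.
print(augment(["a", "great", "movie"], label="positive"))
```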
Data augmentation can also be performed on sentence parts based on their syntactic structure. Şahin and Steedman (2018) and Vania et al. (2019) rotate parts of the dependency tree that allow such operations to obtain sentence variations. The former also simplify existing, labeled sentences by removing sentence parts. The subject-object relation within a sentence is inverted by Min et al. (2020). On the level of vector representations, Yasunaga et al. (2018) add small perturbations via adversarial learning to existing sentences, preventing overfitting in low-resource POS tagging. Cheng et al. (2020) show how virtual training sentences can be created using a generative model to interpolate between two given sentences.

Data augmentation can also be achieved by modifying instances so that the label changes. For grammar correction, Grundkiewicz et al. (2019) use correct sentences and apply a set of transformations that introduce errors. A third option is the generation of features based on the labels. This is a popular approach in machine translation where target sentences are back-translated into source sentences (Bojar and Tamchyna, 2011; Hoang et al., 2018). An important aspect here is that errors on the source side/features do not seem to have a large negative effect on the generated target text the model needs to predict. Back-translation is therefore also used in other text generation tasks like abstractive summarization (Parida and Motlicek, 2019) and table-to-text generation (Ma et al., 2019). For sentence classification, a label-dependent language model can be fine-tuned on a small labeled dataset to generate labeled training sentences (Kumar et al., 2020; Anaby-Tavor et al., 2020). Kafle et al. (2017) train a model to create new questions for visual question-answering.

While data augmentation is ubiquitous in the computer vision community, it has not found such widespread use in natural language processing and is often evaluated only on selected tasks, especially compared to the feature representation approaches of Section 5. A reason might be that the presented techniques are often task-specific, requiring either hand-crafted systems or the training of task-dependent language models. Data augmentation that can be applied across tasks and languages is an interesting aim for future research.

4.2 Distant Supervision

In contrast to data augmentation, distant or weak supervision uses unlabeled data and keeps it unmodified. The corresponding labels are obtained through a (semi-)automatic process from an external source of information. For named entity recognition (NER), a list of location names might be obtained from a dictionary, and matches of tokens in the text with entities in the list are automatically labeled as locations. Distant supervision was introduced by Mintz et al. (2009) for relation extraction (RE) with extensions on multi-instance (Riedel et al., 2010) and multi-label learning (Surdeanu et al., 2012). It is still a popular approach for information extraction tasks like NER and RE where the external information can be obtained from knowledge bases, gazetteers, dictionaries and other forms of structured knowledge sources (Luo et al., 2017; Hedderich and Klakow, 2018; Deng and Sun, 2019; Alt et al., 2019; Ye et al., 2019; Lange et al., 2019a; Nooralahzadeh et al., 2019; Le and Titov, 2019; Cao et al., 2019; Lison et al., 2020). The automatic annotation can range from simple string matching (Yang et al., 2018) to complex pipelines including classifiers and manual steps (Norman et al., 2019).
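As a concrete illustration of such automatic annotation, the following minimal sketch labels tokens by string matching against a toy location gazetteer. The gazetteer and sentence are hypothetical placeholders; real pipelines typically add multi-token matching and further heuristics.

```python
# Minimal sketch of gazetteer-based distant supervision for NER:
# tokens that match an entry in a location list receive a LOC label,
# everything else is labeled O.
GAZETTEER_LOC = {"berlin", "lagos", "tallinn"}

def distant_ner_labels(tokens):
    """Automatically annotate tokens by string matching against a gazetteer."""
    return [
        "B-LOC" if tok.lower() in GAZETTEER_LOC else "O"
        for tok in tokens
    ]

tokens = ["She", "flew", "from", "Lagos", "to", "Tallinn", "."]
print(list(zip(tokens, distant_ner_labels(tokens))))
# Matching errors (e.g., ambiguous or missing names) are exactly the kind of
# label noise discussed in Section 4.4.
```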
Distant supervision using information from external knowledge sources can be seen as a subset of the more general approach of labeling rules. These also encompass other ideas like regular-expression rules or simple programming functions (Ratner et al., 2017; Zheng et al., 2019; Adelani et al., 2020; Lison et al., 2020). When deciding on the complexity of the distant supervision technique, the required effort and the necessary label quality need to be taken into account.

Instead of using information from external knowledge sources to generate labels, it can also be used as additional features, e.g., for NER in low-resource languages (Plank and Agić, 2018; Rijhwani et al., 2020). As an alternative to an automatic annotation process, annotations might also be provided by non-experts. Similar to distant supervision, this results in a trade-off between label quality and availability. For instance, Garrette and Baldridge (2013) obtain labeled data from non-native speakers and without quality control on the manual annotations. This can be taken even further by employing annotators who do not speak the low-resource language (Mayhew and Roth, 2018; Mayhew et al., 2019; Tsygankova et al., 2020).

While distant supervision is popular for information extraction tasks like NER or RE, it is less prevalent in other areas of NLP. Nevertheless, distant supervision has also been successfully employed for other tasks by proposing new ways for automatic annotation.
Li et al. (2012) leverage a knowledge base for POS tagging. Wang et al. (2019) use context by transferring a document-level sentiment label to the sentence level. Huber and Carenini (2019) build a discourse-structure dataset using guidance from a sentiment classifier. For topic classification, Bach et al. (2019) use heuristics and inputs from other classifiers like NER. Distant supervision is also popular in other fields like image classification (Xiao et al., 2015; Li et al., 2017; Lee et al., 2018; Mahajan et al., 2018; Li et al., 2020). This suggests that distant supervision could be leveraged for more NLP tasks in the future.

Distant supervision methods heavily rely on auxiliary data. In a low-resource setting, it might be difficult to obtain not only labeled data but also such auxiliary data. Kann et al. (2020) find a large gap between the performance on high-resource and low-resource languages for POS tagging, pointing to the lack of high-coverage and error-free dictionaries for the weak supervision in low-resource languages. This emphasizes the need to evaluate such methods in a realistic setting instead of just simulating restricted access to labeled data in a high-resource language. A discussion of approaches to leverage flawed distant supervision is given in Section 4.4.

While distant supervision allows obtaining labeled data more quickly than manually annotating every instance of a dataset, it still requires human interaction to create automatic annotation techniques or to provide labeling rules. This time and effort could also be spent on annotating more gold-label data, either naively or through an active learning scheme. Unfortunately, distant supervision papers seldom provide information on how long the creation took, making it difficult to compare these approaches. Taking the human expert into focus connects this research direction with human-computer interaction and human-in-the-loop setups like the one by Klie et al. (2018).

4.3 Cross-Lingual Projections

For cross-lingual projections, a task-specific classifier is trained in a high-resource language. Using parallel corpora, the unlabeled low-resource data is then aligned to its equivalent in the high-resource language where labels can be obtained using the aforementioned classifier. These labels (on the high-resource text) can then be projected back to the text in the low-resource language based on the alignment between tokens in the parallel texts (Yarowsky et al., 2001). This approach can, therefore, be seen as a form of distant supervision specific to obtaining labeled data for low-resource languages. Cross-lingual projections have been applied in low-resource settings for tasks such as POS tagging and parsing (Täckström et al., 2013; Wisniewski et al., 2014; Plank and Agić, 2018). Instead of using parallel corpora, existing high-resource labeled datasets can also be machine-translated into the low-resource language (Khalil et al., 2019; Zhang et al., 2019a; Fei et al., 2020). These cross-lingual projections set high requirements on the auxiliary data, requiring both labels in a high-resource language and means to project them into a low-resource language. Especially the latter might be an issue, as machine translation itself might be problematic for a specific low-resource language.
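The projection step itself can be sketched as follows, assuming that a high-resource tagger and a word alignment are already available. The sentences, tags and alignment here are toy examples, not output of an actual aligner.

```python
# Minimal sketch of cross-lingual label projection: POS tags predicted on the
# high-resource side of a parallel sentence are copied to the aligned tokens
# on the low-resource side. The alignment is assumed to come from an external
# word aligner and is given as (source_index, target_index) pairs.
def project_labels(src_labels, alignment, tgt_len, default="X"):
    tgt_labels = [default] * tgt_len
    for src_i, tgt_i in alignment:
        tgt_labels[tgt_i] = src_labels[src_i]
    return tgt_labels

src_tokens = ["the", "house", "is", "red"]
src_labels = ["DET", "NOUN", "VERB", "ADJ"]      # from a high-resource tagger
tgt_tokens = ["das", "Haus", "ist", "rot"]       # unlabeled target sentence
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]     # from an automatic aligner

print(list(zip(tgt_tokens, project_labels(src_labels, alignment, len(tgt_tokens)))))
```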
Sources for parallel text can be the OPUS project (Tiedemann, 2012), Bible corpora (Mayer and Cysouw, 2014; Christodoulopoulos and Steedman, 2015) or the recently presented JW300 corpus (Agić and Vulić, 2019) with parallel texts for over 300 languages. A limitation is the restricted domain of these datasets, such as political proceedings or religious texts. Mayhew et al. (2017) propose to use a simpler, word- and lexicon-based translation instead, and Fang and Cohn (2017) present a system based on bilingual dictionaries.

4.4 Learning with Noisy Labels

Distantly supervised labels might be quicker and cheaper to obtain than manual annotations, but they also tend to contain more errors. Even though more training data is available, training directly on this noisily-labeled data can actually hurt the performance. Therefore, many recent approaches use a noise handling method to diminish the negative effects of distant supervision. We categorize these into two ideas: noise filtering and noise modeling.

Noise filtering methods remove instances from the training data that have a high probability of being incorrectly labeled. This often includes training a classifier to make the filtering decision. The filtering can remove the instances completely from the training data, e.g., through a probability threshold (Jia et al., 2019), a binary classifier (Adel and Schütze, 2015; Onoe and Durrett, 2019; Huang and Du, 2019) or the use of a reinforcement-based agent (Yang et al., 2018; Nooralahzadeh et al., 2019).
Alternatively, a soft filtering might be applied that re-weights instances according to their probability of being correctly labeled (Le and Titov, 2019) or according to an attention measure (Hu et al., 2019).

The noise in the labels can also be modeled. A common model is a confusion matrix estimating the relationship between clean and noisy labels (Fang and Cohn, 2016; Luo et al., 2017; Hedderich and Klakow, 2018; Paul et al., 2019; Lange et al., 2019a,b; Chen et al., 2019; Wang et al., 2019). The classifier is no longer trained directly on the noisily-labeled data. Instead, a noise model is appended which shifts the noisy to the (unseen) clean label distribution. This can be interpreted as the original classifier being trained on a "cleaned" version of the noisy labels. In Ye et al. (2019), the prediction is shifted from the noisy to the clean distribution during testing. In Chen et al. (2020), a group of reinforcement agents relabels noisy instances. Rehbein and Ruppenhofer (2017) and Lison et al. (2020) leverage several sources of distant supervision and learn how to combine them.

Especially for NER, the noise in distantly supervised labels tends to consist of false negative errors, i.e., entities that were not annotated. Partial annotation learning (Yang et al., 2018; Nooralahzadeh et al., 2019; Cao et al., 2019) takes this into account explicitly. Related approaches learn latent variables (Jie et al., 2019), use constrained binary learning (Mayhew et al., 2019) or construct a loss under the assumption that only unlabeled positive instances are available (Peng et al., 2019).
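The confusion-matrix idea described above can be sketched as follows in PyTorch: a noise layer is appended to an arbitrary base classifier, training uses the noisy labels through this layer, and prediction uses the clean branch. The single global confusion matrix, its initialization and the dimensions are illustrative assumptions rather than the exact setup of any of the cited works.

```python
import torch
import torch.nn as nn

class NoisyChannelClassifier(nn.Module):
    """Base classifier plus a confusion-matrix noise layer.

    The base model predicts a distribution over (unseen) clean labels; the
    noise layer, a row-stochastic matrix, maps it to the distribution of the
    noisy, distantly supervised labels that are actually observed.
    """

    def __init__(self, base_model, num_labels):
        super().__init__()
        self.base_model = base_model
        # Initialized close to the identity: labels are assumed mostly correct.
        self.noise_logits = nn.Parameter(torch.eye(num_labels) * 5.0)

    def forward(self, x):
        clean_probs = torch.softmax(self.base_model(x), dim=-1)
        noise_matrix = torch.softmax(self.noise_logits, dim=-1)  # rows sum to 1
        noisy_probs = clean_probs @ noise_matrix
        return clean_probs, noisy_probs

# Training uses the noisy branch; prediction at test time uses the clean branch.
base = nn.Linear(100, 5)                  # stand-in for any encoder + classifier
model = NoisyChannelClassifier(base, num_labels=5)
x = torch.randn(8, 100)                   # toy feature vectors
noisy_y = torch.randint(0, 5, (8,))       # distantly supervised labels
clean_probs, noisy_probs = model(x)
loss = nn.functional.nll_loss(torch.log(noisy_probs + 1e-8), noisy_y)
loss.backward()
```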
5 Transfer Learning

While distant supervision and data augmentation generate and extend task-specific training data, transfer learning reduces the need for labeled target data by transferring learned representations and models. This has two aspects: (i) Unlabeled data can be used to obtain pre-trained language representations. A strong focus in recent works on transfer learning in NLP lies in the use of pre-trained language representations like BERT (Devlin et al., 2019) as well as the training of domain-specific or multilingual representations. (ii) Auxiliary data can be used to train and transfer models from related tasks in the same language, or from the same (or a similar) task in other domains or languages.

5.1 Pre-trained Language Representations

Feature vectors are the core input component of many neural network-based models for NLP tasks: numerical representations of words or sentences, as neural architectures do not allow the processing of strings and characters as such. Collobert et al. (2011) showed that training these models for the task of language modeling on a large-scale corpus results in high-quality word representations that can be reused for other downstream tasks as well. Thus, pre-trained language representations have become essential components of neural NLP models, as these models can leverage information learned from large-scale, unlabeled text sources. Starting with the work by Collobert et al. (2011) and Mikolov et al. (2013a), word embeddings are pre-trained on large background corpora. Subword-based embeddings such as fastText n-gram embeddings (Bojanowski et al., 2017) and byte-pair-encoding embeddings (Heinzerling and Strube, 2018) addressed out-of-vocabulary issues and enabled the processing of texts in many languages, including multiple low-resource languages, as pre-trained embeddings were published for more than 270 languages covered in Wikipedia for both embedding methods. Zhu et al. (2019) showed that embeddings using subword information, e.g., n-grams or byte-pair encodings, are beneficial for low-resource sequence labeling tasks, such as named entity recognition and typing, and outperform word-level embeddings. Jungmaier et al. (2020) added smoothing to word2vec models to correct their bias towards rare words and achieved improvements in particular for low-resource settings. However, data quality for low-resource languages, even for unlabeled data, might not be comparable to data from high-resource languages. Alabi et al. (2020) found that word embeddings trained on massive amounts of unlabeled data from low-resource languages are not competitive with embeddings trained on smaller, but curated data sources.

More recently, a trend emerged of pre-training large embedding models using a language model objective to create context-aware word representations. Starting with the recurrent ELMo model (Peters et al., 2018), the trend evolved towards pre-trained transformer models (Vaswani et al., 2017), such as BERT (Devlin et al., 2019), GPT2 (Radford et al., 2019) or RoBERTa (Liu et al., 2019b). These models generate contextualized representations by leveraging multiple words of a sequence. By pre-training on a large-scale dataset of unlabeled texts, these models can overcome a lack of labeled data by using prior learned knowledge, enabling usage in a low-resource scenario (Cruz and Cheng, 2019).
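The following minimal sketch shows how such a pre-trained model can be reused in a low-resource setting, here via the Hugging Face transformers library as a feature extractor. The model name and example sentence are placeholders, and fully fine-tuning the encoder is the more common alternative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

sentence = "Tere hommikust!"  # Estonian example sentence
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# One contextual vector per subword token, usable as input features for a
# small task-specific classifier trained on the scarce labeled data.
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)  # (1, num_subword_tokens, hidden_size)
```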
While pre-trained language models achieve significant performance increases, it is still questionable whether these methods are suited for real-world low-resource scenarios. For example, all of these models have large hardware requirements, in particular considering that transformer model sizes keep increasing to boost performance (Raffel et al., 2019). Therefore, these large-scale methods might not be suited for low-resource scenarios where hardware is also low-resource. van Biljon et al. (2020) showed that low- to medium-depth transformer sizes are most beneficial for low-resource languages, and Melamud et al. (2019) showed that simple bag-of-words approaches are better for text classification when there are only a few dozen training instances or less, while more complex transformer models require more training data.

5.2 Multilingual Transfer

Low-resource languages can also benefit from labeled resources available in other high-resource languages. Most often, neural models for cross- and multilingual processing are transferred by creating a common multilingual embedding space for multiple languages. This can be done by, e.g., computing a mapping between two different embedding spaces such that the words in both embeddings share similar feature vectors after the mapping, the so-called alignment (Mikolov et al., 2013b; Joulin et al., 2018; Ruder et al., 2019). For example, monolingual embeddings can be aligned to the same space. Then, the original embeddings can be replaced with different, but aligned embeddings. Assuming a perfect alignment, the model would still perform well on the new, unseen language, as the feature vectors point in known directions. Smith et al. (2017) proposed to align embeddings by creating transformation matrices based on bilingual dictionaries to map different embeddings to a common space. Zhang et al. (2019b) created bilingual representations in the form of cross-lingual word embeddings using a small set of parallel sentences for low-resource document retrieval.
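The following minimal sketch computes such a transformation matrix with the orthogonal Procrustes solution that is widely used for embedding alignment. The embedding matrices stand in for the vectors of bilingual dictionary word pairs and are random placeholders here.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 100, 500
X = rng.normal(size=(n_pairs, dim))  # source-language vectors of dictionary entries
Y = rng.normal(size=(n_pairs, dim))  # corresponding target-language vectors

# W = argmin ||XW - Y||_F subject to W being orthogonal (Procrustes solution).
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# Any source-language vector can now be mapped into the target space and
# compared against target-language vectors, e.g., by cosine similarity.
mapped = X @ W
```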
5.3 Multilingual Language Models

Unsupervised pre-training enabled another fruitful research direction: the training of highly multilingual representations in a single model, such as multilingual BERT (Devlin et al., 2019) or XLM-RoBERTa (Conneau et al., 2020). These models are trained using unlabeled, monolingual corpora from different languages and can be used in cross- and multilingual settings, due to the many languages seen during pre-training. In cross-lingual zero-shot learning, no task-specific labeled data is available in the low-resource target language. Instead, labeled data from a high-resource language is leveraged. Hu et al. (2020) showed, however, that there is still a large gap between low- and high-resource settings. Lauscher et al. (2020) and Hedderich et al. (2020) proposed additionally adding a minimal amount of target-task and -language data (in the range of 10 to 100 labeled sentences), which resulted in a significant boost in performance.

However, it was recently questioned whether these models are truly multilingual, and evidence was found that language-specific information is stored in at least the upper transformer layers, with more general, language-independent representations in the lower layers (Pires et al., 2019; Singh et al., 2019; Libovický et al., 2020). Nonetheless, these models are state of the art in cross-lingual transfer (Wu and Dredze, 2019), including transfer to unseen low-resource languages (K et al., 2020; Liu et al., 2020). Further, it was shown that transformer models can also benefit from alignment, either similar to traditional non-contextualized embeddings by projecting to the same space (Wu et al., 2019), or by aligning the languages inside a single multilingual model, i.a., in cross-lingual (Schuster et al., 2019; Liu et al., 2019a) or multilingual settings (Cao et al., 2020).

While these models are a tremendous step towards enabling NLP in many languages, possible claims that these are universal language models do not hold. For example, mBERT covers 104 and XLM-R 100 languages, which is a third of all languages in Wikipedia as outlined earlier. A notable exception are the multilingual FLAIR embeddings that were trained on more than 380 languages from the JW300 and OPUS corpora, but these are typically used within the limited scope of sequence labeling compared to the broad usage of BERT models. Further, Wu and Dredze (2020) showed that, in particular, low-resource languages are not well-represented in mBERT. Figure 2 shows which language families with at least 1 million speakers are covered by mBERT and XLM-RoBERTa.2 In particular, African and American languages are not well-represented within the transformer models, even though millions of people speak these languages. This can be problematic, as languages from more distant language families are less suited for transfer learning, as Lauscher et al. (2020) showed.

2 A language family is covered if at least one associated language is covered. Language families can belong to multiple regions, e.g., Indo-European belongs to Europe+Asia.

Figure 2: Language families with more than 1 million speakers covered by multilingual transformer models.

5.4 Domain-Specific Pretraining

Pre-trained transformer models have also outperformed previous models on a wide range of domains. For example, Chronopoulou et al. (2019) and Dirkson and Verberne (2019) transferred pre-trained language models to Twitter texts. Even though the respective models were pre-trained on general-domain data, such as texts from the news or web domain, the models can be successfully transferred to texts from unseen domains, such as the clinical domain (Sun and Yang, 2019) or the materials science domain (Friedrich et al., 2020). Lin et al. (2020) found that the plain BERT model performed best in their setup for clinical negation detection compared to several models fine-tuned on the clinical domain. However, fine-tuning or pre-training with domain-specific data is considered beneficial in most situations. This is also displayed in the number of domain-adapted BERT models (Alsentzer et al., 2019; Huang et al., 2019; Adhikari et al., 2019; Lee and Hsiang, 2020; Jain and Ganesamoorty, 2020, i.a.), most notably BioBERT (Lee et al., 2020), which was pre-trained on biomedical PubMed articles, and SciBERT (Beltagy et al., 2019) for scientific texts.

Xu et al. (2020) used in- and out-of-domain data to pre-train a domain-specific model and adapt it to low-resource domains. Powerful representations can be achieved with a combination of high-resource embeddings and low-resource embeddings from the target domain (Akbik et al., 2018). Gururangan et al. (2020) showed that continuing the training of an already pre-trained model with additional domain-adaptive and task-adaptive pre-training on unlabeled data leads to performance gains for both high- and low-resource settings for numerous English domains and tasks. Aharoni and Goldberg (2020) found domain-specific clusters in pre-trained language models and showed how these could be exploited for data selection in domain-sensitive training.
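The following sketch outlines domain-adaptive pre-training in the spirit of continued masked language modeling on unlabeled in-domain text, using the Hugging Face libraries. The corpus path, model name and hyperparameters are illustrative assumptions; the adapted checkpoint would subsequently be fine-tuned on the scarce task-specific labels.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Unlabeled, domain-specific raw text, one sentence or document per line
# (hypothetical file path).
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# Masked language modeling objective on the in-domain text.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="dapt-model", num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"],
                  data_collator=collator)
trainer.train()
trainer.save_model("dapt-model")  # then fine-tune this checkpoint on the target task
```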
6 Other Machine Learning Approaches

Training on a limited amount of data is not unique to natural language processing. Other areas, like general machine learning and computer vision, can be a useful source of insights and new ideas. One of these is meta-learning (Finn et al., 2017), which is based on multi-task learning. Given a set of auxiliary high-resource tasks and a low-resource target task, meta-learning trains a model to decide how to use the auxiliary tasks in the most beneficial way for the target task. Yu et al. (2018) evaluated such a concept on sentiment and user intent classification. Dou et al. (2019) applied this approach to natural language understanding tasks from the GLUE benchmark. Instead of using a set of tasks, Rahimi et al. (2019) built an ensemble of language-specific NER classifiers which are then weighted depending on the zero- or few-shot target language.

Differences in the features between the pre-training and the target domain can be an issue in transfer learning, especially in neural approaches where it can be difficult to control which information the model takes into account. Adversarial discriminators (Goodfellow et al., 2014) can prevent the model from learning a feature representation that is specific to a data source. Gui et al. (2017), Liu et al. (2017), Kasai et al. (2019), Grießhaber et al. (2020) and Zhou et al. (2019) learned domain-independent representations using adversarial training. Kim et al. (2017), Chen et al. (2018) and Lange et al. (2020) worked with language-independent representations for cross-lingual transfer. These examples show the beneficial exchange of ideas between NLP and the machine learning community.
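To make the adversarial discriminator idea more concrete, the following PyTorch sketch combines a task classifier with a domain discriminator trained through a gradient-reversal layer, so that the shared encoder is pushed towards domain-independent features. The dimensions, data and weighting factor are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass, negated (scaled) gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU())
task_classifier = nn.Linear(128, 5)        # e.g., sentiment or NER labels
domain_discriminator = nn.Linear(128, 2)   # source vs. target domain

x = torch.randn(16, 300)                   # toy feature vectors
task_y = torch.randint(0, 5, (16,))        # in practice only available for the source domain
domain_y = torch.randint(0, 2, (16,))      # data-source labels are available for all examples

features = encoder(x)
task_loss = nn.functional.cross_entropy(task_classifier(features), task_y)
domain_logits = domain_discriminator(GradientReversal.apply(features, 1.0))
domain_loss = nn.functional.cross_entropy(domain_logits, domain_y)
(task_loss + domain_loss).backward()       # encoder receives reversed domain gradients
```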
7 Conclusion

In this survey, we gave a structured overview of recent work in the field of low-resource natural language processing. We showed that it is essential to analyze resource-lean scenarios across the different dimensions of data availability. This can reveal which techniques are expected to be applicable and helpful in a specific low-resource setting. We hope that our discussions of open issues for the different approaches can serve as inspiration for future work.