A Survey on Recent Approaches for Natural Language Processing in
Low-Resource Scenarios
Michael A. Hedderich*1, Lukas Lange*1,2, Heike Adel2,
Jannik Strötgen2 & Dietrich Klakow1
1Saarland University, Saarland Informatics Campus, Germany
2Bosch Center for Artificial Intelligence, Germany
{mhedderich,dietrich.klakow}@lsv.uni-saarland.de
{lukas.lange,heike.adel,jannik.stroetgen}@de.bosch.com
* Equal contribution
arXiv:2010.12309v1 [cs.CL] 23 Oct 2020
Abstract
Current developments in natural language pro-
cessing offer challenges and opportunities for
low-resource languages and domains. Deep
neural networks are known for requiring large
amounts of training data which might not be
available in resource-lean scenarios. How-
ever, there is also a growing body of work to improve the performance in low-resource set-
tings. Motivated by fundamental changes to-
wards neural models and the currently popu-
lar pre-train and fine-tune paradigm, we give
an overview of promising approaches for low-
resource natural language processing. After a
discussion about the definition of low-resource
scenarios and the different dimensions of data
availability, we then examine methods that
enable learning when training data is sparse.
This includes mechanisms to create additional
labeled data like data augmentation and dis-
tant supervision as well as transfer learning
settings that reduce the need for target super-
vision. The survey closes with a brief look
into methods suggested in non-NLP machine
learning communities, which might be benefi-
cial for NLP in low-resource scenarios.
1 Introduction
Most of today’s research in natural language pro-
cessing (NLP) is concerned with the processing of
around 10 to 20 high-resource languages with a
special focus on English, and thus, ignores thou-
sands of languages with billions of speakers (Ben-
der, 2019). The rise of data-hungry deep-learning
systems increased the performance of NLP for high
resource-languages, but the shortage of large-scale
data in less-resourced languages makes the pro-
cessing of them a challenging problem. Therefore,
Ruder (2019) named NLP for low-resource scenar-
ios one of the four biggest open problems in NLP
nowadays. According to Ethnologue (Eberhard
et al., 2019), more than 310 languages exist with
at least one million L1-speakers each. Similarly,
Wikipedia exists for 300 languages, an indicator for
the use of digital infrastructure in these languages.1
But also for languages with fewer speakers, the availability of NLP tools could help to expand the use of the language and could possibly prevent the extinction of endangered languages. The Treaty on the Functioning of the European Union (TFEU) emphasizes the preservation and dissemination of the cultural and linguistic diversity of the Member States (Article 165(1) TFEU). Supporting technological develop-
ments for low-resource languages can help to in-
crease participation of the speakers’ communities
in a digital world. Also, low-resource settings do
not only concern low-resource languages but also
other scenarios, such as non-standard domains and
tasks, for which only little training data is available.
The importance of low-resource scenarios and the significant changes in natural language processing in recent years have led to active research on resource-lean settings, and a wide variety of techniques have been proposed.
In this survey, we give a structured overview of the literature on low-resource NLP with a focus on recent works, while also inspecting open issues that remain. There is a zoo of methods to tackle the
challenges of low-resource NLP. However, which
method is applicable and promising highly depends
on the characteristics of the low-resource scenario.
Thus, one key goal of this survey is to structure the
techniques based on the underlying assumptions
regarding the low-resource setup.
The paper starts with a discussion on the defi-
nition of the term “low-resource” along the three
dimensions of labeled, unlabeled and further auxil-
iary data. An overview of the techniques discussed in the following sections is given in Table 1. Many methods can be used for multiple tasks and domains. Thus, when referring to language in this paper, this can also include the specific language used in a domain.

1 https://en.wikipedia.org/wiki/List_of_Wikipedias
2 Related Work
In the past, surveys on specific methods or cer-
tain low-resource language families have been pub-
lished. These are listed in Table 3 in the Appendix.
As recent surveys on low-resource machine trans-
lation (Liu et al., 2019) and unsupervised domain
adaptation (Ramponi and Plank, 2020) are already
available, we do not investigate them further in
this paper. Instead, our focus lies on general meth-
ods for low-resource, supervised natural language
processing including data augmentation, distant
supervision and transfer learning. This is also in
contrast to the task-specific survey by Magueresse
et al. (2020) who review highly influential work
for several extraction tasks, but provide only a limited overview of recent approaches.
3 Defining “Low-Resource”
The umbrella term low-resource covers a spectrum
of scenarios with varying resource conditions. It
includes work on threatened languages, such as
Yongning Na, a Sino-Tibetan language with 40k
speakers and only 3k written, unlabeled sentences
(Adams et al., 2017). But, it also covers work on
specialized domains or tasks in English, which is
often treated as the most high-resource language.
Figure 1 shows, as an example, which NLP tasks have been addressed in six different languages, from basic to higher-level tasks. While it is possible
to build English NLP systems for many higher-
level applications, low-resource languages lack the
data foundation for this. Additionally, even if it
is possible to create basic systems for tasks, such
as tokenization and named entity recognition, for
all tested low-resource languages, the training data
is typically of lower quality compared to the En-
glish datasets, or very limited in size. The figure also shows
that the four American and African languages with
between 1.5 and 80 million speakers have been
addressed less than the Estonian language, with
1 million speakers. This indicates the unused po-
tential to reach millions of speakers who currently
have no access to higher-level NLP applications.

Figure 1: Supported NLP tasks in different languages. Note that the figure does not incorporate data quality or system performance. More details on the selection of tasks and languages are given in Appendix Section A.

3.1 Dimensions of Resource Availability

Many techniques presented in the literature depend on certain assumptions about the low-resource scenario. These have to be adequately defined to
evaluate their applicability for a specific setting
and to avoid confusion when comparing differ-
ent approaches. We propose to categorize low-
resource settings along the following three dimen-
sions: availability of (i) task-specific labels, (ii)
unlabeled language text, and (iii) auxiliary data:
(i) The availability of task-specific labels in
the target language (or target domain) is the most
prominent dimension being necessary for super-
vised learning. These labels are usually created
through manual annotation, which can be both time-
and cost-intensive. Not having access to adequate
experts to perform the annotation can also be an
issue for some languages and domains.
(ii) The availability of unlabeled language- or
domain-specific text is another factor to take into
consideration, especially as most modern NLP ap-
proaches are based on some form of input embed-
dings trained on unlabeled texts.
(iii) Most of the ideas surveyed in the next sec-
tions make use of auxiliary data which can have
many forms. Transfer learning might leverage task-
specific labels in a different language or domain.
Distant supervision utilizes external sources of in-
formation, such as knowledge bases or gazetteers.
And some approaches require other NLP tools in
the target language like machine translation to gen-
erate training data. It is essential to consider this as
results from one low-resource scenario might not
be transferable to another one if the assumptions
on the auxiliary data are broken.
3.2 How Low is Low-Resource?
On the dimension of task-specific labels, differ-
ent thresholds are used to define low-resource and
some works use a certain guideline or principle.
Rotman and Reichart (2019) and Kann et al. (2020)
study languages that have less than 10k labeled tokens in the Universal Dependencies project (Nivre et al., 2020), while Garrette and Baldridge (2013) limit the time of the annotators to 2 hours, resulting in up to 1-2k tokens. Loubser and Puttkammer (2020) report that most available datasets for South African languages have 40-60k labeled tokens.

Method | Requirements | Outcome
Data Augmentation (§ 4.1) | labeled data, heuristics* | additional labeled data
Distant Supervision (§ 4.2) | unlabeled data, heuristics* | additional labeled data
Cross-lingual projections (§ 4.3) | unlabeled data, labeled high-resource data, cross-lingual alignment | additional labeled data
Embeddings & Pretrained LMs (§ 5.1) | unlabeled data | better language representation
LM domain adaptation (§ 5.4) | existing LM, unlabeled domain data | domain-specific language representation
Multilingual LMs (§ 5.3) | multilingual unlabeled data | multilingual feature representation
Adversarial Discriminator (§ 6) | additional datasets | independent representations
Meta-Learning (§ 6) | multiple auxiliary tasks | better target task performance

Table 1: Overview of low-resource methods surveyed in this paper. * Heuristics are typically gathered manually.

The amount of necessary resources is also task-
dependent. While the aforementioned corpus sizes
might be sufficient to train a reasonable POS tagger,
other tasks may have higher resource require-
ments. For text generation, Yang et al. (2019) frame
their work as low-resource with 350k labeled train-
ing instances. Similar to the task, the resource re-
quirements can also depend on the language. Plank
et al. (2016) find that language families perform
differently given the same amount of limited train-
ing data. Last but not least, the available resources
also influence which approaches perform well. As
shown for POS tagging (Plank et al., 2016) and
text classification (Melamud et al., 2019), in very
low-resource settings, non-neural methods outper-
form more modern approaches while the latter need
several hundred labeled instances. This makes evaluations that vary the availability of a resource, such as the amount of labeled data, particularly interesting (see also,
e.g. Lauscher et al. (2020); Hedderich et al. (2020);
Yan et al. (2020)). In this survey, we will not focus
on a specific low-resource scenario but rather spec-
ify which kind of resources the authors assume.
4 Generating Additional Labeled Data
Faced with the lack of task-specific labels, a vari-
ety of approaches have been developed that find
alternative forms of labeled data as replacements
for gold-standard supervision. This is usually done
through some form of expert insights in combina-
tion with automation. These labels tend to be of
lower quality than their manually annotated coun-
terparts and contain more errors or label-noise.
They are, however, easier to obtain, as the man-
ual intervention is focused and limited to setting
up the technique. This makes them well suited for low-resource scenarios. We group these ideas into two
main categories: data augmentation (which uses
task-specific instances to create more of them) and
distant supervision (which labels unlabeled data).
Connected to this, active research exists on learn-
ing with such noisily or weakly labeled data that
tries to handle errors in this automatically created
data and better leverage the additional supervision.
4.1 Data Augmentation
In data augmentation, new labeled instances are
created by modifying the features of existing in-
stances with transformations that do not change
the label of an instance. In the computer vision
community, this is a popular approach where, e.g.,
the classification of an image's content is invariant to rotations of the image. For text, this can be done by
replacing words with equivalents from a collection
of synonyms (Wei and Zou, 2019), entities of the
same type (Raiman and Miller, 2017; Dai and Adel,
2020) or words that share the same morphology
(Gulordava et al., 2018; Vania et al., 2019). This re-
placement can also be guided by a language model
that takes context into consideration (Kobayashi,
2018; Fadaee et al., 2017). Data augmentation can
also be performed on sentence parts based on their
syntactic structure. Şahin and Steedman (2018)
and Vania et al. (2019) rotate parts of the depen-
dency tree that allow such operations to obtain sen-
tence variations. The former also simplify existing labeled sentences by removing sentence parts. The subject-object relation within a sentence is inverted by Min et al. (2020). On the level of vector
representations, Yasunaga et al. (2018) add small
perturbations via adversarial learning to existing
sentences, preventing overfitting in low-resource
POS tagging. Cheng et al. (2020) show how virtual
training sentences can be created using a generative
model to interpolate between two given sentences.
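To make the word-replacement idea concrete, the following minimal Python sketch creates label-preserving variants of a labeled sentence by randomly swapping words with synonyms; the small synonym lexicon and all parameter values are purely illustrative and not taken from the cited works.

    import random

    # Hypothetical synonym lexicon; in practice it could come from WordNet,
    # a thesaurus, or embedding nearest neighbours.
    SYNONYMS = {
        "quick": ["fast", "speedy"],
        "film": ["movie", "picture"],
        "good": ["great", "decent"],
    }

    def augment(tokens, label, n_copies=2, p_replace=0.3, seed=0):
        """Create label-preserving variants by random synonym replacement."""
        rng = random.Random(seed)
        augmented = []
        for _ in range(n_copies):
            new_tokens = [
                rng.choice(SYNONYMS[tok]) if tok in SYNONYMS and rng.random() < p_replace else tok
                for tok in tokens
            ]
            augmented.append((new_tokens, label))  # the label is copied unchanged
        return augmented

    print(augment(["a", "good", "film"], label="positive"))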
Data augmentation can also be achieved by mod-
ifying instances so that the label changes. For gram-
mar correction, Grundkiewicz et al. (2019) use cor-
rect sentences and apply a set of transformations
that introduce errors. A third option is the genera-
tion of features based on the labels. This is a pop-
ular approach in machine translation where target
sentences are back-translated into source sentences
(Bojar and Tamchyna, 2011; Hoang et al., 2018).
An important aspect here is that errors on the source side (the input features) do not seem to have a large negative effect on the target text the model learns
to predict. It is therefore also used in other text gen-
eration tasks like abstract summarization (Parida
and Motlicek, 2019) and table-to-text generation
(Ma et al., 2019). For sentence classification, a
label-dependent language model can be fine-tuned
on a small labeled dataset to generate labeled train-
ing sentences (Kumar et al., 2020; Anaby-Tavor
et al., 2020). Kafle et al. (2017) train a model to
create new questions for visual question-answering.
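As a schematic picture of back-translation, the sketch below turns monolingual target-language sentences into synthetic parallel training pairs; translate_tgt_to_src is a placeholder for whatever target-to-source translation system is available and does not refer to a specific tool from the cited works.

    def back_translate(monolingual_target_sentences, translate_tgt_to_src):
        """Create synthetic parallel data for training a source->target MT model.

        translate_tgt_to_src: placeholder for an existing target-to-source
        translation system. Errors on the synthetic source side are tolerable
        because the target side remains human-written.
        """
        synthetic_pairs = []
        for tgt in monolingual_target_sentences:
            synthetic_src = translate_tgt_to_src(tgt)
            synthetic_pairs.append((synthetic_src, tgt))  # train source->target on these pairs
        return synthetic_pairs

    # Toy usage with an identity function standing in for a real translation model.
    print(back_translate(["Ein Beispielsatz."], lambda sentence: sentence))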
While data augmentation is ubiquitous in the
computer vision community, it has not found such widespread use in natural language processing and is
often evaluated only on selected tasks, especially
compared to the feature representation approaches
of Section 5. A reason might be that the presented
techniques are often task-specific requiring either
hand-crafted systems or training of task-dependent
language models. Data augmentation that can be
applied across tasks and languages is an interesting
aim for future research.
4.2 Distant Supervision
In contrast to data augmentation, distant or weak
supervision uses unlabeled data and keeps it un-
modified. The corresponding labels are obtained
through a (semi-)automatic process from an ex-
ternal source of information. For named entity
recognition (NER), a list of location names might
be obtained from a dictionary and matches of to-
kens in the text with entities in the list are auto-
matically labeled as locations. Distant supervision
was introduced by Mintz et al. (2009) for relation
extraction (RE) with extensions on multi-instance
(Riedel et al., 2010) and multi-label learning (Sur-
deanu et al., 2012). It is still a popular approach
for information extraction tasks like NER and RE
where the external information can be obtained
from knowledge bases, gazetteers, dictionaries and
other forms of structured knowledge sources (Luo
et al., 2017; Hedderich and Klakow, 2018; Deng
and Sun, 2019; Alt et al., 2019; Ye et al., 2019;
Lange et al., 2019a; Nooralahzadeh et al., 2019;
Le and Titov, 2019; Cao et al., 2019; Lison et al.,
2020). The automatic annotation can range from simple string matching (Yang et al., 2018) to complex
pipelines including classifiers and manual steps
(Norman et al., 2019). This distant supervision us-
ing information from external knowledge sources
can be seen as a subset of the more general ap-
proach of labeling rules. These also encompass other ideas like regular expression rules or simple programming functions (Ratner et al., 2017; Zheng et al.,
2019; Adelani et al., 2020; Lison et al., 2020). For
deciding on the complexity of the distant supervi-
sion technique, required effort and necessary label
quality need to be taken into account.
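As a minimal illustration of such an automatic annotation step, the sketch below assigns BIO location tags by exact string matching against a small, hypothetical gazetteer; real distant supervision pipelines are usually more elaborate.

    # Hypothetical gazetteer of location names (lower-cased token tuples).
    GAZETTEER = {("new", "york"), ("berlin",), ("addis", "ababa")}
    MAX_SPAN = max(len(entry) for entry in GAZETTEER)

    def distant_ner_labels(tokens):
        """Assign BIO location tags by longest-match lookup in the gazetteer."""
        tags = ["O"] * len(tokens)
        i = 0
        while i < len(tokens):
            match_len = 0
            for span in range(min(MAX_SPAN, len(tokens) - i), 0, -1):
                if tuple(t.lower() for t in tokens[i:i + span]) in GAZETTEER:
                    match_len = span
                    break
            if match_len:
                tags[i] = "B-LOC"
                for j in range(i + 1, i + match_len):
                    tags[j] = "I-LOC"
                i += match_len
            else:
                i += 1
        return tags

    print(distant_ner_labels("She moved from New York to Berlin".split()))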
Instead of using information from external
knowledge sources to generate labels, it can also
be used as additional features, e.g., for NER in
low-resource languages (Plank and Agić, 2018;
Rijhwani et al., 2020). As an alternative to an
automatic annotation process, annotations might
also be provided by non-experts. Similar to distant
supervision, this results in a trade-off between la-
bel quality and availability. For instance, Garrette
and Baldridge (2013) obtain labeled data from non-
native-speakers and without a quality control on
the manual annotations. This can be taken even fur-
ther by employing annotators who do not speak the
low-resource language (Mayhew and Roth, 2018;
Mayhew et al., 2019; Tsygankova et al., 2020).
While distant supervision is popular for informa-
tion extraction tasks like NER or RE, it is less preva-
lent in other areas of NLP. Nevertheless, distant su-
pervision has also been successfully employed for
other tasks by proposing new ways for automatic
annotation. Li et al. (2012) leverage a knowledge
base for POS tagging. Wang et al. (2019) use con-
text by transferring a document-level sentiment
label to sentence-level. Huber and Carenini (2019)
build a discourse-structure dataset using guidance
from a sentiment classifier. For topic classification,
Bach et al. (2019) use heuristics and inputs from
other classifiers like NER. Distant supervision is
also popular in other fields like image classification
(Xiao et al., 2015; Li et al., 2017; Lee et al., 2018;
Mahajan et al., 2018; Li et al., 2020). This sug-
gests that distant supervision could be leveraged
for more NLP tasks in the future.
Distant supervision methods heavily rely on aux-
iliary data. In a low-resource setting, it might be dif-
ficult to obtain not only labeled data but also such
auxiliary data. Kann et al. (2020) find a large gap
between the performance on high-resource and low-
resource languages for POS tagging, pointing to the lack of high-coverage and error-free dictionaries for the weak supervision in low-resource languages. This emphasizes the need to evaluate such methods in a realistic setting instead of just simulating restricted access to labeled data in a high-resource
language. A discussion of approaches to leverage
flawed distant supervision is given in Section 4.4.
While distant supervision allows obtaining la-
beled data more quickly than manually annotat-
ing every instance of a dataset, it still requires
human interaction to create automatic annotation
techniques or provide labeling rules. This time and
effort could also be spent on annotating more gold
label data, either naively or through an active learn-
ing scheme. Unfortunately, distant supervision papers seldom provide information on how long the creation took, making it difficult to compare these approaches. Putting the human expert into focus connects this research direction with human-computer interaction and human-in-the-loop setups (Klie et al., 2018).
4.3 Cross-Lingual Projections
For cross-lingual projections, a task-specific clas-
sifier is trained in a high-resource language. Us-
ing parallel corpora, the unlabeled low-resource
data is then aligned to its equivalent in the high-
resource language where labels can be obtained
using the aforementioned classifier. These labels
(on the high-resource text) can then be projected
back to the text in the low-resource language based
on the alignment between tokens in the parallel
texts (Yarowsky et al., 2001). This approach can,
therefore, be seen as a form of distant supervi-
sion specific for obtaining labeled data for low-
resource languages. Cross-lingual projections have
been applied in low-resource settings for tasks,
such as POS tagging and parsing (Täckström et al., 2013; Wisniewski et al., 2014; Plank and Agić,
2018). Instead of using parallel corpora, existing
high-resource labeled datasets can also be machine-
translated into the low-resource language (Khalil
et al., 2019; Zhang et al., 2019a; Fei et al., 2020).
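The projection step itself can be summarized by the following sketch: labels predicted on the high-resource side of a parallel sentence pair are copied to the aligned low-resource tokens. The tagger, the word alignments, and the toy sentences are placeholders.

    def project_labels(hr_tokens, lr_tokens, alignment, hr_tagger, default="O"):
        """Project token-level labels from a high-resource (HR) sentence onto its
        low-resource (LR) translation via word alignments.

        alignment: list of (hr_index, lr_index) pairs, e.g. from an automatic
        word aligner. hr_tagger: placeholder for a tagger trained on HR data.
        """
        hr_tags = hr_tagger(hr_tokens)           # labels on the high-resource side
        lr_tags = [default] * len(lr_tokens)
        for hr_i, lr_i in alignment:
            lr_tags[lr_i] = hr_tags[hr_i]        # copy each label along the alignment
        return lr_tags

    # Toy usage with a dummy tagger and a hand-written one-to-one alignment.
    dummy_tagger = lambda toks: ["B-LOC" if t == "Berlin" else "O" for t in toks]
    print(project_labels(["She", "lives", "in", "Berlin"],
                         ["Sie", "wohnt", "in", "Berlin"],
                         alignment=[(0, 0), (1, 1), (2, 2), (3, 3)],
                         hr_tagger=dummy_tagger))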
These cross-lingual projections set high requirements on the auxiliary data, requiring both labels in a high-resource language and means to project them into the low-resource language. Especially the latter might be an issue, as machine translation itself might be problematic for a specific low-resource language. Sources for parallel text can be
the OPUS project (Tiedemann, 2012), Bible cor-
pora (Mayer and Cysouw, 2014; Christodoulopou-
los and Steedman, 2015) or the recently presented
JW300 corpus (Agić and Vulić, 2019) with parallel texts for over 300 languages. A limitation is the restricted domain of these datasets, such as political proceedings or religious texts. Mayhew et al.
(2017) propose to use a simpler, word and lexicon-
based translation instead and Fang and Cohn (2017)
present a system based on bilingual dictionaries.
4.4 Learning with Noisy Labels
Distantly supervised labels might be quicker and
cheaper to obtain than manual annotations, but they
also tend to contain more errors. Even though more
training data is available, training directly on this
noisily-labeled data can actually hurt the perfor-
mance. Therefore, many recent approaches use a
noise handling method to diminish the negative ef-
fects of distant supervision. We categorize these
into two ideas: noise filtering and noise modeling.
Noise filtering methods remove instances from
the training data that have a high probability of
being incorrectly labeled. This often includes train-
ing a classifier to make the filtering decision. The
filtering can remove the instances completely from
the training data, e.g., through a probability thresh-
old (Jia et al., 2019), a binary classifier (Adel and
Schütze, 2015; Onoe and Durrett, 2019; Huang and
Du, 2019) or the use of a reinforcement-based agent
(Yang et al., 2018; Nooralahzadeh et al., 2019). Al-
ternatively, a soft filtering might be applied that
re-weights instances according to their probability
of being correctly labeled (Le and Titov, 2019) or
an attention measure (Hu et al., 2019).
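A minimal sketch of the threshold-based variant: a model scores each distantly labeled instance, and instances whose assigned label receives low probability are discarded. The scoring model and the threshold are placeholders, not components of any specific cited system.

    def filter_noisy_instances(instances, predict_proba, threshold=0.5):
        """Keep only instances whose (possibly noisy) label the model finds plausible.

        instances: list of (features, noisy_label) pairs.
        predict_proba: placeholder for a trained model that returns a dict
        mapping each label to its probability for the given features.
        """
        kept = []
        for features, noisy_label in instances:
            if predict_proba(features).get(noisy_label, 0.0) >= threshold:
                kept.append((features, noisy_label))
        return kept

    # Toy usage: a dummy model that trusts "LOC" labels only for capitalized tokens.
    dummy_proba = lambda tok: {"LOC": 0.9, "O": 0.1} if tok[0].isupper() else {"LOC": 0.1, "O": 0.9}
    data = [("Berlin", "LOC"), ("runs", "LOC"), ("Paris", "LOC")]
    print(filter_noisy_instances(data, dummy_proba))  # drops ("runs", "LOC")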
The noise in the labels can also be modeled. A
common model is a confusion matrix estimating
the relationship between clean and noisy labels
(Fang and Cohn, 2016; Luo et al., 2017; Hedderich
and Klakow, 2018; Paul et al., 2019; Lange et al.,
2019a,b; Chen et al., 2019; Wang et al., 2019). The
classifier is no longer trained directly on the noisily-
labeled data. Instead, a noise model is appended
which shifts the noisy to the (unseen) clean label
distribution. This can be interpreted as the original
classifier being trained on a "cleaned" version of
the noisy labels. In Ye et al. (2019), the prediction
is shifted from the noisy to the clean distribution
during testing. In Chen et al. (2020) a group of rein-
forcement agents relabels noisy instances. Rehbein
and Ruppenhofer (2017) and Lison et al. (2020)
leverage several sources of distant supervision and
learn how to combine them.
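To make the confusion-matrix idea concrete, the following is a generic PyTorch sketch (assuming PyTorch; not the implementation of any specific cited paper): the base classifier predicts a clean label distribution, a learned noise channel maps it to the noisy label distribution that is matched against the noisy labels, and at prediction time only the base classifier is used.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NoisyChannelClassifier(nn.Module):
        """Base classifier plus a confusion-matrix noise layer (clean -> noisy)."""

        def __init__(self, input_dim, num_labels):
            super().__init__()
            self.base = nn.Linear(input_dim, num_labels)
            # Unnormalized confusion matrix, initialized close to the identity so
            # that the channel starts out assuming the labels are mostly correct.
            self.noise_logits = nn.Parameter(torch.eye(num_labels) * 5.0)

        def forward(self, x):
            clean_probs = F.softmax(self.base(x), dim=-1)    # p(clean label | x)
            channel = F.softmax(self.noise_logits, dim=-1)   # p(noisy label | clean label)
            noisy_probs = clean_probs @ channel              # p(noisy label | x)
            return clean_probs, noisy_probs

    # One training step against noisy labels (toy data).
    model = NoisyChannelClassifier(input_dim=16, num_labels=4)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    x, noisy_y = torch.randn(8, 16), torch.randint(0, 4, (8,))

    _, noisy_probs = model(x)
    loss = F.nll_loss(torch.log(noisy_probs + 1e-8), noisy_y)  # fit the noisy distribution
    loss.backward()
    optimizer.step()
    # At test time, only clean_probs from the base classifier is used.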
Especially for NER, the noise in distantly super-
vised labels tends to be false negative errors, i.e.,
entities that were not annotated. Partial annotation
learning (Yang et al., 2018; Nooralahzadeh et al.,
2019; Cao et al., 2019) takes this into account ex-
plicitly. Related approaches learn latent variables
(Jie et al., 2019), use constrained binary learning
(Mayhew et al., 2019) or construct a loss under the
assumption that only unlabeled positive instances
are available (Peng et al., 2019).
5 Transfer Learning
While distant supervision and data augmentation
generate and extend task-specific training data,
transfer learning reduces the need for labeled tar-
get data by transferring learned representations and
models. This has two aspects: (i) Unlabeled data
can be used to obtain pre-trained language repre-
sentations. A strong focus in recent works on trans-
fer learning in NLP lies in the use of pre-trained
language representations like BERT (Devlin et al.,
2019) as well as the training of domain-specific
or multilingual representations. (ii) Auxiliary data
can be used to train and transfer models from re-
lated tasks in the same language, or the same (or
similar) task from other domains or languages.
5.1 Pre-trained Language Representations
Feature vectors are the core input component of
many neural network-based models for NLP tasks:
numerical representations of words or sentences,
as neural architectures do not allow the process-
ing of strings and characters as such. Collobert
et al. (2011) showed that training these models for the task of language modeling on a large-scale corpus results in high-quality word representations that can be reused for other downstream tasks
as well. Thus, pre-trained language representa-
tions have become essential components of neu-
ral NLP models, as these models can leverage
information learned from large-scale, unlabeled
text sources. Starting with the work by Collobert
et al. (2011) and Mikolov et al. (2013a), word em-
beddings are pre-trained on large background cor-
pora. Subword-based embeddings such as fast-
Text n-gram embeddings (Bojanowski et al., 2017)
and byte-pair-encoding embeddings (Heinzerling
and Strube, 2018) addressed out-of-vocabulary is-
sues and enabled the processing of texts in many
languages, including multiple low-resource lan-
guages, as pre-trained embeddings were published
for more than 270 languages covered in Wikipedia
for both embedding methods. Zhu et al. (2019)
showed that embeddings using subword informa-
tion, e.g., n-grams or byte-pair-encodings, are ben-
eficial for low-resource sequence labeling tasks,
such as named entity recognition and typing, and
outperform word-level embeddings.
Jungmaier
et al. (2020) added smoothing to word2vec models
to correct their bias towards rare words and achieved
improvements in particular for low-resource set-
tings. However, data quality for low-resource languages, even for unlabeled data, might not be comparable to data
from high-resource languages. Alabi et al. (2020)
found that word embeddings trained on massive
amounts of unlabeled data from low-resource lan-
guages are not competitive with embeddings trained
on smaller, but curated data sources.
More recently, a trend emerged of pre-training
large embedding models using a language model
objective to create context-aware word represen-
tations. Starting with the recurrent ELMo model
(Peters et al., 2018), the trend evolved towards pre-
trained transformer models (Vaswani et al., 2017),
such as BERT (Devlin et al., 2019), GPT2 (Rad-
ford et al., 2019) or RoBERTa (Liu et al., 2019b).
These models build representations by leveraging multiple words of a sequence to generate a con-
textualized representation. By pre-training on a
large-scale dataset of unlabeled texts, these mod-
els can overcome a lack of labeled data by using
prior learned knowledge, enabling usage in a low-
resource scenario (Cruz and Cheng, 2019).
While pre-trained language models achieve sig-
nificant performance increases, it is still question-
able if these methods are suited for real-world low-
resource scenarios. For example, all of these mod-
els have large hardware requirements, in partic-
ular, considering that the transformer model size
keeps increasing to boost performance (Raffel et al.,
2019). Therefore, these large-scale methods might
not be suited for low-resource scenarios where
hardware is also low-resource. van Biljon et al.
(2020) showed that low- to medium-depth trans-
former sizes are most beneficial for low-resource
languages and Melamud et al. (2019) showed that
simple bag-of-words approaches are better when
there are only a few dozen training instances or
less for text classification, while more complex
transformer models require more training data.
5.2 Multilingual Transfer
Low-resource languages can also benefit from la-
beled resources available in other high-resource
languages. Most often, neural models for cross-
and multilingual processing are transferred by cre-
ating a common multilingual embedding space of
multiple languages. This can be done by, e.g., com-
puting a mapping between two different embedding
spaces, such that the words in both embeddings
share similar feature vectors after the mapping, the
so-called alignment (Mikolov et al., 2013b; Joulin
et al., 2018; Ruder et al., 2019). For example, monolingual embeddings of several languages can be aligned to the same space. Then, the original embeddings can be replaced with different, but aligned embeddings from another language. Assuming a perfect alignment, the model would still perform well on the new, unseen language, as the feature vectors point in known directions. Smith et al. (2017) proposed to align embed-
dings by creating transformation matrices based on
bilingual dictionaries to map different embeddings
to a common space. Zhang et al. (2019b) created
bilingual representations by creating cross-lingual
word embeddings using a small set of parallel sen-
tences for low-resource document retrieval.
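To illustrate the mapping-based alignment, the following NumPy sketch computes the orthogonal Procrustes solution that is commonly used in this line of work: given embedding pairs from a bilingual dictionary, it finds a rotation that maps source-language vectors close to the vectors of their translations. The synthetic data only serves as a toy check.

    import numpy as np

    def learn_alignment(src_vecs, tgt_vecs):
        """Orthogonal mapping W minimizing ||src @ W - tgt|| (Procrustes solution).

        src_vecs, tgt_vecs: (n, d) arrays with row i holding the embeddings of
        the i-th translation pair from a bilingual dictionary.
        """
        u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
        return u @ vt  # (d, d) orthogonal matrix

    # Toy usage: recover a synthetic rotation between two embedding spaces.
    rng = np.random.default_rng(0)
    tgt = rng.normal(size=(100, 50))
    true_rotation, _ = np.linalg.qr(rng.normal(size=(50, 50)))
    src = tgt @ true_rotation.T                  # "source-language" space
    W = learn_alignment(src, tgt)
    aligned_src = src @ W                        # now comparable to the tgt vectors
    print(np.allclose(aligned_src, tgt, atol=1e-6))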
5.3 Multilingual Language Models
Unsupervised pre-training enabled another fruitful
research direction:
the training of highly multi-
lingual representations in a single model, such as
multilingual BERT (Devlin et al., 2019) or XLM-
RoBERTa (Conneau et al., 2020). These models
are trained using unlabeled, monolingual corpora
from different languages and can be used in cross-
and multilingual settings, due to many languages
seen during pre-training. In cross-lingual zero-shot
learning, no task-specific labeled data is available
in the low-resource target language. Instead, la-
beled data from a high-resource language is lever-
aged. Hu et al. (2020) showed, however, that there
is still a large gap between low and high-resource
settings. Lauscher et al. (2020) and Hedderich et al. (2020) proposed additionally adding a minimal amount of target-task and target-language data (in the range of 10 to 100 labeled sentences), which resulted in a significant boost in performance.
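A minimal sketch of zero-shot cross-lingual transfer with such a model, assuming the Hugging Face transformers and scikit-learn libraries: a frozen mBERT encoder combined with a logistic-regression classifier stands in for the fine-tuning setups used in the cited works, and the example sentences are illustrative.

    import torch
    from transformers import AutoModel, AutoTokenizer
    from sklearn.linear_model import LogisticRegression

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

    def embed(sentences):
        """Mean-pooled multilingual encoder representations."""
        batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**batch).last_hidden_state        # (batch, seq_len, dim)
        mask = batch["attention_mask"].unsqueeze(-1)
        return ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # (batch, dim)

    # Task-specific labels exist only in the high-resource language (English here) ...
    en_sentences = ["The movie was wonderful.", "A complete waste of time."]
    classifier = LogisticRegression().fit(embed(en_sentences), [1, 0])

    # ... and the classifier is applied zero-shot to the target language (German here).
    print(classifier.predict(embed(["Der Film war wunderbar."])))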
However, it has recently been questioned whether these models are truly multilingual, and evidence was found that language-specific information is stored at least in the upper transformer layers, while more general, language-independent representations reside in the lower layers (Pires et al., 2019; Singh et al., 2019; Libovický et al., 2020). Nonetheless, these models are state of the art in cross-lingual
transfer (Wu and Dredze, 2019), including transfer
to unseen low-resource languages (K et al., 2020;
Liu et al., 2020). Further, it was shown that trans-
former models can also benefit from alignment, either by projecting to the same space as with traditional non-contextualized embeddings (Wu et al., 2019), or by aligning the languages inside a single multilingual model, i.a., in cross-
lingual (Schuster et al., 2019; Liu et al., 2019a) or
multilingual settings (Cao et al., 2020).
While these models are a tremendous step to-
wards enabling NLP in many languages, possible
claims that these are universal language models do
not hold. For example, mBERT covers 104 and
XLM-R 100 languages, which is a third of all lan-
guages in Wikipedia as outlined earlier. A notable
exception are the multilingual FLAIR embeddings
that were trained on more than 380 languages from
the JW300 and OPUS corpora, but these are typ-
ically used within the limited scope of sequence
labeling compared to the broad usage of BERT
models. Further, Wu and Dredze (2020) showed
that, in particular, low-resource languages are not
well-represented in mBERT. Figure 2 shows which
language families with at least 1 million speakers
are covered by mBERT and XLM-RoBERTa.2 In
particular, African and American languages are
not well-represented within the transformer mod-
els, even though millions of people speak these languages. This can be problematic, as languages from more distant language families are less suited for transfer learning, as Lauscher et al. (2020) showed.

Figure 2: Language families with more than 1 million speakers covered by multilingual transformer models.

2 A language family is covered if at least one associated language is covered. Language families can belong to multiple regions, e.g., Indo-European belongs to Europe+Asia.

5.4 Domain-Specific Pretraining

Pre-trained transformer models have also outperformed previous models in a wide range of domains. For example, Chronopoulou et al. (2019) and Dirkson and Verberne (2019) transferred pre-trained language models to Twitter texts. Even though the respective models were pre-trained on general-domain data, such as texts from the news or web domain, the models can be successfully transferred to texts from unseen domains, such as the clinical domain (Sun and Yang, 2019) or the materials science domain (Friedrich et al., 2020). Lin et al. (2020) found that the plain BERT model performed best in their setup for clinical negation detection compared to several models fine-tuned on the clinical domain. However, fine-tuning or pre-training with domain-specific data is considered beneficial in most situations. This is also displayed in the number of domain-adapted BERT models (Alsentzer et al., 2019; Huang et al., 2019; Adhikari et al., 2019; Lee and Hsiang, 2020; Jain and Ganesamoorty, 2020, i.a.), most notably BioBERT (Lee et al., 2020) that was pre-trained on biomedical PubMed articles and SciBERT (Beltagy et al., 2019) for scientific texts.

Xu et al. (2020) used in- and out-of-domain data to pre-train a domain-specific model and adapt it to low-resource domains. Powerful representations can be achieved with a combination of high-resource embeddings and low-resource embeddings from the target domain (Akbik et al., 2018). Gururangan et al. (2020) showed that continuing the training of an already pre-trained model with additional domain-adaptive and task-adaptive pretraining on unlabeled data leads to performance gains for both high- and low-resource settings across numerous English domains and tasks. Aharoni and Goldberg (2020) found domain-specific clusters in pre-trained language models and showed how these could be exploited for data selection in domain-sensitive training.
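A minimal sketch of such continued, domain-adaptive masked-language-model pretraining, assuming the Hugging Face transformers library; the checkpoint name, the two toy sentences, and all hyperparameters are placeholders, and the adapted model would subsequently be fine-tuned on the small labeled target task.

    import torch
    from torch.utils.data import DataLoader
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling)

    model_name = "bert-base-uncased"  # placeholder: any pre-trained MLM checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    # Unlabeled in-domain text, e.g. clinical notes or materials-science abstracts.
    domain_texts = ["Example in-domain sentence one.", "Example in-domain sentence two."]
    encodings = tokenizer(domain_texts, truncation=True, max_length=128)
    dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
    loader = DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=collator)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    model.train()
    for _ in range(1):                       # more epochs and much more data in practice
        for batch in loader:
            loss = model(**batch).loss       # masked-LM loss on the in-domain text
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()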
6 Other Machine Learning Approaches
Training on a limited amount of data is not unique
to natural language processing. Other areas, like
general machine learning and computer vision, can
be a useful source for insights and new ideas. One
of these is Meta-Learning (Finn et al., 2017), which
is based on multi-task learning. Given a set of aux-
iliary high-resource tasks and a low-resource target
task, meta-learning trains a model to decide how
to use the auxiliary tasks in the most beneficial
way for the target task. Yu et al. (2018) evaluated
such a concept on sentiment and user intent clas-
sification. Dou et al. (2019) applied this approach
to natural language understanding tasks from the
GLUE benchmark. Instead of having a set of tasks,
Rahimi et al. (2019) built an ensemble of language-
specific NER classifiers which are then weighted
depending on the zero- or few-shot target language.
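MAML (Finn et al., 2017) differentiates through the task-specific adaptation; the sketch below instead shows a simpler first-order variant (in the style of the Reptile algorithm) to convey the overall structure: adapt to each auxiliary task for a few steps, then move the shared initialization towards the adapted weights. It assumes PyTorch, and the toy model and tasks are illustrative.

    import torch
    import torch.nn as nn

    def meta_update(model, tasks, inner_lr=1e-2, meta_lr=0.1, inner_steps=5):
        """One meta-update of a first-order (Reptile-style) meta-learning scheme.

        tasks: list of (x, y) batches, each standing in for the data of one
        auxiliary high-resource task. After meta-training, the model should
        adapt to a new low-resource task within a few gradient steps.
        """
        loss_fn = nn.CrossEntropyLoss()
        meta_weights = {k: v.detach().clone() for k, v in model.state_dict().items()}
        for x, y in tasks:
            model.load_state_dict(meta_weights)              # start from the meta-parameters
            inner_opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
            for _ in range(inner_steps):                     # task-specific adaptation
                inner_opt.zero_grad()
                loss_fn(model(x), y).backward()
                inner_opt.step()
            with torch.no_grad():                            # move the meta-parameters a small
                for name, adapted in model.state_dict().items():  # step towards the adapted ones
                    meta_weights[name] += meta_lr * (adapted - meta_weights[name])
        model.load_state_dict(meta_weights)

    # Toy usage: two synthetic "tasks" with 16-dimensional features and 3 classes.
    model = nn.Linear(16, 3)
    tasks = [(torch.randn(32, 16), torch.randint(0, 3, (32,))) for _ in range(2)]
    meta_update(model, tasks)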
Differences in the features between the pre-
training and the target domain can be an issue in
transfer learning, especially in neural approaches
where it can be difficult to control which informa-
tion the model takes into account. Adversarial dis-
criminators (Goodfellow et al., 2014) can prevent
the model from learning a feature-representation
that is specific to a data source. Gui et al. (2017),
Liu et al. (2017), Kasai et al. (2019), Grießhaber
et al. (2020) and Zhou et al. (2019) learned domain-
independent representations using adversarial train-
ing. Kim et al. (2017), Chen et al. (2018) and Lange
et al. (2020) worked with language-independent
representations for cross-lingual transfer. These
examples show the beneficial exchange of ideas be-
tween NLP and the machine learning community.
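To illustrate the adversarial idea, the following PyTorch sketch uses the common gradient-reversal pattern: a discriminator is trained to predict the domain (or language) of an instance, while the reversed gradient pushes the shared encoder towards domain-independent features. Layer sizes and names are illustrative and not taken from the cited works.

    import torch
    import torch.nn as nn

    class GradientReversal(torch.autograd.Function):
        """Identity in the forward pass, flips the gradient sign in the backward pass."""

        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    feature_extractor = nn.Sequential(nn.Linear(300, 128), nn.ReLU())  # shared encoder
    task_classifier = nn.Linear(128, 5)        # target task head (e.g., 5 classes)
    domain_discriminator = nn.Linear(128, 2)   # source vs. target domain/language

    def combined_loss(x, task_labels, domain_labels, lambd=0.1):
        """Task loss plus adversarial domain loss through gradient reversal."""
        features = feature_extractor(x)
        task_loss = nn.functional.cross_entropy(task_classifier(features), task_labels)
        # The discriminator learns to predict the domain, but the reversed gradient
        # trains the encoder to produce features that confuse the discriminator.
        reversed_features = GradientReversal.apply(features, lambd)
        domain_loss = nn.functional.cross_entropy(domain_discriminator(reversed_features), domain_labels)
        return task_loss + domain_loss

    # Toy batch: 300-dimensional inputs, task labels for 5 classes, binary domain labels.
    loss = combined_loss(torch.randn(4, 300), torch.randint(0, 5, (4,)), torch.randint(0, 2, (4,)))
    loss.backward()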
7 Conclusion
In this survey, we gave a structured overview of
recent work in the field of low-resource natural lan-
guage processing. We showed that it is essential to
analyze resource-lean scenarios across the differ-
ent dimensions of data-availability. This can reveal
which techniques are expected to be applicable and
helpful in a specific low-resource setting. We hope
that our discussions on open issues for the different
approaches can serve as inspiration for future work.