Journal of Applied Mathematics and Physics, 2018, 6, 1558-1570
http://www.scirp.org/journal/jamp
ISSN Online: 2327-4379
ISSN Print: 2327-4352
Revisit Language Modeling Competition and
Extinction: A Data-Driven Validation
Chosila Sutantawibul¹, Pengcheng Xiao²*, Sarah Richie¹, Daniela Fuentes-Rivero²
¹Department of Physics, University of Evansville, Evansville, USA
²Department of Mathematics, University of Evansville, Evansville, USA
How to cite this paper: Sutantawibul, C., Xiao, P.C., Richie, S. and
Fuentes-Rivero, D. (2018) Revisit Language Modeling Competition and
Extinction: A Data-Driven Validation. Journal of Applied Mathematics and
Physics, 6, 1558-1570.
https://doi.org/10.4236/jamp.2018.67132
Received: May 7, 2018
Accepted: July 27, 2018
Published: July 30, 2018
Copyright © 2018 by authors and
Scientific Research Publishing Inc.
This work is licensed under the Creative
Commons Attribution International
License (CC BY 4.0).
http://creativecommons.org/licenses/by/4.0/
Open Access
Abstract
Understanding language competition and extinction is an interdisciplinary
challenge, and mathematical models provide a tool for interpreting linguistic
census data and possibly predicting language shift at the population scale. In
this study, new data from previously examined areas were modeled, specifically
Catalan and Spanish in Catalonia, Spanish and English in Houston, Texas, Dutch
and French in Brussels, Euskera and Spanish in Spain, and French and English in
Canada. Three mathematical models of language competition were validated
against these data. The first is the Abrams-Strogatz model, which treats the
population as two monolingual groups. The second is the Castelló model, which
adds bilingual speakers. The third is the Mira model, which considers language
competition when the two languages are highly similar. Some of the data matched
the original Abrams-Strogatz model, but some divergences remain to be
addressed. The Mira model was also found to need improvement in how it treats
the differences between languages.
Keywords
Language Competition Model, Parameter Estimation, Nonlinear Regression
1. Introduction
Throughout history, languages have changed significantly or gone extinct. It is
estimated that 90% of the languages that exist today will be extinct within the
next generation [1]. The reasons are many, such as empires conquering regions
and coercing the inhabitants to speak their language, and globalization, in
which native speakers must learn the languages of their neighbors. This leads
to bilingual speakers and, in some cases, to the death of a language entirely
[2] [3] [4].
1.1. Language Modeling
Many mathematical models have been proposed to describe the dynamics of
competition between two languages in a given region. There are two primary
types of models describing language competition: microscopic and macroscopic.
Macroscopic models of language competition treat the population as homogeneous
(all members are the same, and evenly dispersed in an area) and fully connected
(all members interact with all other members) [5]. These macroscopic models are
usually described by differential equations. Microscopic models treat
individual persons as nodes in a network, allowing each node to be connected to
a certain number of other nodes and to have its own transition probabilities.
In this paper, we focus on three macroscopic language competition models: the
Abrams-Strogatz model, the Castelló model, and the Mira model.
1.2. Abrams-Strogatz Model
Abrams and Strogatz proposed one of the first models to describe language
competition using statistical physics and complex systems, which prompted the
publication of other models built on similar ideas [6]. The Abrams-Strogatz
(AS) model describes language competition in a way similar to the
Lotka-Volterra predator-prey model, except that both languages act as predator
and prey to each other, since a speaker can switch from either language to the
other. It assumes that the population is homogeneous and fully connected, which
may not represent rural areas with sparse population or geographical
separation, but could describe densely populated areas like cities. This model
is described by the differential equations [7]:
$$
\begin{aligned}
\frac{dx}{dt} &= y\,P_{yx}(x, s) - x\,P_{xy}(x, s) \\
\frac{dy}{dt} &= x\,P_{xy}(x, s) - y\,P_{yx}(x, s)
\end{aligned}
\tag{1}
$$
where x and y are the fractions of the population speaking languages x and y,
respectively, which means that the sum of the two fractions should equal one.
$P_{xy}$ is the probability that an individual would switch from speaking
language x to language y, and $P_{yx}$ is the probability of the reverse
switch. These probabilities are defined by:
$$
\begin{aligned}
P_{yx}(x, s) &= s\,x^{a} \\
P_{xy}(x, s) &= (1 - s)(1 - x)^{a},
\end{aligned}
\tag{2}
$$
where a is the volatility of a language, or how easy it is for an individual to
switch over to the other language, and s is the prestige of the language, which
is how attractive a language is to switch to. These two parameters are acquired
by fitting this model to the data of population speaking a specific language in an
area.
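As a minimal sketch of Equations (1) and (2), the transition probabilities and
the two rate equations can be written directly in code. Python is used here for
illustration only (the fits in this paper were run in MATLAB); the function
names and the sample values of a and s are assumptions, not fitted results.

```python
def p_yx(x, s, a):
    """Probability of switching from language y to x: s * x**a (Eq. 2)."""
    return s * x ** a


def p_xy(x, s, a):
    """Probability of switching from language x to y: (1 - s) * (1 - x)**a (Eq. 2)."""
    return (1.0 - s) * (1.0 - x) ** a


def as_rhs(x, s, a):
    """Right-hand side of Eq. (1) for the fractions (x, y) with y = 1 - x."""
    y = 1.0 - x
    dx = y * p_yx(x, s, a) - x * p_xy(x, s, a)
    dy = x * p_xy(x, s, a) - y * p_yx(x, s, a)
    return dx, dy


# Illustrative values only (not fitted to any dataset in this paper).
dx, dy = as_rhs(x=0.3, s=0.6, a=1.3)
print(dx, dy)  # the two rates are equal and opposite, so x + y stays at 1
```

Because every speaker who leaves one group joins the other, the two rates
cancel, which is why only one of the two equations needs to be integrated.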
Equation (1) could be viewed as rate equations, where the change in population
of language x is simply the population of language y times the probability of
people speaking y to change to x (people speaking y changing to speak
x ) minus the population of language x times the probability of changing from
x to y (people speaking x changing to y ). This model considers the
speakers of each language to be strictly monolingual.
At high volatility ($a > 1$), the few stable states of this model (states in
which the fractions of the population speaking each language no longer change)
are when the entire population speaks one language while the other dies
($s \neq 0.5$) and when both languages have the same number of speakers
($s = 0.5$). Since the condition for stability where both languages survive is
so precise, the AS model almost always predicts that one language will
eventually go extinct while the rest of the population adopts the other
language.
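The long-term behaviour described above can be illustrated by integrating the
AS model numerically. The sketch below is not the MATLAB code used in this
study: it assumes a fixed-step classical Runge-Kutta (RK4) scheme in place of
ode45, and the values of a, s, and the starting fraction are illustrative
assumptions.

```python
def as_dxdt(x, s, a):
    """dx/dt from Eq. (1) with y = 1 - x and the probabilities of Eq. (2)."""
    x = min(1.0, max(0.0, x))  # guard against round-off outside [0, 1]
    return (1.0 - x) * s * x ** a - x * (1.0 - s) * (1.0 - x) ** a


def integrate_as(x0, s, a, t_end=100.0, dt=0.01):
    """Integrate dx/dt from t = 0 to t_end with fixed-step RK4; return x(t_end)."""
    x = x0
    for _ in range(int(round(t_end / dt))):
        k1 = as_dxdt(x, s, a)
        k2 = as_dxdt(x + 0.5 * dt * k1, s, a)
        k3 = as_dxdt(x + 0.5 * dt * k2, s, a)
        k4 = as_dxdt(x + dt * k3, s, a)
        x += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return x


# Illustrative runs, all starting from an even split (x = y = 0.5).
print(integrate_as(0.5, s=0.6, a=1.3))  # x approaches 1: language y dies out
print(integrate_as(0.5, s=0.4, a=1.3))  # x approaches 0: language x dies out
print(integrate_as(0.5, s=0.5, a=1.3))  # perfectly balanced case: x stays at 0.5
```

In this sketch the balanced outcome appears only for the exactly symmetric
setup, which matches the observation that the coexistence condition is very
precise.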
1.3. Castelló Model
Inspired by the original proposal of Wang and Minett [8], the model of Castelló
et al. extends the AS model by adding a third state that members of the
population can be in: bilingual. This allows individuals to move from speaking
only language x, to speaking both languages (b), to speaking only language y,
and vice versa. This model also assumes a homogeneous and fully connected
population. The presence of a third, intermediate state slows down the process
of language extinction, but still does not prevent it indefinitely [9].
The differential equations that describe this model are:
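$$
\begin{aligned}
\frac{dx}{dt} &= y\,P_{YX} + b\,P_{BX} - x\,(P_{XY} + P_{XB}) \\
\frac{dy}{dt} &= x\,P_{XY} + b\,P_{BY} - y\,(P_{YX} + P_{YB}) \\
\frac{db}{dt} &= x\,P_{XB} + y\,P_{YB} - b\,(P_{BX} + P_{BY})
\end{aligned}
\tag{3}
$$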
Again, these equations are simply rate equations, with the probabilities:
$$
\begin{aligned}
P_{XB} &= (1 - s)\,y^{a} \\
P_{YB} &= s\,x^{a} \\
P_{BX} &= s\,(1 - y)^{a} \\
P_{BY} &= (1 - s)(1 - x)^{a}
\end{aligned}
\tag{4}
$$
Castelló et al. explored the model both qualitatively and quantitatively on
complex networks and two-dimensional square lattices; details are given in
Ref. [10]. They found that there exists a transition from a state in which one
language dominates to a state in which the languages coexist, and that
maintaining the coexistence state is very challenging when bilingual speakers
are present. The parameters in this model are also acquired by fitting the
model to data.
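The three-state bookkeeping of Equation (3) can be checked with a short sketch.
It assumes the per-capita transition probabilities P_XB = (1 - s) y^a,
P_YB = s x^a, P_BX = s (1 - y)^a and P_BY = (1 - s)(1 - x)^a, and it further
assumes that direct switches between the two monolingual groups are absent
(P_XY = P_YX = 0), since only transitions through the bilingual group are
defined for this model. The function name and sample values are hypothetical.

```python
def castello_rhs(x, y, s, a):
    """Right-hand side of Eq. (3) for the bilingual model.

    x, y are the monolingual fractions; b = 1 - x - y is the bilingual fraction.
    Assumption: no direct monolingual-to-monolingual switching (P_XY = P_YX = 0).
    """
    b = 1.0 - x - y
    p_xb = (1.0 - s) * y ** a          # x-monolingual -> bilingual
    p_yb = s * x ** a                  # y-monolingual -> bilingual
    p_bx = s * (1.0 - y) ** a          # bilingual -> x-monolingual
    p_by = (1.0 - s) * (1.0 - x) ** a  # bilingual -> y-monolingual
    p_xy = p_yx = 0.0                  # assumed: no direct x <-> y switching
    dx = y * p_yx + b * p_bx - x * (p_xy + p_xb)
    dy = x * p_xy + b * p_by - y * (p_yx + p_yb)
    db = x * p_xb + y * p_yb - b * (p_bx + p_by)
    return dx, dy, db


# Illustrative symmetric state: 40% x-only, 40% y-only, 20% bilingual.
dx, dy, db = castello_rhs(0.4, 0.4, s=0.5, a=1.0)
print(dx, dy, db)  # the three rates sum to zero, so x + y + b is conserved
```

The conservation check is a useful sanity test before fitting, because a sign
error in any one term of Equation (3) would make the fractions drift away from
a total of one.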
1.4. Mira Model
The model of Mira et al. is also an extension of the AS model. It adds to the
AS model by 1) introducing bilingual speakers, and 2) introducing an extra
factor, k, that describes the similarity between the two languages. The
parameters a, s, and k are all acquired by fitting the model to the data as
well. Mira discusses the possibility of calculating k from the similarity of
the languages, such as their words, grammar, and structure. Mira takes k = 1 to
be the situation where the languages are identical and k = 0 to be where the
languages are entirely different. The process of calculating k can be very
complicated and has yet to be developed [11].
The differential equations that describe Mira's model are the same as Equation
(3), but the transition probabilities are different, as they must contain the
k value [12]:
$$
\begin{aligned}
P_{XB} &= k\,s_{Y}(1 - x)^{a} \\
P_{XY} &= (1 - k)\,s_{Y}(1 - x)^{a} \\
P_{YB} &= k\,s_{X}(1 - y)^{a} \\
P_{YX} &= (1 - k)\,s_{X}(1 - y)^{a}
\end{aligned}
\tag{5}
$$
Mira's work focuses on the time evolution of two coexisting languages
(Castilian Spanish and Galician) under the framework of the AS model. It claims
that if the competing languages are similar enough, then a stable bilingual
situation is possible. A sufficiently large value of k is needed for this
particular situation [6] [12].
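To make the role of k concrete, the sketch below evaluates the four transition
probabilities of the Mira model, assuming the forms
P_XB = k s_Y (1 - x)^a, P_XY = (1 - k) s_Y (1 - x)^a,
P_YB = k s_X (1 - y)^a and P_YX = (1 - k) s_X (1 - y)^a. The function name and
the sample values are hypothetical and are not fitted results.

```python
def mira_probabilities(x, y, s_x, s_y, a, k):
    """Transition probabilities of the Mira model.

    k is the similarity between the languages
    (0: entirely different, 1: identical).
    """
    p_xb = k * s_y * (1.0 - x) ** a          # x-monolingual -> bilingual
    p_xy = (1.0 - k) * s_y * (1.0 - x) ** a  # x-monolingual -> y-monolingual
    p_yb = k * s_x * (1.0 - y) ** a          # y-monolingual -> bilingual
    p_yx = (1.0 - k) * s_x * (1.0 - y) ** a  # y-monolingual -> x-monolingual
    return {"XB": p_xb, "XY": p_xy, "YB": p_yb, "YX": p_yx}


# With k = 0 nobody becomes bilingual, and with y = 1 - x the remaining
# probabilities take the AS form: P_YX = s_X * x**a, P_XY = s_Y * (1 - x)**a.
p0 = mira_probabilities(x=0.3, y=0.7, s_x=0.6, s_y=0.4, a=1.3, k=0.0)
print(p0)

# For any k, the parameter only splits the outflow from each monolingual group
# between the bilingual group and the other monolingual group.
p = mira_probabilities(x=0.3, y=0.5, s_x=0.6, s_y=0.4, a=1.3, k=0.4)
print(p["XB"] + p["XY"])  # equals s_Y * (1 - x)**a regardless of k
```

Seen this way, k does not change how many speakers leave a monolingual group,
only where they go, so a larger k feeds the bilingual group.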
1.5. Questions to Be Answered
While previous studies found that a constant volatility fit their data, this
could still be examined with more data. The prestige of other languages could
also be determined if other datasets were considered. The other question was
how these models could be extended or improved. Given the wide range of areas
where language competition exists, looking at more datasets offers more
possibilities for improving these models, especially the Mira and Castelló
models. In this work, we focus on macroscopic models. Macroscopic modeling has
also been reported more frequently, so it is easier to check whether our
results are accurate.
The paper is organized as follows. Section 2 describes the method for the
model validation. The first part introduces the method we used for computing
the parameters, while the second part describes the data accumulated from
eight different regions. In Section 3, we present the parameter fitting results
based on the data from Section 2. The paper concludes with a discussion in
Section 4.
2. Method
All the models were coded and fitted using MATLAB. The differential equations
were solved using ode45. Since ode45 only has medium accuracy, ode113 was used
when higher accuracy was needed. To find the parameters, lsqcurvefit was used.
lsqcurvefit performs a least squares regression: it computes the distance from
the fitted curve to each data point and finds the parameters that make the sum
of the squared distances smallest. This method is different from Abrams and
Strogatz's method, as they wrote their own routines to numerically solve the
differential equation as well as to compute the parameters. They used least
absolute value regression, rather than least squares regression, to compute
their parameters, which may lead to discrepancies between their parameters and
ours [7].
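The difference between the two regression criteria can be sketched in a few
lines. The example below is an illustration in Python, not the MATLAB workflow
used in this study: a fixed-step RK4 integrator stands in for ode45, a coarse
grid search stands in for the optimizer inside lsqcurvefit, and the data are
synthetic values generated from assumed parameters (s = 0.60, a = 1.3). All
function names are hypothetical.

```python
def simulate_as(x0, s, a, times, dt=0.05):
    """Return the AS-model fraction x at each requested time (fixed-step RK4)."""
    def f(x):
        x = min(1.0, max(0.0, x))  # guard against round-off outside [0, 1]
        return (1.0 - x) * s * x ** a - x * (1.0 - s) * (1.0 - x) ** a

    out, x, t = [], x0, times[0]
    for target in times:
        while t < target - 1e-9:
            h = min(dt, target - t)
            k1 = f(x)
            k2 = f(x + 0.5 * h * k1)
            k3 = f(x + 0.5 * h * k2)
            k4 = f(x + h * k3)
            x += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
            t += h
        out.append(x)
    return out


def misfit(params, times, data, power=2):
    """Sum of |residual|**power: power=2 is least squares, power=1 least absolute."""
    s, a = params
    model = simulate_as(data[0], s, a, times)
    return sum(abs(m - d) ** power for m, d in zip(model, data))


def grid_fit(times, data, power=2):
    """Coarse grid search over (s, a); a simple stand-in for a real optimizer."""
    grid = [(i / 100.0, j / 10.0) for i in range(5, 100, 5) for j in range(1, 21)]
    return min(grid, key=lambda p: misfit(p, times, data, power))


# Synthetic check: noise-free data generated with s = 0.60 and a = 1.3.
times = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
data = simulate_as(0.3, 0.60, 1.3, times)
print(grid_fit(times, data))           # least squares criterion
print(grid_fit(times, data, power=1))  # least absolute value criterion
```

On noise-free data both criteria pick the same parameters; they can diverge on
real census data because squared residuals weight outlying points more heavily
than absolute residuals do.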
Data Accumulation
The datasets that we considered were those that show direct language
competition, meaning that languages other than the main two account for less
than 10% of the population in the area. Most data were taken from national
censuses, and some data manipulation was required.
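The exact data manipulation is not spelled out here. One plausible step,
assumed purely for illustration, is to drop speakers of other languages and
rescale the remaining groups so that their fractions sum to one, as the models
require. The function name and the census shares below are hypothetical.

```python
def renormalize(shares):
    """Rescale the shares of the groups kept in the model so they sum to 1.

    `shares` maps group name -> percentage of the whole population; speakers of
    other languages are left out before calling this function.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("shares must contain a positive total")
    return {group: value / total for group, value in shares.items()}


# Hypothetical census shares (percent of the whole population); the remaining
# 6% speak neither language and are excluded, which is under the 10% cutoff.
kept = {"x_only": 47.0, "y_only": 28.2, "bilingual": 18.8}
fractions = renormalize(kept)
print(fractions)  # x_only 0.5, y_only 0.3, bilingual 0.2 (up to rounding)
```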
1) Welsh-English
This dataset (Table 1) [13] was chosen because it is one of the datasets that
Abrams and Strogatz fit in their paper. The results from fitting it can be
compared to the original paper.
2) Gaelic-English
This is also one of the datasets (Table 2) [13] that Abrams and Strogatz fit in
their paper. The results from fitting it can be compared to their work.
3) Euskera-Spanish (Spain)
Euskera and Spanish are the two main languages spoken in northern Spain.
This dataset (Table 3) only considers people who speak Euskera, Spanish, or
both. People who speak neither are not accounted for. The data were taken from
Sociolinguistic Maps Reports [14].
4) French-English (Canada)
People who speak neither French nor English were not accounted for in this
dataset. The Canadian government has policies that encourage citizens to be
bilingual and that aim to preserve both languages. Data (Table 4) were taken
from Statistics of Canada [15].
Table 1. Welsh-English data.
Year    Welsh (%)    English (%)    Bilingual (%)
1901    15.0         50.0           35.0
1911     8.0         57.0           35.0
1921     6.0         63.0           31.0
1931     4.0         63.0           33.0
1951     2.0         71.0           27.0
1961     1.0         74.0           25.0
1971     1.0         79.0           20.0
1981     1.0         81.0           18.0
1991     0.0         81.0           19.0
2001     0.0         79.0           21.0
Table 2. Gaelic-English data.
Year    Gaelic (%)    English (%)    Bilingual (%)
1891    5.2           27.6           67.2
1901    2.2           32.1           65.7
1911    9.3           41.3           57.7
1921    3.8           47.8           51.9
1931    1.5           56.0           43.9
1951    0.1           75.7           24.3
1961    0.1           82.7           17.3
1971    0.0           86.0           14.0
Table 3. Euskera-Spanish data.
Year    Euskera (%)    Spanish (%)    Bilingual (%)
1991    10.0           84.5           5.5
2001    13.5           78.2           8.3
2006    12.5           81.4           6.1
2011    12.7           80.0           7.3
2016    13.4           79.5           7.1
Table 4. French-English data.
Year    French (%)    English (%)    Bilingual (%)
1996    68.2          14.5           17.3
2001    68.6          13.5           17.9
2006    68.8          13.5           17.7
2011    69.4          12.8           17.8
2016    69.6          12.15          18.2
5) French-English (Montreal)
Since the models assume an even density within the population, we decided to
also look at Montreal, which is a fairly dense city. Values from this dataset
can be compared to values calculated for all of Canada. Data (Table 5) were
also taken from Statistics of Canada [15].
6) Spanish-English (Houston)
We looked at English and Spanish spoken in Houston, Texas. The data (Table
6) were taken from the American census [16].
7) Catalan-Spanish
Catalan and Spanish are very closely related, such that a person who speaks
Spanish will be able to understand someone else speaking Catalan. We chose
this dataset specifically for use with the Mira model, which has a parameter
for the similarity between two languages. This may be challenging, as the
census only goes back to 2003. These data (Table 7) were taken from Language
Use of the Population of Catalonia [17].
8) French-Dutch (Brussels)
This dataset covers French and Dutch spoken in Brussels, Belgium. The dataset
(Table 8) [18] may not be very accurate, as the census only indicates knowledge
of each language, and not whether a person is bilingual.
Table 5. French-English data.
Year    French (%)    English (%)    Bilingual (%)
1996    8.7           40.6           50.7
2001    7.7           38.5           53.8
2006    7.5           39.8           52.8
2011    7.6           37.7           54.8
2017    7.2           36.9           55.9
Table 6. Spanish-English data.
Year    Spanish (%)    English (%)
1970     2.6           97.4
1980    13.0           83.0
1990    30.0           70.0
2000    34.0           76.0
2014    38.0           54.0
3. Results
3.1. Abrams-Strogatz Model
Table 9 summarizes the fitted parameters of the different language competitions
using the AS model. The second s value was calculated by subtracting the first
s value from 1. Most of the s values for the first language are in the
mid-range ($0.4 \leq s \leq 0.6$), except for the competition between Spanish
and Euskera in Spain, where $s_{\text{Euskera}} = 0.7538$ and
$s_{\text{Spanish}} = 0.2462$. The a values acquired were unexpectedly low
($a \leq 1$ for most datasets), since Abrams and Strogatz got a values that
were close to 1.33. This could be caused by the fact that we used a different
fitting routine than Abrams and Strogatz.
The initial values for each parameter were randomized, which could affect the
fitted parameters. This happened in French/English (Canada), French/English
(Montreal), French/Dutch, Spanish/English, and Spanish/Euskera. Parameters
calculated for these datasets turned out to be entirely different depending on
the initial value of the parameter. This behavior does not show in the
Welsh/English and Spanish/English datasets. This is because the two datasets
show the population increasing or decreasing in the rapid growth/decay part of
the curve, while the others either did not show a large change in the fraction
of the population over the years or only contained data for a short period.
The determination of a and s depends heavily on the shape and length of the
rapid growth/decay region of the graphs, so without sufficient data in that
region, the values of a and s could vary depending on what initial value was
given to lsqcurvefit. This problem applies to all three models.
Table 7. Catalan-Spanish data.
Year    Spanish (%)    Catalan (%)    Bilingual (%)
2003    46.0           49.0            4.7
2008    35.6           52.4           12.0
2013    36.3           56.6            6.8
Table 8. French-Dutch data.
Year    French (%)    Dutch (%)
1842    37.6          60.8
1846    28.4          60.3
1866    20.0          39.1
1880    25.0          26.4
1890    20.1          23.0
1900    23.0          19.7
1910    16.4          26.7
1920     8.2          32.8
1930    12.0          33.6
1947     9.6          35.3
Table 9. Parameters of different language competitions using the Abrams-Strogatz model.
Languages                    s         1 − s     a
French/English (Canada)      0.5959    0.4041    1.5110
French/English (Montreal)    0.5754    0.4246    0.8831
French/Dutch                 0.4663    0.5337    0.8537
Gaelic/English               0.4828    0.5172    1.0159
Spanish/English              0.4832    0.5168    0.8439
Spanish/Euskera              0.7538    0.2462    0.1850
Welsh/English                0.4885    0.5115    0.9817
The fitted parameters were also used to predict the outcome of the competition
in each dataset. As expected, the AS model predicts that one language will
disappear, except for French/Dutch in Brussels and Spanish/English in Houston.
For the case of Brussels, this result could stem from the data themselves being
faulty, because the census was not consistent and the data did not show the
steady growth/decay that the model expects.