

Revisit Language Modeling Competition and Extinction: A Data-Driven Validation
Abstract
Keywords
1. Introduction
1.1. Language Modeling
1.2. Abrams-Strogatz Model
1.3. Castelló Model
1.4. Mira Model
1.5. Questions to Be Answered
2. Method
Data Accumulation
3. Results
3.1. Abrams-Strogatz Model
3.2. Castelló Model
3.3. Mira Model
4. Discussions
Acknowledgements
Conflicts of Interest
References
Journal of Applied Mathematics and Physics, 2018, 6, 1558-1570
http://www.scirp.org/journal/jamp
ISSN Online: 2327-4379
ISSN Print: 2327-4352

Revisit Language Modeling Competition and Extinction: A Data-Driven Validation

Chosila Sutantawibul1, Pengcheng Xiao2*, Sarah Richie1, Daniela Fuentes-Rivero2
1Department of Physics, University of Evansville, Evansville, USA
2Department of Mathematics, University of Evansville, Evansville, USA

How to cite this paper: Sutantawibul, C., Xiao, P.C., Richie, S. and Fuentes-Rivero, D. (2018) Revisit Language Modeling Competition and Extinction: A Data-Driven Validation. Journal of Applied Mathematics and Physics, 6, 1558-1570. https://doi.org/10.4236/jamp.2018.67132

Received: May 7, 2018
Accepted: July 27, 2018
Published: July 30, 2018

Copyright © 2018 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). http://creativecommons.org/licenses/by/4.0/
Open Access

Abstract

Understanding language competition and extinction is an interdisciplinary challenge, and mathematical models provide a tool for interpreting linguistic census data and possibly predicting language shift trends at the population scale. In this study, new data from previously examined areas were modeled, specifically Catalan and Spanish in Catalonia; Spanish and English in Houston, Texas; Dutch and French in Brussels; Euskera and Spanish in Spain; and French and English in Canada. Three mathematical models of language competition were validated. The first is the Abrams-Strogatz model, which treats populations as having two monolingual groups. The second is the Castelló model, which considers bilingual speakers. The third is the Mira model, which considers language competition when the two languages are highly similar. It was found that some of the data matched the original Abrams-Strogatz model, but some divergences remain to be addressed. It was also found that the Mira model needs some improvement in how it treats the differences between languages.

Keywords

Language Competition Model, Parameter Estimation, Nonlinear Regression

1. Introduction

Throughout history, languages have changed significantly or gone extinct. It is estimated that 90% of the languages that exist today will be extinct within the next generation [1]. This is due to a multitude of reasons, such as empires conquering regions and coercing the inhabitants to speak their language, and globalization, in which native speakers must learn the languages of their neighbors. This leads to bilingual speakers and, in some cases, to the death of a language entirely [2] [3] [4].

1558 Journal of Applied Mathematics and Physics DOI: 10.4236/jamp.2018.67132 Jul. 30, 2018
C. Sutantawibul et al.

1.1. Language Modeling

Many mathematical models have been proposed to describe the dynamics of competition between two languages in a given region. There are two primary types of models describing language competition: microscopic and macroscopic. Macroscopic models of language competition treat the population as homogeneous (all members are the same and evenly dispersed in an area) and fully connected (all members interact with all other members) [5]. These macroscopic models are usually described by differential equations. Microscopic models treat individual persons as nodes in a network, allowing each node to be connected to a certain number of other nodes and to have its own transition probabilities. In this paper, we focus on three macroscopic language competition models: the Abrams-Strogatz model, the Castelló model, and the Mira model.

1.2. Abrams-Strogatz Model

Abrams and Strogatz proposed one of the first models to describe language competition using statistical physics and complex systems, which prompted the publication of other models built on similar ideas [6]. Their model, the Abrams-Strogatz (AS) model, describes language competition in a way similar to the Lotka-Volterra predator-prey model, except that both languages act as predator and prey to each other, since a speaker can switch from one language to the other and vice versa. It assumes that the population is homogeneous and fully connected, which may not represent rural areas with sparse population or geographical separation, but could describe densely populated areas such as cities. This model is described by the differential equations [7]:

$$\frac{dx}{dt} = y P_{yx}(x, s) - x P_{xy}(x, s), \qquad \frac{dy}{dt} = x P_{xy}(x, s) - y P_{yx}(x, s) \tag{1}$$

where $x$ and $y$ are the fractions of the population speaking languages $x$ and $y$, respectively, which means that the two fractions sum to one. $P_{xy}$ is the probability that an individual switches from speaking language $x$ to language $y$. The probabilities are defined by:

$$P_{yx} = s x^{a}, \qquad P_{xy} = (1 - s)(1 - x)^{a} \tag{2}$$

where $a$ is the volatility of a language, or how easy it is for an individual to switch over to the other language, and $s$ is the prestige of language $x$, that is, how attractive the language is to switch to. These two parameters are acquired by fitting the model to data on the population speaking a specific language in an area.

Equation (1) can be viewed as a pair of rate equations, where the change in the population of language $x$ is simply the population of language $y$ times the probability of people speaking $y$ changing to $x$, minus the population of language $x$ times the probability of people speaking $x$ changing to $y$. This model considers the speakers of each language to be strictly monolingual.

At high volatility ($a > 1$), the few stable states of this model (states in which the fractions of the population speaking each language no longer change) are those in which the entire population speaks one language while the other dies ($s \neq 0.5$) and the one in which both languages have the same number of speakers ($s = 0.5$). Since the condition for stability where both languages survive is so precise, the AS model almost always predicts that one language will eventually go extinct while the rest of the population adopts the other language.

1.3. Castelló Model

Inspired by the original proposal of Wang and Minett [8], the model of Castelló et al. extends the AS model by considering a third possible state for the population, namely bilingual. This allows the population to change from speaking only language $x$, to speaking both languages ($b$), to speaking only language $y$, and vice versa. This model also assumes a homogeneous and fully connected population. The presence of a third, intermediate state slows down the process of language extinction, but still does not indefinitely prevent it [9]. The differential equations that describe this model are:

$$\begin{aligned}
\frac{dx}{dt} &= y P_{YX} + b P_{BX} - x \left( P_{XY} + P_{XB} \right) \\
\frac{dy}{dt} &= x P_{XY} + b P_{BY} - y \left( P_{YX} + P_{YB} \right) \\
\frac{db}{dt} &= x P_{XB} + y P_{YB} - b \left( P_{BX} + P_{BY} \right)
\end{aligned} \tag{3}$$

Again, these equations are simply rate equations, with the probabilities:

$$\begin{aligned}
P_{XB} &= (1 - s)\, y^{a} \\
P_{YB} &= s\, x^{a} \\
P_{BX} &= s\, (1 - y)^{a} \\
P_{BY} &= (1 - s)(1 - x)^{a}
\end{aligned} \tag{4}$$

Both qualitative and quantitative analyses were carried out on complex networks and two-dimensional square lattices; details are given in Ref. [10]. Castelló et al. found that there exists a transition from a state in which one language dominates to a state of language coexistence, and that maintaining the coexistence state is very challenging when bilingual speakers are present. The parameters in this model are also acquired by fitting the model to data.

1.4. Mira Model

The model of Mira et al. is also an extension of the AS model. It adds to the AS model by 1) introducing bilingual speakers, and 2) introducing an extra factor, $k$, that describes the similarity between the two languages. The parameters $a$, $s$, and $k$ are all acquired by fitting the model to the data as well. Mira discusses the possibility of calculating $k$ from the similarity of the languages in terms of words, grammar, and structure. Mira takes $k = 1$ to be the situation where the languages are identical and $k = 0$ to be the situation where the languages are entirely different. The process of calculating $k$ can be very complicated and has yet to be developed [11]. The differential equations that describe Mira's model are the same as Equation (3), but the transition probabilities are different, as they must contain the $k$ value [12]:

$$\begin{aligned}
P_{XB} &= k\, s_{Y} (1 - x)^{a} \\
P_{XY} &= (1 - k)\, s_{Y} (1 - x)^{a} \\
P_{YB} &= k\, s_{X} (1 - y)^{a} \\
P_{YX} &= (1 - k)\, s_{X} (1 - y)^{a}
\end{aligned} \tag{5}$$

where $s_{X}$ and $s_{Y}$ are the prestige values of languages $x$ and $y$.

Mira's work focuses on the time evolution of two coexisting languages (Castilian Spanish and Galician) within the framework of the AS model. It claims that if the competing languages are similar enough, a stable bilingual situation is possible. A sufficiently large value of $k$ is needed for this situation [6] [12].

1.5. Questions to Be Answered

Although previous studies found that a constant volatility fits their data, this is something that can still be examined with more data. The prestige of other languages can also be determined if other data sets are considered. A further question is how these models can be extended or improved. Given the wide range of areas where language competition exists, looking at more data sets offers more possibilities for improving these models, especially the Mira and Castelló models. In this work, we focus on macroscopic models. Macroscopic modeling has also been reported more frequently, which makes it easier to check whether our results are accurate.

The paper is organized as follows. Section 2 describes the method for the model validation.
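The rate equations of Section 1 can be made concrete with a short numerical sketch. The Python code below is not the authors' implementation (the paper works in MATLAB); it is a minimal illustration that integrates the AS model, Equations (1) and (2), and the three-state system of Equation (3) with a hand-written fixed-step fourth-order Runge-Kutta scheme. The function names, the step size, and the parameter values are illustrative assumptions. For the three-state case the sketch assumes the transition probabilities $P_{XB} = (1-s)y^{a}$, $P_{YB} = s x^{a}$, $P_{BX} = s(1-y)^{a}$, $P_{BY} = (1-s)(1-x)^{a}$, with no direct switching between the monolingual groups ($P_{XY} = P_{YX} = 0$).

```python
def clip01(v):
    """Keep a population fraction inside [0, 1] so fractional powers stay real."""
    return min(1.0, max(0.0, v))


def as_rhs(state, s, a):
    """Abrams-Strogatz model, Eqs. (1)-(2), using y = 1 - x."""
    x = clip01(state[0])
    y = 1.0 - x
    p_yx = s * x ** a                    # probability of switching y -> x
    p_xy = (1.0 - s) * (1.0 - x) ** a    # probability of switching x -> y
    return [y * p_yx - x * p_xy]


def bilingual_rhs(state, s, a):
    """Three-state system, Eq. (3), with the probabilities assumed in the lead-in."""
    x, y, b = (clip01(v) for v in state)
    p_xb = (1.0 - s) * y ** a
    p_yb = s * x ** a
    p_bx = s * (1.0 - y) ** a
    p_by = (1.0 - s) * (1.0 - x) ** a
    p_xy = p_yx = 0.0                    # assumption: no direct x <-> y switching
    dx = y * p_yx + b * p_bx - x * (p_xy + p_xb)
    dy = x * p_xy + b * p_by - y * (p_yx + p_yb)
    db = x * p_xb + y * p_yb - b * (p_bx + p_by)
    return [dx, dy, db]


def rk4(rhs, state, t_end, dt, **params):
    """Integrate d(state)/dt = rhs(state) from t = 0 to t_end with fixed-step RK4."""
    state = list(state)
    steps = int(round(t_end / dt))
    for _ in range(steps):
        k1 = rhs(state, **params)
        k2 = rhs([v + 0.5 * dt * k for v, k in zip(state, k1)], **params)
        k3 = rhs([v + 0.5 * dt * k for v, k in zip(state, k2)], **params)
        k4 = rhs([v + dt * k for v, k in zip(state, k3)], **params)
        state = [v + dt * (p + 2 * q + 2 * r + w) / 6.0
                 for v, p, q, r, w in zip(state, k1, k2, k3, k4)]
    return state


if __name__ == "__main__":
    # Illustrative parameters only (not fitted values from the paper).
    print(rk4(as_rhs, [0.4], t_end=100.0, dt=0.05, s=0.6, a=1.31))
    print(rk4(bilingual_rhs, [0.4, 0.5, 0.1], t_end=100.0, dt=0.05, s=0.6, a=1.31))
```

With $a > 1$ and $s \neq 0.5$, the AS run drifts toward one language taking over the whole population, and which language wins depends on the starting fraction. In the three-state run the fractions $x + y + b$ stay equal to one, because the right-hand sides of Equation (3) sum to zero.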
The first part of Section 2 introduces the method we used for computing the parameters, while the second part describes the data accumulated from eight different regions. In Section 3, we present the parameter fitting results based on the data from Section 2. The paper concludes with a discussion in Section 4.

2. Method

All the models are coded and fitted using MATLAB. The differential equations are solved using ode45. Since ode45 only has medium accuracy, ode113 is used when higher accuracy is needed. To find the parameters, lsqcurvefit is used. lsqcurvefit performs least squares regression: it computes the distance from the fitted curve to each data point and finds the parameters that make the sum of the squares of these distances smallest. This method is different from that of Abrams and Strogatz, who wrote their own routines to solve the differential equation numerically and to compute the parameters. They used least absolute value regression rather than least squares regression, which may lead to discrepancies between their parameters and ours [7].

Data Accumulation

The datasets that we considered were those with direct language competition, meaning that languages other than the main two account for less than 10% of the population in the area. Most data were taken from national censuses, and some data manipulation was required.

1) Welsh-English

These data (Table 1) [13] were chosen because they are among the data that Abrams and Strogatz fit in their paper. The results from fitting them can be compared to the original paper.

Table 1. Welsh-English data.

| Year | Welsh (%) | English (%) | Bilingual (%) |
|------|-----------|-------------|---------------|
| 1901 | 15.0 | 50.0 | 35.0 |
| 1911 | 8.0 | 57.0 | 35.0 |
| 1921 | 6.0 | 63.0 | 31.0 |
| 1931 | 4.0 | 63.0 | 33.0 |
| 1951 | 2.0 | 71.0 | 27.0 |
| 1961 | 1.0 | 74.0 | 25.0 |
| 1971 | 1.0 | 79.0 | 20.0 |
| 1981 | 1.0 | 81.0 | 18.0 |
| 1991 | 0.0 | 81.0 | 19.0 |
| 2001 | 0.0 | 79.0 | 21.0 |

2) Gaelic-English

These are also among the data (Table 2) [13] that Abrams and Strogatz fit in their paper. The results from fitting them can be compared to their work.

Table 2. Gaelic-English data.

| Year | Gaelic (%) | English (%) | Bilingual (%) |
|------|------------|-------------|---------------|
| 1891 | 5.2 | 27.6 | 67.2 |
| 1901 | 2.2 | 32.1 | 65.7 |
| 1911 | 9.3 | 41.3 | 57.7 |
| 1921 | 3.8 | 47.8 | 51.9 |
| 1931 | 1.5 | 56.0 | 43.9 |
| 1951 | 0.1 | 75.7 | 24.3 |
| 1961 | 0.1 | 82.7 | 17.3 |
| 1971 | 0.0 | 86.0 | 14.0 |

3) Euskera-Spanish (Spain)

Euskera and Spanish are the two main languages spoken in northern Spain. These data (Table 3) only consider people who speak Euskera, Spanish, or both. People who speak neither are not accounted for. The data were taken from the Sociolinguistic Maps Reports [14].

Table 3. Euskera-Spanish data.

| Year | Euskera (%) | Spanish (%) | Bilingual (%) |
|------|-------------|-------------|---------------|
| 1991 | 10.0 | 84.5 | 5.5 |
| 2001 | 13.5 | 78.2 | 8.3 |
| 2006 | 12.5 | 81.4 | 6.1 |
| 2011 | 12.7 | 80.0 | 7.3 |
| 2016 | 13.4 | 79.5 | 7.1 |

4) French-English (Canada)

People who speak neither French nor English were not accounted for in this dataset. The Canadian government has policies that support bilingualism among its citizens and preserve both languages. Data (Table 4) were taken from Statistics Canada [15].

Table 4. French-English data (Canada).

| Year | French (%) | English (%) | Bilingual (%) |
|------|------------|-------------|---------------|
| 1996 | 68.2 | 14.5 | 17.3 |
| 2001 | 68.6 | 13.5 | 17.9 |
| 2006 | 68.8 | 13.5 | 17.7 |
| 2011 | 69.4 | 12.8 | 17.8 |
| 2016 | 69.6 | 12.15 | 18.2 |

5) French-English (Montreal)

Since the models assume even density within the population, we decided to also look at Montreal, which is a fairly dense city. Values from this dataset can be compared to values calculated for all of Canada. Data (Table 5) were also taken from Statistics Canada [15].

Table 5. French-English data (Montreal).

| Year | French (%) | English (%) | Bilingual (%) |
|------|------------|-------------|---------------|
| 1996 | 8.7 | 40.6 | 50.7 |
| 2001 | 7.7 | 38.5 | 53.8 |
| 2006 | 7.5 | 39.8 | 52.8 |
| 2011 | 7.6 | 37.7 | 54.8 |
| 2017 | 7.2 | 36.9 | 55.9 |

6) Spanish-English (Houston)

We looked at English and Spanish spoken in Houston, Texas. The data (Table 6) were taken from the American census [16].

Table 6. Spanish-English data.

| Year | Spanish (%) | English (%) |
|------|-------------|-------------|
| 1970 | 2.6 | 97.4 |
| 1980 | 13.0 | 83.0 |
| 1990 | 30.0 | 70.0 |
| 2000 | 34.0 | 76.0 |
| 2014 | 38.0 | 54.0 |

7) Catalan-Spanish

Catalan and Spanish are very closely related, such that a person who speaks Spanish will be able to understand someone speaking Catalan. We chose this dataset specifically for use with the Mira model, which has a parameter for the similarity between two languages. This may be challenging, as the census only goes back to 2003. These data (Table 7) were taken from Language Use of the Population of Catalonia [17].

Table 7. Catalan-Spanish data.

| Year | Spanish (%) | Catalan (%) | Bilingual (%) |
|------|-------------|-------------|---------------|
| 2003 | 46.0 | 49.0 | 4.7 |
| 2008 | 35.6 | 52.4 | 12.0 |
| 2013 | 36.3 | 56.6 | 6.8 |

8) French-Dutch (Brussels)

These data cover French and Dutch as spoken in Brussels, Belgium. This dataset (Table 8) [18] may not be very accurate, as the census only indicates knowledge of each language, and not whether a person is bilingual.

Table 8. French-Dutch data.

| Year | French (%) | Dutch (%) |
|------|------------|-----------|
| 1842 | 37.6 | 60.8 |
| 1846 | 28.4 | 60.3 |
| 1866 | 20.0 | 39.1 |
| 1880 | 25.0 | 26.4 |
| 1890 | 20.1 | 23.0 |
| 1900 | 23.0 | 19.7 |
| 1910 | 16.4 | 26.7 |
| 1920 | 8.2 | 32.8 |
| 1930 | 12.0 | 33.6 |
| 1947 | 9.6 | 35.3 |

3. Results

3.1. Abrams-Strogatz Model

Table 9 summarizes the fitted parameters of the different language competitions using the AS model. The second $s$ value was calculated by subtracting the first $s$ value from 1. Most of the $s$ values for the first language are in the mid-range ($0.4 \le s \le 0.6$), except for the competition between Spanish and Euskera in Spain, where $s_{\text{Spanish}} = 0.7538$ and $s_{\text{Euskera}} = 0.2462$. Most of the $a$ values acquired were unexpectedly low ($a \le 1$), since Abrams and Strogatz got $a$ values that were close to 1.33. This could be caused by the fact that we used a different fitting routine than Abrams and Strogatz.

Table 9. Parameters of different language competitions using the Abrams-Strogatz model.

| Languages | s (first language) | s (second language) | a |
|-----------|--------------------|---------------------|---|
| French/English (Canada) | 0.5959 | 0.4041 | 1.5110 |
| French/English (Montreal) | 0.5754 | 0.4246 | 0.8831 |
| French/Dutch | 0.4663 | 0.5337 | 0.8537 |
| Gaelic/English | 0.4828 | 0.5172 | 1.0159 |
| Spanish/English | 0.4832 | 0.5168 | 0.8439 |
| Spanish/Euskera | 0.7538 | 0.2462 | 0.1850 |
| Welsh/English | 0.4885 | 0.5115 | 0.9817 |

The initial values for each parameter were randomized, which could affect the fitted parameters. This happened in French/English (Canada), French/English (Montreal), French/Dutch, Spanish/English, and Spanish/Euskera. Parameters calculated for these datasets turned out to be entirely different depending on the initial value of the parameter. This behavior does not show in the Welsh/English and Spanish/English datasets. This is because the two datasets show the population increasing/decreasing in the rapid growth/decay part of the curve, while the others either did not show a large change in the fraction of the population over the years or only contained data for a short period. The determination of $a$ and $s$ depends heavily on the shape and length of the rapid growth/decay region of the graphs, so without sufficient data in that region, the values of $a$ and $s$ could vary depending on what initial value was given to lsqcurvefit. This problem applies to all three models.

The calculated parameters were also used to predict the outcome of the competition between each pair of languages. As expected, the AS model predicts that one language will disappear, except for French/Dutch in Brussels and Spanish/English in Houston. For Brussels, this result could stem from faulty data: the census was not consistent, and the data did not show the steady growth/decay that the model expects.
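The sensitivity to the initial guess described in Section 3.1 can be illustrated with a small experiment. The sketch below is a stand-in built on several assumptions, not the authors' MATLAB code. It replaces ode45 with a fixed-step Runge-Kutta integrator and lsqcurvefit with an exhaustive grid search over $(s, a)$. It uses synthetic data generated from the AS model itself (roughly $s = 0.6$, $a = 1.3$, $x(0) = 0.4$, chosen only for illustration) rather than any census table. The same model is fitted to a long record that covers the rapid growth region and to a short, nearly flat record, and the code counts how many parameter pairs reproduce each record to within half a percentage point (the hypothetical tolerance `tol`).

```python
import math


def as_rate(x, s, a):
    """Right-hand side of Eq. (1) for x, with y = 1 - x and Eq. (2)."""
    x = min(1.0, max(0.0, x))
    return (1.0 - x) * s * x ** a - x * (1.0 - s) * (1.0 - x) ** a


def simulate(times, x0, s, a, dt=0.1):
    """Return x at each sample time using fixed-step RK4 (a stand-in for ode45)."""
    xs, x = [x0], x0
    for t0, t1 in zip(times, times[1:]):
        n = max(1, round((t1 - t0) / dt))
        h = (t1 - t0) / n
        for _ in range(n):
            k1 = as_rate(x, s, a)
            k2 = as_rate(x + 0.5 * h * k1, s, a)
            k3 = as_rate(x + 0.5 * h * k2, s, a)
            k4 = as_rate(x + h * k3, s, a)
            x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        xs.append(x)
    return xs


def rmse(times, data, s, a):
    """Root-mean-square residual between the model curve and the data."""
    model = simulate(times, data[0], s, a)
    return math.sqrt(sum((m - d) ** 2 for m, d in zip(model, data)) / len(data))


def grid_fit(times, data, s_grid, a_grid, tol=0.005):
    """Least-squares fit by grid search; also count pairs with RMSE below tol."""
    scores = [(rmse(times, data, s, a), s, a) for s in s_grid for a in a_grid]
    best = min(scores)
    near_optimal = sum(1 for err, _, _ in scores if err < tol)
    return best, near_optimal


S_GRID = [0.40 + 0.01 * i for i in range(41)]     # 0.40 ... 0.80
A_GRID = [0.8 + 0.1 * j for j in range(11)]       # 0.8 ... 1.8
TRUE_S, TRUE_A, X0 = S_GRID[20], A_GRID[5], 0.4   # about 0.6 and 1.3

long_times = [5.0 * i for i in range(11)]     # covers the rapid growth region
short_times = [0.25 * i for i in range(5)]    # short, nearly flat record
long_data = simulate(long_times, X0, TRUE_S, TRUE_A)      # synthetic data
short_data = simulate(short_times, X0, TRUE_S, TRUE_A)    # synthetic data

if __name__ == "__main__":
    for name, t, d in (("long", long_times, long_data),
                       ("short", short_times, short_data)):
        (err, s, a), count = grid_fit(t, d, S_GRID, A_GRID)
        print(name, "best s, a:", round(s, 2), round(a, 1), "pairs within tol:", count)
```

In this synthetic setting the grid search recovers the generating parameters for both records, but far more $(s, a)$ pairs fall within the tolerance for the short record than for the long one. A local optimizer started from different initial values would therefore be expected to stop at different, similarly good pairs when the data do not cover the rapid growth/decay region.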