Language Variation Suite: A theoretical and methodological contribution for linguistic data analysis

In recent years there has been growing interest in quantitative methods for analyzing linguistic data. Advanced multifactorial statistical analyses, such as inferential trees and mixed-effects logistic regression models, have become more accessible for linguistic research as a result of the availability of an open source programming environment provided by the statistical software R. In the present paper, we introduce a novel toolkit, Language Variation Suite, a software program that offers a friendly environment for conducting quantitative analyses. We demonstrate how theory built on traditional monofactorial analysis can be extended to macro and micro multifactorial approaches allowing for a deeper understanding of language variation. The focus of the analysis is based on intervocalic /d/ deletion in Spanish from the Diachronic Study of the Speech of Caracas 1987 and 2004-2010. In contrast to traditional methodological approaches we have treated intervocalic /d/ as a continuous dependent variable according to the intensity ratio measurements obtained. Furthermore, we have integrated various syntactic, phonetic and sociolinguistic factors. Non-parametric and fixed-effects regression models revealed that overall age (younger speakers), sex (male speakers), phonetic context (low vowels), token frequency and morphosyntactic category (past participles) have a significant effect on the lenition of intervocalic /d/. In contrast, the mixed-effects model selected only phonetic context, frequency and category, showing that individual speaker variation is higher than group variation.

techniques applied to sociolinguistics have become obvious. First, the nature of sociolinguistic observation deviates from the assumption of normally distributed, balanced and independent data. As Díaz-Campos and Dickinson state, "sociolinguistic studies are based on correlated data". Furthermore, traditional sociolinguistic tools are unable to handle continuous or multinomial variables. Finally, traditional tests do not capture individual-level or word-level variation (Johnson 2016). In fact, these statistical practices have already been abandoned in other scientific fields in favor of new statistical methods, such as mixed regression models, conditional trees, and random tree analysis (Kidhardt 2015). Not only can these models measure individual and lexical variability, but they can also handle skewed and small corpora, which are common in linguistic data. While new methods have recently gained a lot of attention in the sociolinguistic literature, their use remains limited, as they require some programming skills, which often presents technological challenges for researchers. There is also a need to test new tools "for their comparability and reliability in the study of language, variation and change" (Díaz-Campos and Dickinson 2017).
In this paper, we propose to address these issues by introducing a user-friendly application, Language Variation Suite, that implements state-of-the-art statistical methods. In addition to mixed-effects models and regression tree analysis, this toolkit allows researchers to incorporate continuous and multinomial variables. Furthermore, we examine the weakening of intervocalic /d/ in Spanish, a gradable phonological phenomenon that has traditionally been treated in sociolinguistic studies as a categorical variable. We show that the conceptualization of sociophonological variables as continuous provides greater precision and a better understanding of sound change, as it incorporates accurate acoustic criteria that take into account the gradient nature of phonological variables.
The remainder of this paper is organized as follows: Section 2 reviews sociophonetic variables and discusses traditional and current practices of sociolinguistic data analysis. Section 3 describes our corpus and methodology. Section 4 introduces a novel toolkit for sociolinguistic data analysis, Language Variation Suite. Section 5 presents results and discussion. Section 6 draws conclusions and provides future directions for our research.

Sociophonetic variable.
2.1. INTERVOCALIC /D/. Lenition of intervocalic /d/ is one of the most studied phenomena in the dialectological and sociolinguistic literature dedicated to Spanish. The realization of intervocalic /d/ as approximant [δ], e.g. lado [laδo] 'side', is a systematic articulatory reductive process (Navarro Tomás 1999 [1918]). On the other hand, deletion of intervocalic /d/ is one of the most extreme manifestations of reduction, as in cantado ~ cantao 'sung' (Hualde et al. 2011). In fact, cases of /d/ deletion have been documented since the 17th century (Zamora 1970; Lapesa 1981) and have been found abundantly in many varieties of Spanish (Spain: Navarro Tomás 1999 [1918]; Latin America: Henriquez Ureña 1921; Venezuela: Lipski 1994). In dialectological and sociolinguistic studies, this phenomenon has traditionally been conceptualized as a discrete binary phenomenon based on auditory analysis, namely the presence or absence of /d/. Subsequent quantitative studies have shown that the realization of intervocalic /d/ is influenced by linguistic and extra-linguistic factors. That is, the choice between d-deletion and d-retention is systematic and sociolinguistically predictable. For example, Cedergren (1973) found that in the Spanish of Panama d-deletion is favored in informal styles by women, older speakers and lower socioeconomic participants from rural areas, while Padilla (1996) showed that in Las Palmas de Gran Canaria d-deletion occurs mostly in the past participle with -ado and is favored by male speakers. Similarly, D'Introno and Sosa (1986) found that male speakers favor deletion in the Venezuelan variety of Spanish, whereas d-retention is favored by high and middle socioeconomic groups. Furthermore, Díaz-Campos and Gradoville (2011) revealed the frequency effect on d-deletion in the same variety. Their results show that high lexical frequency and type frequency predict higher deletion rates.
In contrast to the traditional binary approach, acoustic studies have shown considerable variation in the realization of intervocalic /d/. In this view, the degree of /d/ lenition can vary from very close (consonant-like) to very open (vowel-like) realizations of the approximant [δ] (Carrasco 2008, Hualde et al. 2011). This degree is commonly measured by means of relative intensity: i) intensity difference, i.e. the difference between the lowest intensity point of the approximant and the highest intensity point of the following vowel (Eddington 2011, Simonet et al. 2012), or ii) intensity ratio, i.e. the ratio between the lowest intensity point of the approximant and the highest intensity point of the following vowel (Carrasco et al. 2012). As a result, lenition is conceptualized as a continuous variable. According to recent acoustic studies, more lenited or vowel-like realizations of /d/ occur in the following contexts: i) before a stressed vowel (Colantoni and Marinescu 2010), ii) in word-medial position (Eddington 2011), iii) with higher frequency words (Eddington 2011). This continuous scale for intervocalic /d/ allows for greater precision, as it is based on more accurate acoustic measurements.
It should be noted that the previous accounts of /d/ lenition have the following limitations: i) many acoustic studies rely on a monofactorial analysis of the relation between the realization of intervocalic /d/ and one predictor, e.g. duration or vowel context, prosodic context, stress (Simonet et al. 2012, Limanni 2009, Torreira and Ernestus 2011, etc.), ii) most studies examine apparent time variation, namely the comparison between speakers of different age groups during the same chronological time period, and iii) most sociolinguistic multifactorial studies are based on auditory discrete analysis, which is affected by researchers' perception.

2.2. TRADITIONAL AND NOVEL PRACTICES IN SOCIOLINGUISTICS.
The foundation of traditional variable rule practices in the sociolinguistic field was introduced in Labov's classic study on copula deletion (1969). Labov observed that language variation is inherent and systematic, in contrast to previous views on language variation that treated it as optional or free (Cedergren 1974:333). In this approach, language variation, denoted as a linguistic variable, represents "two or more ways of saying the same thing" (Labov 1972:271). Furthermore, each context is independent from other contexts and has a fixed effect, which is based on the presence or absence of a given feature (Cedergren and Sankoff 1974:335). This Variable Rule model enables researchers to incorporate the combination of sociolinguistic and linguistic environments in which the linguistic variable occurs (Labov 1969). This model was implemented in the first sociolinguistic statistical tool, VARBRUL, which was replaced by an improved version, GoldVarb (Sankoff et al. 2005). For several decades, the variable rule program has been successfully employed in many sociolinguistic studies, allowing researchers to identify which sociolinguistic factors influence phonological variation. While this program helps analyze the multifactorial interplay of social and linguistic factors, the categories of the Variable Rule model must be discrete, and factor groups with 100% or 0% must be excluded. As Díaz-Campos and Dickinson (2017) point out, this design was a product of "linguistic theories at the time where linguistic features were conceived as [+/-]". Furthermore, the underlying assumption of logistic regression implemented in GoldVarb is independence of observations. Recently, it has been argued that linguistic variables are rarely independent and that "many potential predictors are in a nesting relationship with speaker or word" (Johnson 2016). Thus, to improve traditional variable rule analysis and allow for non-discrete continuous predictors, the mixed-effects model has been introduced into the sociolinguistic field. It has been shown that this new model "returns more accurate p-values compared to a fixed-effects model that ignores nesting" (Gorman and Johnson 2013:223). This model is available in many types of statistical software, such as PROC GENMOD in SAS, the glm function in R and Stata (Agresti 2007:67), and has also been implemented in a new sociolinguistic toolkit, Rbrul (Johnson 2009).
Finally, there has also been growing interest in using visual statistical methods such as random forests and conditional inference trees to enhance the Variable Rule model and address its limitations (Tagliamonte and Baayen 2012). Random forest and tree-based methods are referred to as non-parametric regression tests. Conditional inference trees (partykit package) estimate the distribution of a response (i.e. a dependent variable) by means of recursive partitioning (Hothorn and Zeileis 2015). In this approach, "the feature space is recursively split into regions containing observations with similar response values" (Strobl et al. 2009:324). This tree-based analysis has been successfully used for multivariate data exploration in many scientific fields. While such a non-parametric approach is relatively new in sociolinguistics, recent studies have shown that random forests "provide the closest fits to the data" (Tagliamonte and Baayen 2012:32) and that conditional trees help "visualize different combinations of factors (independent variables or fixed effects) and their significance" (Díaz-Campos and Dickinson 2017:4). These advanced practices enable researchers to handle imbalanced data, measure individual variation and rank variables according to their significance (Strobl et al. 2009); their implementation, however, requires some programming skills (e.g. the R programming language) or access to statistical tools that are not always freely available. In addition, given the vast number of available statistical tests, the question has been raised as to how these current practices affect sociolinguistic studies and what their advantages and disadvantages are for studies of language variation. To answer these questions, it is necessary to compare and contrast both approaches, the traditional variable rule model and the innovative models (Johnson 2009, 2016; Eddington 2010; Tagliamonte 2011, 2012; Díaz-Campos and Dickinson 2016).
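The idea of recursive partitioning quoted above can be illustrated with a minimal Python sketch. It is not the algorithm used by the partykit package: conditional inference trees select splits with permutation-based significance tests, whereas this sketch uses a simpler sum-of-squares criterion in the style of classical regression trees. The function names, the single numeric predictor and the toy ages and intensity ratios are all hypothetical and serve only to show how the data are split into regions with similar response values.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean of a non-empty list."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys, min_leaf):
    """Return the threshold on xs that yields the two most homogeneous
    groups of ys (lowest total SSE), or None if no split is allowed."""
    best_t, best_cost = None, None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        cost = sse(left) + sse(right)
        if best_cost is None or cost < best_cost:
            best_t, best_cost = t, cost
    return best_t


def partition(xs, ys, max_depth=1, min_leaf=2):
    """Recursively split the data and return the tree as a nested dict."""
    t = best_split(xs, ys, min_leaf) if max_depth > 0 else None
    if t is None:
        # Terminal node: report its size and mean response.
        return {"n": len(ys), "mean": mean(ys)}
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return {
        "threshold": t,
        "left": partition([x for x, _ in left], [y for _, y in left],
                          max_depth - 1, min_leaf),
        "right": partition([x for x, _ in right], [y for _, y in right],
                           max_depth - 1, min_leaf),
    }


# Hypothetical toy data: speaker age and intensity ratio of /d/.
ages = [22, 25, 30, 40, 50, 60, 65, 70]
ratios = [0.97, 0.96, 0.98, 0.95, 0.96, 0.90, 0.89, 0.91]
tree = partition(ages, ratios)
print(tree)
```

With these toy values, the single split separates the five youngest speakers from the three oldest, which mirrors the kind of age split a tree-based analysis can reveal.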

Methodology.
3.1. CORPUS. The data used in this project come from a diachronic corpus, Corpus histórico del habla caraqueña 1987 y 2004-2010 (CHHC'87/04-10) 'Diachronic Study of the Speech of Caracas 1987 and 2004-2010' (Bentivoglio and Sedano 1993, Bentivoglio and Malaver 2006). This corpus consists of one hundred sixty half-hour sociolinguistic interviews with audio recordings and transcripts, conducted with native speakers from Caracas. The current research focuses on a subset of thirty-two speakers who are equally divided among three age groups (20-34, 35-54, 55 and older), both genders and three socioeconomic groups (upper, middle, lower). For this study, we included only word-internal instances of intervocalic /d/ (e.g. ocupado 'busy', vida 'life'). In addition, we included cases where the preceding or following vowel was a diphthong (e.g. cambiado 'changed', fastidiar 'to annoy'). A total of 1,031 tokens containing intervocalic /d/ was collected from this corpus.
3.2. ACOUSTIC ANALYSIS. Acoustic analysis was performed using PRAAT (Boersma 2001). We manually segmented sections of sound waves corresponding to intervocalic /d/ and its preceding and following vowels. The acoustic measurements for intervocalic /d/ were obtained by using the relative intensity ratio method described in Carrasco et al. (2012). This method requires two measurements from the intensity curve: the lowest intensity point of /d/ and the highest intensity point of a vowel. The intensity ratio is calculated by dividing the lowest intensity point of /d/ by the vowel's highest intensity point. PRAAT scripts were developed to extract the highest and the lowest points and to calculate the intensity ratio. A sample script is illustrated in Figure 1. The obtained ratio provides a value between 1, a more vowel-like production, and 0, a more stop-like production. For example, Figure 2 demonstrates an instance of a more lenited /d/ in comunicado 'informed' (ratio = 0.98), and Figure 3
For the purpose of our investigation, the intensity ratio serves as the dependent continuous variable. The summary of our independent variables is illustrated in Table 1. Individual participants and word tokens are also included as variables for mixed-effects regression analysis to measure variability between speakers and word-specific effects. The codified data are stored in CSV (comma-separated values) format, which makes the data easy to manage and analyze. In the next section, we describe our new statistical tool for data analysis.
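The intensity ratio described in Section 3.2 can be expressed as a short calculation. The sketch below is written in Python rather than as a PRAAT script and is not the script shown in Figure 1; the function name and the sample intensity values (in dB) are hypothetical, and it assumes the intensity contour has already been sampled within the manually segmented /d/ and vowel intervals.

```python
def intensity_ratio(consonant_db, vowel_db):
    """Divide the lowest intensity point of /d/ by the highest
    intensity point of the adjacent vowel."""
    if not consonant_db or not vowel_db:
        raise ValueError("both segments need at least one intensity sample")
    return min(consonant_db) / max(vowel_db)


# Hypothetical intensity samples (dB) for one token.
d_segment = [71.2, 69.8, 70.5]
vowel_segment = [72.0, 73.5, 72.8]

ratio = intensity_ratio(d_segment, vowel_segment)
print(f"intensity ratio = {ratio:.3f}")
```

A value close to 1 indicates that the consonant is almost as intense as the vowel, that is, a more lenited, vowel-like production.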

Language Variation Suite.
Previous sociolinguistic tools, such as GoldVarb and Rbrul, were designed to run on personal computers. As a result, they require installation and use the computer's memory. That is, a particularly large dataset may need to run for several hours to perform analysis, depending on the user's hardware. Furthermore, not all tests are available in such applications. For instance, while Rbrul carries out a mixed-effects regression analysis, it does not include conditional tree and random forest analyses. Recently, a new programming environment, R, has received attention in the sociolinguistic literature. As Tagliamonte (2011) points out, "R is exponentially more powerful tool for statistical analysis than Goldvarb or Rbrul" (2011:168). R has already been used in psycholinguistics, and it has started gaining popularity for the analysis of linguistic data (Jenset 2010). However, it involves a steep learning curve and has no user-friendly interface (Tagliamonte 2011).
We propose a new tool, Language Variation Suite, built on the R statistical environment and designed with a user-friendly interface (see Figure 4). In addition, our program runs online and does not require installation or local memory. Furthermore, this application carries out state-of-the-art statistical tests, e.g. conditional inference trees, cluster analysis and random forests, as well as graphical data visualization. As a result, Language Variation Suite makes advanced statistical methods accessible to a broader audience, as its use does not require programming skills. Various statistical R packages are used in this program, e.g. mlogit, lme4, randomForest, wordcloud, ca, stats. The architecture of this tool consists of two components: i) a server script and ii) a user-interface definition. The server script contains code for various functions and expressions, e.g. renderPlot or renderTable. The user-interface definition controls the HTML output of these functions and defines which functions require user input (interaction) and which functions return output. To illustrate this program, we provide samples of an R script and its output on the interface. Figure 5 demonstrates a function for selecting a statistical model. The user has to select the type of model, fixed or mixed, and the type of dependent variable, binary or continuous. The code for this function is shown on the left, and the resulting HTML output on the interface is shown on the right.
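The two choices the user makes, model type and type of dependent variable, jointly determine which regression routine is appropriate. The following Python sketch shows one plausible mapping onto standard R functions (lm and glm from the stats package, lmer and glmer from lme4). This mapping is an assumption for illustration only; it is not taken from the Language Variation Suite source code shown in Figure 5, and the names ModelSpec and choose_model are hypothetical.

```python
from typing import NamedTuple


class ModelSpec(NamedTuple):
    r_function: str  # name of the R fitting function
    family: str      # error distribution assumed for the response


def choose_model(effects: str, response: str) -> ModelSpec:
    """Map the two user choices onto a regression routine.

    effects:  "fixed" or "mixed"
    response: "binary" or "continuous"
    The mapping is an illustrative assumption, not the tool's own logic.
    """
    table = {
        ("fixed", "continuous"): ModelSpec("lm", "gaussian"),
        ("fixed", "binary"): ModelSpec("glm", "binomial"),
        ("mixed", "continuous"): ModelSpec("lmer", "gaussian"),
        ("mixed", "binary"): ModelSpec("glmer", "binomial"),
    }
    try:
        return table[(effects, response)]
    except KeyError:
        raise ValueError(f"unsupported selection: {effects!r}, {response!r}")


# The intensity ratio is continuous, and speakers and words are random effects.
spec = choose_model("mixed", "continuous")
print(spec.r_function, spec.family)
```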

Results.
5.1. DESCRIPTIVE STATISTICS. The overall distribution of intervocalic /d/ is illustrated in the density plot (see Figure 6). This plot shows a unimodal distribution, with its peak at 0.956. These results suggest that deletion is not the norm and that lenited variants are common in this speech community.
Figure 6. Kernel density plot for intervocalic /d/ distribution (intensity ratio)
Looking at the two chronological datasets separately, there is a sharp peak in the 1987 dataset (see Figure 7), whereas the curve becomes more evenly distributed around its peak in the 2004/2010 dataset, as shown in Figure 8.
For more information about random forest analysis, see Tagliamonte and Baayen (2012).
According to these results, the most important predictor among social factors is Age, and Token Frequency is by far the most important predictor among linguistic factors. Sex, Period, Category, Preceding and Following Contexts also contribute significant effects in predicting intervocalic /d/ lenition. As Tagliamonte and Baayen (2012) point out, random forest allows for collinear variables (highly correlated factors) to be considered jointly. For example, our model includes the following variables: category, phonetic contexts and frequency. While they are not of the same type, these factors are nonetheless highly correlated. It is well known that -ado is a preferred context for /d/ deletion: -ado is a frequent past participial suffix and at the same time it is a common phonetic context for /d/ deletion. Based on the model ranking, the order of strength is frequency > category > preceding context > following context.
The second non-parametric method, namely the conditional tree, is a single representation of recursive partitioning. While it is inferior to random forest ranking, the single tree makes it possible to visualize the partitioning of a dependent variable by independent factors. Following the methodology of Tagliamonte (2012:153), we look at social and linguistic factors separately.
According to our tree model, Age is the most important social factor, splitting speakers into two groups: 20-54 and 55+. Recall that our dependent variable is continuous and that terminal nodes are therefore represented by box-plots with the mean value (dark solid line). The 20-54 age group is further differentiated by sex, with male speakers, especially in 1987, using more lenited variants of intervocalic /d/. In 2004/2010 only younger male speakers (20-34) produce more lenited /d/. Socio-economic class is not selected as significant, which confirms the results from the random forest analysis (see Figure 9a). Among our linguistic factors, only preceding context and token frequency are selected as significant. Preceding context is split between high vowels and low/mid vowels. While random forest identifies frequency as the most important predictor (see Figure 9b), the conditional tree suggests that frequency is the most important predictor for low/mid vowels. In addition, we see that more frequent tokens exhibit more lenited variants, which supports previous accounts of /d/ deletion (see Díaz-Campos and Gradoville 2011).
Finally, we perform a parametric analysis in which we compare fixed-effects and mixed-effects models. It should be noted that each model has its own advantages and disadvantages. As Johnson (2016) states, fixed-effects models ignore individual variation, which may lead to Type I Errors, where "a chance effect is mistaken for a real difference between the populations". In contrast, mixed-effects models are prone to Type II Errors: "if speaker variation is at a high level, we cannot discern small population effects without a large number of speakers" (Johnson 2016:22-23). In addition, we need to select the best model for each regression analysis. Language Variation Suite performs model comparison by using AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion) and Anova with Likelihood Ratio Test. Table 2 illustrates the results for the best fixed-effects model (p < 2.515e-11) based on Anova and AIC criteria. This model includes the following independent factors: preceding and following contexts, token frequency, sex, age, period, morpho-syntactic category. According to the results, following phonetic context and token frequency exert a highly significant effect on lenited intervocalic /d/. As their coefficient estimates are positive (0.0197 and 0.0012, respectively), low vowels and frequent tokens favor more lenited variants. Other significant factors, in order of significance, are the 20-34 age group (p < 0.01), following mid vowel (p < 0.01), male speakers (p < 0.01), past participle (p < 0.01) and preceding low vowel (p < 0.05).
In the mixed-effects model, the overall variance is 0.001 (1.405e-05 + 1.174e-04 + 8.857e-04). Tokens represent only 1.4% of the variation (1.405e-05/0.001), whereas speakers' variation is 11.5% of the data variation. Significant factors are the following, in order of their significance: following low vowel, following mid vowel, token frequency, past participle and preceding low vowel. Our model did not select any sociolinguistic factors, demonstrating that random effects for speakers are stronger than fixed effects. However, we should keep in mind that the model may not detect small population effects, considering the small number of speakers (Johnson 2009, 2016). In contrast, random effects for word variation are less strong (only 1.4%), and the fixed effect for token frequency remains very significant. Similarly, following context remains by far the most significant factor favoring lenited variants (p < 0.001). Finally, past participles and low preceding vowels also influence /d/ lenition (p < 0.01 and p < 0.05, respectively).

6. Discussion.
The intensity ratio measurement reveals that intervocalic /d/ deletion is not the norm in the corpus of Caracas and that the lenited realization of intervocalic /d/ is more common in this speech community. In fact, the density distribution maintains its intensity peak at 0.95-0.96 across time from 1987 until 2004/2010. To examine the role of linguistic and extralinguistic contexts in lenition, we used Language Variation Suite, which implements state-of-the-art statistical methods. Its non-parametric tree-based analysis allowed us to interpret visually the role of independent factors. Furthermore, the comparison between fixed- and mixed-effects models provided a better understanding of group- and individual-level variation. First of all, in parametric and non-parametric tests, we found an effect of token frequency and following phonetic context: more frequent tokens and the low vowel /a/ strongly favor more lenited realizations of /d/. In addition, the grammatical category, namely past participle, appears to play a role in explaining the lenition process. These findings are consistent with the study by Díaz-Campos and Gradoville (2011), where frequency and -ado participles favor /d/ deletion. Concerning sociolinguistic factors, non-parametric tests and the fixed-effects regression model indicate a strong effect of age (younger speakers) and sex (male speakers) on lenited variants. In contrast, the mixed-effects model showed that individual variation in our corpus was higher than group variation (11.5%). As a result, none of the social factors was selected.
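The variance shares reported for the mixed-effects model can be rechecked from the three variance components given in Section 5. The short Python sketch below only reproduces that arithmetic. The token value is labeled as in the text, the speaker label follows from the reported 11.5%, and labeling the remaining component as residual variance is an assumption.

```python
# Variance components reported for the mixed-effects model.
components = {
    "token": 1.405e-05,
    "speaker": 1.174e-04,
    "residual": 8.857e-04,  # assumed to be the residual variance
}

total = sum(components.values())
shares = {name: value / total for name, value in components.items()}

print(f"total variance = {total:.6f}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Dividing by the unrounded total rather than by 0.001 gives the same percentages to one decimal place as those reported in the text.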
Taken together, our comparative analyses show that by conceptualizing a sociophonetic variable as continuous, we gain a better understanding of this phenomenon. In addition, advanced statistical practices offer a novel way to interpret the results of sociolinguistic multifactorial analysis.

Conclusion.
This research project contributes to the statistical analysis of socio-phonological variables. Following the methodology from recent acoustic studies, the present investigation uses intensity ratio to measure the degree of lenition. Furthermore, this study addresses questions concerning the statistical analysis of gradient phonological variables by contrasting traditional variable rule analysis with the current practices of using mixed-effects modeling and tree-based analysis.
One of the novel contributions of this project is the creation of an interactive sociolinguistic toolkit that implements state-of-the-art statistical methods, Language Variation Suite. The accessibility of the tool online and its user-friendly interface are two principal components that were missing from previous sociolinguistic tools. In addition, the deployment of the tool on the Shiny server also increases its computational power: no longer beholden to the memory limitations of personal computers, statistical calculations can now run on a server.

Figure 2. Sound wave, spectrogram and intensity contour of a more lenited intervocalic /d/ in comunicado 'informed'

Figure 4. Language Variation Suite: On-line interface

Figure 5. A sample script written in RStudio (left) for selecting a statistical model and its output as a Shiny app (right).

Figure 7. Intensity ratio for the 1987 dataset
To avoid complex trees, social factors (Figure 10) and linguistic factors (Figure 11) are shown in separate trees. It should be noted that factor groups are represented in a hierarchical order from top to bottom. In this model, the node numbers simply show the sequential labels from left to right, terminal nodes represent relative frequency of response, and p-values indicate the level of factor significance (Strobl et al. 2009).

Figure 10. Conditional inference tree with social factors

Table 1: Summary of dependent and independent variables

Table 2: Coefficients of a generalized linear fixed-effects model with an R² of 0.07564
Our second model, a mixed-effects regression model, examines the effect of individual speaker and token variability. Table 3 presents random effects and Table 4 exhibits fixed effects.

Table 4: Fixed effects of a generalized linear mixed-effects model