Automatic phonetic classification of vocalic allophones in Tol

The aim of the present study involving automatic phonetic classification of /e/ and /u/ tokens in Tol is two-fold: first, I test existing claims about allophonic variation within these vowel classes, and second, I investigate allophonic variation within these vowel classes that has yet to be documented. The acoustic phonetic classifications derived in the present study contribute to a more detailed understanding of the allophonic systems operating within the Tol language. Operationalizing machine learning algorithms to investigate under-resourced, indigenous languages has the potential to provide detailed insights into the acoustic phonetic dynamics of a diverse range of vocalic systems.

/cʰ/ and as [u] preceding other syllabic codas. The current study aimed to analyze acoustic data from a corpus of Tol speech (Salesky et al. 2020) to adjudicate these two allophonic claims.

Methods.
The corpus used in this study consisted of speech samples from spoken Bible readings by Tol speakers, gathered as part of the Vox Clamantis project (Salesky et al. 2020). The vowel tokens analyzed here had previously been segmented by Salesky et al. (2020). I used k-means clustering (Huang 1998, Steinley 2006, Kaufman and Rousseeuw 2009), a machine learning algorithm that locates a pre-specified number of clusters in a data set, to automatically cluster pre-coda tokens of /e/ (n = 27,402) and /u/ (n = 18,983) from these recordings and thereby determine whether allophonic splits were identifiable from acoustic measurements. For this analysis, I measured the first two formants (F1 and F2) of each vowel token at 25%, 50%, and 75% of its duration. All six measurements per vowel were submitted to the kmeans() function, so that the algorithm had acoustic information from several timepoints throughout each vowel token.
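As a rough illustration of the input described above, the sketch below arranges six formant measurements per token into the numeric matrix that kmeans() expects. The column names, the following-environment labels, and all values are hypothetical stand-ins rather than data from the corpus. The source does not say whether the measurements were rescaled before clustering, so the sketch passes raw values.

```r
# Simulated stand-in values: NOT measurements from the Tol corpus.
set.seed(1)
n_tokens <- 8
tokens <- data.frame(
  following = rep(c("ng", "s"), each = n_tokens / 2),  # hypothetical coda labels
  f1_25 = rnorm(n_tokens, 450, 30), f2_25 = rnorm(n_tokens, 1900, 80),
  f1_50 = rnorm(n_tokens, 470, 30), f2_50 = rnorm(n_tokens, 1950, 80),
  f1_75 = rnorm(n_tokens, 460, 30), f2_75 = rnorm(n_tokens, 1920, 80)
)

# One row per vowel token, one column per formant-by-timepoint measurement.
formant_cols <- c("f1_25", "f2_25", "f1_50", "f2_50", "f1_75", "f2_75")
x <- as.matrix(tokens[, formant_cols])
dim(x)
```

Keeping the environment label in the data frame but out of the matrix means the clustering sees only acoustics, and the labels can be compared against cluster membership afterwards.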
2.1. K-MEANS CLUSTERING ALGORITHM. I implemented this machine learning algorithm in R (R Core Team 2020) with the kmeans() function from the stats package. Because k-means clustering requires the user to specify in advance the number of clusters to look for, I first computed the optimal number of clusters for each vowel class. Although several methods for identifying an optimal number of clusters exist (Pham, Dimov, and Nguyen 2005; Chiang and Mirkin 2010; Kodinariya and Makwana 2013), I selected the silhouette method because of its relative popularity and ease of implementation in R.
I used the silhouette() function from the cluster package in R (Maechler, Rousseeuw, Struyf, Hubert and Hornik 2019) to calculate a silhouette coefficient for each set of vowel data. A silhouette coefficient measures how similar each data point is to the other data points in its own cluster relative to data points in other clusters (Rousseeuw 1987). This coefficient ranges from -1 to +1, with higher values indicating that points are closer (i.e., have lower Euclidean distances) to the other points in their own cluster than to points in the nearest neighboring cluster. To calculate this value for each vowel's acoustic measurements, the computation proceeds as follows:
(1) Compute the average distance of a given point i to all other points in point i's cluster. [a(i)]
(2) Compute the average distance of that given point i to all points in the nearest neighboring cluster. [b(i)]
(3) Compute the silhouette coefficient for point i as s(i) = (b(i) - a(i)) / max(a(i), b(i)) (Rousseeuw 1987).
(4) Repeat steps 1 through 3 for every point in the data set, then average all of the silhouette coefficients to arrive at the overall silhouette coefficient for that data set.
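The silhouette computation described above can be sketched by hand and checked against the cluster package. The helper name manual_silhouette and the simulated two-cluster data are assumptions made for illustration. The per-point formula s(i) = (b(i) - a(i)) / max(a(i), b(i)) follows Rousseeuw (1987), and the sketch assumes at least two clusters.

```r
library(cluster)  # provides silhouette()

# Per-point silhouette coefficients computed by hand.
# x: numeric matrix (rows = points); cl: integer cluster label per row.
manual_silhouette <- function(x, cl) {
  d <- as.matrix(dist(x))          # pairwise Euclidean distances
  n <- nrow(x)
  s <- numeric(n)
  for (i in seq_len(n)) {
    own <- cl == cl[i]
    own[i] <- FALSE                # exclude point i itself
    if (!any(own)) next            # singleton cluster: s(i) stays 0
    a <- mean(d[i, own])           # a(i): mean distance within own cluster
    others <- setdiff(unique(cl), cl[i])
    # b(i): smallest mean distance to the points of any other cluster
    b <- min(vapply(others, function(k) mean(d[i, cl == k]), numeric(1)))
    s[i] <- (b - a) / max(a, b)
  }
  s
}

# Simulated data with two well-separated groups (illustrative only).
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
cl <- rep(1:2, each = 20)

s_manual  <- manual_silhouette(x, cl)
s_package <- silhouette(cl, dist(x))[, "sil_width"]
all.equal(s_manual, unname(s_package))  # the two computations should agree
mean(s_manual)                          # overall silhouette coefficient
```

Note that silhouette() returns one coefficient per point; the overall coefficient for a data set is the mean of that column.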
After computing silhouette coefficients for a range of cluster counts for each vowel class, I selected the number of clusters that yielded the highest positive silhouette coefficient. These comparisons are visualized in the results section. Once I had identified the optimal number of clusters for each of the two vowel classes, I ran the k-means algorithm in R with the kmeans() function from the stats package. This function took two arguments: the six acoustic measurements associated with each vowel token and a numeric value for k, the number of clusters associated with the optimal silhouette coefficient. The k-means algorithm completes these steps:
(1) Randomly choose centroids for k clusters.
(2) Calculate the distance from each data point to each current centroid.
(3) Assign each data point to the closest centroid (i.e., cluster) in terms of Euclidean distance.
(4) Calculate a new centroid for each cluster by averaging the locations of all points assigned to that cluster.
(5) Repeat steps 2 through 4 until the cluster centroids stop moving.
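The five steps above can be written out as a minimal from-scratch sketch. The function simple_kmeans is a hypothetical helper for illustration, not the code used in the study, and it assumes the starting centroids are drawn from the data points themselves. One caveat: kmeans() in the stats package defaults to the Hartigan-Wong algorithm, and algorithm = "Lloyd" selects the procedure closest to these steps; the source does not state which variant was used.

```r
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k distinct data points at random as starting centroids.
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every point to every centroid.
    d2 <- vapply(seq_len(k),
                 function(j) rowSums(sweep(x, 2, centroids[j, ])^2),
                 numeric(nrow(x)))
    # Step 3: assign each point to its closest centroid.
    assignment <- apply(d2, 1, which.min)
    # Step 4: move each centroid to the mean of its assigned points.
    new_centroids <- centroids
    for (j in seq_len(k)) {
      members <- assignment == j
      if (any(members)) {
        new_centroids[j, ] <- colMeans(x[members, , drop = FALSE])
      }
    }
    # Step 5: stop once the centroids no longer move.
    moved <- max(abs(new_centroids - centroids))
    centroids <- new_centroids
    if (moved < 1e-9) break
  }
  list(cluster = assignment, centers = centroids)
}

# Simulated data with two well-separated groups (illustrative only).
set.seed(7)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 6), ncol = 2))
fit_sketch <- simple_kmeans(x, k = 2)
fit_stats  <- kmeans(x, centers = 2, nstart = 10)

# Cluster numbering is arbitrary, so compare partitions with a cross-tabulation.
table(sketch = fit_sketch$cluster, stats = fit_stats$cluster)
```

Because the starting centroids are random, results can differ between runs on less cleanly separated data; the nstart argument of kmeans() reruns the algorithm from several random starts and keeps the best solution.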

Results.
First, I calculated the appropriate number of clusters for each of the two sets of vowel tokens using the silhouette method. Figures 1 and 2 show the results of these silhouette analyses for /e/ and /u/ tokens, respectively. For both vowel classes, the silhouette method indicated that two clusters was the optimal value for k in the k-means algorithm, suggesting that this number of clusters maximized each data point's similarity to the other data points assigned to its own cluster. The results of the k-means algorithms, both of which were set to locate two clusters in accordance with the silhouette analyses, are shown in Figures 3 and 4. For plot readability, the dimensions shown on the axes are mean F1 and F2 values per following environment at 25% duration.
For the /e/ tokens, pre-/ŋ/ productions were distinct from productions in other following environments in terms of both F1 and F2 at 25% duration. This finding matches the impressionistic reports offered by Fleming and Dennis (1977) and Holt (1999). The results of the current study thus provide supplementary acoustic evidence in favor of this allophonic distinction within the /e/ class in Tol.
For the /u/ tokens, pre-/j/ productions were distinct from productions in other following environments primarily in terms of F2 at 25% duration. The frontness of these pre-/j/ tokens suggests that they are more [ʊ]-like than the tokens in other following environments. While Fleming and Dennis (1977) and Holt (1999) impressionistically observed [ʊ]-like productions for the /u/ vowel class in pre-/s/ and pre-/cʰ/ environments, the current study provides acoustic evidence that [ʊ]-like productions are most common in pre-/j/ environments.
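The workflow reported in this section (comparing average silhouette coefficients across candidate values of k, then summarizing formant means by following environment) can be sketched as follows. All data here are simulated stand-ins, the environment labels and column names are hypothetical, and only two of the six measurement dimensions are used for brevity; none of the values reflect the Tol results.

```r
library(cluster)  # provides silhouette()

# Simulated tokens: one environment ("ng") is given shifted formants.
# Illustrative values only, NOT measurements from the Tol corpus.
set.seed(123)
n <- 150
following <- rep(c("ng", "s", "t"), each = n / 3)
shift <- ifelse(following == "ng", 1, 0)
x <- cbind(f1_25 = rnorm(n, 450 + 120 * shift, 25),
           f2_25 = rnorm(n, 1900 - 400 * shift, 60))

# Average silhouette coefficient for each candidate number of clusters.
candidate_k <- 2:6
d <- dist(x)
avg_sil <- vapply(candidate_k, function(k) {
  fit <- kmeans(x, centers = k, nstart = 10)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
}, numeric(1))
best_k <- candidate_k[which.max(avg_sil)]

# Final clustering, cross-tabulated against the following environment.
fit <- kmeans(x, centers = best_k, nstart = 10)
table(following, cluster = fit$cluster)

# Mean F1 and F2 per following environment, as plotted on the figure axes.
env_means <- aggregate(as.data.frame(x), by = list(following = following),
                       FUN = mean)
env_means
```

The cross-tabulation is what links the unsupervised clusters back to phonological environments: an allophonic split shows up as one environment falling almost entirely into its own cluster.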

Discussion and conclusions.
The findings reported in the current study partially support and partially challenge previous accounts of vocalic allophony in the /e/ and /u/ vowel classes in Tol. My analysis of /e/ tokens supported existing impressionistic descriptions of /e/ allophony in Tol, but my analysis of /u/ tokens showed that pre-/j/ productions were consistently closer to [ʊ] than pre-/s/ or pre-/cʰ/ productions were.
For the /u/ productions in pre-/j/ environments, it is possible that a phonological motivation related to natural classes is at work. What complicates this analysis is that there appears to be one allophone in pre-/j/ environments and another allophone in pre-/w/ environments. One possible explanation for the [ʊ]-like allophone appearing only before /j/ is coarticulation, but it is not yet clear how the /uj/ sequence differs from the /uw/ sequence in Tol such that one would trigger an allophonic change due to coarticulation and the other would not. Another possibility is that /j/, which has been described by Fleming and Dennis (1977) as a syllabic coda that can occur after /u/, is actually functioning as part of a vowel-glide sequence whose internal structure is distinct in some important way from that of vowel-consonant sequences. The current analysis does not adjudicate among these possibilities because the relevant data are not yet available, but there are clearly several plausible motivations for this allophonic split.
The primary aim of the current study was to examine whether acoustic evidence could be located to support or dispute impressionistic descriptions of vocalic allophony in the Tol language for the /e/ and /u/ vowel classes. My results for /e/ supported existing descriptions of /e/ allophony, but my results for /u/ did not support existing descriptions of /u/ allophony. Because Tol is an under-resourced language with limited acoustic documentation, the current study offers a more detailed perspective on the acoustics of its vocalic allophonic systems. More broadly, applying machine learning techniques to the acoustic dynamics of under-studied vocalic systems has the capacity to expand knowledge about allophony and its triggers cross-linguistically. In particular, k-means clustering is a useful tool for exploring allophonic patterns in acoustic space, and this kind of machine learning algorithm has the potential to be used for a wider variety of clustering tasks in the future, including phoneme identification and the analysis of tone systems.