A Robin Hood approach to forced alignment: English-trained algorithms and their use on Australian languages

Forced alignment automatically aligns audio recordings of spoken language with their transcripts at the segment level, greatly reducing the time required to prepare data for phonetic analysis. However, existing algorithms are mostly trained on a few well-documented languages. We test the performance of three algorithms against manually aligned data. For at least some tasks, unsupervised alignment (either based on English or trained from a small corpus) is sufficiently reliable to be used on legacy data for low-resource languages. Descriptive phonetic work on vowel inventories and prosody can be accurately captured by automatic alignment with minimal training data; consonants pose significantly greater challenges for forced alignment.

1. Introduction. In order to conduct phonetic analysis, an audio recording must be aligned with its transcript at the segment level. Forced alignment (FA) is the use of computer algorithms to accomplish this task. Without forced alignment, manually segmenting and aligning a transcript with a sound file for phonetic analysis is often prohibitively time- and labor-intensive. Digital recording technologies have made this issue more pronounced by increasing the amount of data available and thus the amount of time needed for manual alignment. In these situations, existing FA algorithms are very helpful; however, most are trained on only a small number of well-documented and highly resourced languages (Lin et al. 2005; Yuan and Liberman 2008). This situation presents a challenge to researchers working on under-resourced and endangered languages, because there are often no existing language models for FA algorithms. Furthermore, data from under-resourced and endangered languages may be legacy data recorded on analog media, which is overlooked in favor of widely available and easier-to-use digital recordings, further exacerbating the digital divide.
Another complicating feature of many FA algorithms is that the amount of data required to train entirely new language models accurately is often prohibitive. For endangered languages with small corpora of legacy data, this is simply not possible. In order to conduct phonetic analyses of such languages, a solution is needed that circumvents the limitations of manual alignment while working with the available FA algorithms. For this reason, we investigate whether existing (pre-trained) alignment algorithms are in fact usable for languages without large corpora and financial resources.
The language used for our test is Yidiny, a Pama-Nyungan language from the Cairns Rainforest region of Australia's Cape York Peninsula. Yidiny's closest relative is Djabugay (Patz 1991). To our knowledge, Yidiny currently has no fluent speakers, which limits the possibility of adding to the corpus gathered by R.M.W. Dixon approximately 50 years ago. Researchers such as Dixon (1977b) have done some work on its sound system, but much important analysis remains to be done. Yidiny, like most Australian languages, certainly qualifies as highly endangered and under-resourced, and as such makes an ideal candidate for the consideration of alternative documentation methods.
For our forced alignment algorithms, we use three models. Two of these are trained on English: P2FA (Evanini et al. 2009) and DARLA (Reddy and Stanford 2015). The third, the Montreal Forced Aligner (MFA), is not trained on English but allows for training on small corpora such as Yidiny's (Povey et al. 2011). After training MFA on Yidiny, we use all three aligners on the same corpus and then compare the results of these algorithms with manually corrected data as our gold standard. In doing so, we assess how accurately FA algorithms capture the alignment of Yidiny segments for the purpose of acoustic phonetic description.

2. Methods.
2.1 DATA SOURCE. Though there are no longer fluent speakers of Yidiny to our knowledge, a body of available data makes further linguistic analysis possible. The Yidiny materials used in this project come from a group of eight recordings and their associated transcriptions made by Dixon in the late 1960s and deposited at the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS). The recordings comprise narratives from two different speakers, Tilly Fuller and Dick Moses, and range in duration from about 4 to 8 minutes. In total, we use about 45 minutes of speech. The speakers were both born at the end of the 1800s and were fluent speakers of the language. It should be noted that Dick Moses spoke the coastal dialect while Tilly Fuller spoke the inland (tablelands) dialect. While this is not the complete extent of audio materials recorded for Yidiny, it is a substantial part of the publicly available narrative corpus.
2.2 DATA PREPARATION. We began by creating a preliminary ARPABET pronunciation dictionary for Yidiny. We then used it to align the transcripts, and corrected the alignments manually in Praat. We created customized ARPABET pronunciation dictionaries to introduce multiple test conditions, which we then automatically aligned using P2FA and MFA. The manually corrected alignments became the basis for comparison to the various conditions of automatic alignment. Further discussion of the workflow and possible use cases is given in Section 4 below. Because Yidiny orthography is surface phonemic, the transcripts correspond fairly closely to the surface pronunciation, including allomorphic variation (e.g. alternations in phonemic vowel length). This fact made the transcription of most segments into ARPABET straightforward. However, because the transcriptions and orthography do not take allophony into consideration, and ARPABET is limited to English phonemes, some segments were more difficult to map than others. This raises the question of whether, where multiple ARPABET-Yidiny mappings are possible, such choices affect the accuracy of automatic alignment. This question provides the basis for our different conditions. Furthermore, the orthography corresponds more closely to Dick Moses's dialect than Tilly Fuller's.
Yidiny's syllable structure is primarily CV(C), with a few consonant clusters occurring word-medially (Nash 1979; Dixon 1977a). Yidiny's vowel system distinguishes three vowel qualities, /i/, /a/, and /u/, with phonemic length distinctions for all three (/i:/, /a:/, and /u:/). Because P2FA is primarily concerned with examining consonant-vowel transitions, we were not concerned with distinguishing between stressed and unstressed vowels in the ARPABET pronunciation dictionary. We coded each vowel as having primary stress (indicated by a 1 following the ARPABET segment). However, we did code for phonemic length distinctions. Other segments present challenges for the Yidiny pronunciation dictionary. The two rhotics, /ɻ/ and /r/, exhibit allophonic variation and neutralization, with both being realized as taps in certain contexts. Because the trill is more commonly realized as a tap than the approximant, we chose to represent it in some conditions as R and in others as D, while the approximant was represented as R in all conditions. Because Yidiny shows no phonemic distinction between voiced and voiceless stops, we represented the stops /b/, /d/, /ɟ/, and /g/ as voiced B, D, JH, and G in the Voiced condition, but voiceless P, T, CH, and K in all others. This decision meant that there was some overlap in the Voiced condition between /r/ and /d/, because both were coded as D. Of particular challenge was the palatal nasal stop. We represented this segment in three different ways: as N, as Y, or as a cluster of N followed by Y. These mappings are summarized in Table 2. We also chose the Voiceless condition as the basis for our manually corrected alignment.
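Mappings of this kind can be applied mechanically with a greedy, longest-match-first transliteration. The sketch below illustrates the idea for a Voiceless-style condition; the orthographic spellings and ARPABET targets are illustrative assumptions for this example, not the project's actual dictionary.

```python
# Illustrative orthography -> ARPABET mapping for a Voiceless-style
# condition. Spellings and targets are assumptions for this sketch.
VOICELESS_MAP = [
    ("ny", "N Y"),   # palatal nasal, here as an N + Y cluster
    ("ng", "NG"),    # velar nasal
    ("rr", "R"),     # trill (R in this condition, D in others)
    ("aa", "AA1"),   # long vowels; every vowel carries primary stress
    ("ii", "IY1"),
    ("uu", "UW1"),
    ("a", "AH1"),
    ("i", "IH1"),
    ("u", "UH1"),
    ("b", "P"), ("d", "T"), ("j", "CH"), ("g", "K"),  # stops as voiceless
    ("r", "R"),      # approximant rhotic, R in all conditions
    ("m", "M"), ("n", "N"), ("l", "L"), ("w", "W"), ("y", "Y"),
]

def to_arpabet(word):
    """Greedy longest-match transliteration of one orthographic word."""
    rules = sorted(VOICELESS_MAP, key=lambda kv: -len(kv[0]))
    out, i = [], 0
    while i < len(word):
        for orth, arp in rules:
            if word.startswith(orth, i):
                out.append(arp)
                i += len(orth)
                break
        else:
            i += 1  # character outside the mapping; skip it
    return " ".join(out)

print(to_arpabet("bunya"))   # P UH1 N Y AH1
print(to_arpabet("ngaygu"))  # NG AH1 Y K UH1
```

Ordering the rules longest-first is what keeps digraphs like "ny" and "ng" from being split into their single-letter components.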
While manually aligning the texts, we encountered several issues.The first problem was the presence of non-speech sounds in the recordings.These sounds included birdsong, laughter, hesitations on the part of the speaker, and strong winds (many recordings were made outside).
Where it became obvious that these sounds had interfered significantly with the accuracy of the automatic alignment, we did not use the affected portions of the files and removed these portions from the transcriptions.
Furthermore, Dixon's transcription did not always map to the audio in the recording. In the case of the narratives from Tilly Fuller, this was a result of differences in dialect. Because we were interested in the underlying phonemic representations, we chose to align according to the transcription in these instances, even where individual segments were not immediately apparent on a cursory investigation. In several other instances, the transcript did not match the audio as a result of simply being incorrectly transcribed or including hesitations, stuttering, or other sounds (e.g. backchanneling) from the speaker. In these cases the transcripts were manually corrected to match the audio file.
The alignment for DARLA followed a different method. Because DARLA does not use an ARPABET pronunciation dictionary, we did not have different conditions arising from ARPABET transcription decisions. DARLA allows alignment in two different ways. The first requires a TextGrid file that already has utterance boundaries notated; the second uses a plain text transcript with no boundaries marked. We used the second method to align our transcripts and recordings. Because DARLA extrapolates segmentation from the orthography, the number and nature of segments detected by DARLA did not always correspond to the manually corrected and automatically aligned P2FA alignments.
The methods for the MFA alignment were similar to those used for P2FA. MFA uses an ARPABET-based pronouncing dictionary (we used the same file as for the P2FA alignment). MFA requires the audio to be in small chunks (of several seconds), while P2FA can align long audio files, working successfully on files of 10 minutes in duration or longer. We ran both P2FA and MFA using files segmented at the utterance level, with an 80 ms buffer before and after each segment. Because DARLA works by manually uploading single files, we could not test DARLA in the same way. Instead, we uploaded each text as its own audio file and transcript.
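The buffered presegmentation step can be sketched as follows: each non-empty utterance interval is padded by 80 ms on each side, clamped to the bounds of the recording. The interval values here are hypothetical; real boundaries would come from an utterance tier of a Praat TextGrid.

```python
# Sketch of utterance presegmentation with an 80 ms buffer.
BUFFER = 0.080  # seconds, as used in the paper

def padded_chunks(utterances, file_duration):
    """Return (start, end, text) tuples, padded by BUFFER on each side
    and clamped to [0, file_duration]. Empty intervals are skipped."""
    chunks = []
    for start, end, text in utterances:
        if not text.strip():      # skip silence / empty intervals
            continue
        chunks.append((max(0.0, start - BUFFER),
                       min(file_duration, end + BUFFER),
                       text))
    return chunks

utts = [(0.50, 2.10, "ngayu gada"), (2.60, 4.00, "")]
print(padded_chunks(utts, 60.0))  # one chunk, roughly (0.42, 2.18, 'ngayu gada')
```

Each padded chunk would then be written out as a short audio file plus a matching text file for MFA or P2FA.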
2.3 DATA PROCESSING. After generating our automatic alignments and completing the corrections for the Manual condition, we extracted various measures from all of the resulting TextGrids. We adapted a duration logger script by Christian DiCanio to extract measures from each TextGrid interval; we also used his corpus analytics script, which takes measures that include segment and word durations, F0, intensity, and vowel formants. We wrote a script in R (R Core Team 2018) for post-processing and analysis of the resulting data.
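The duration-logging part of this step amounts to mapping each labelled interval on a phone tier to its duration. The sketch below is a simplified stand-in for that logic, not DiCanio's actual script; in practice a TextGrid would be parsed with a library such as praatio or textgrid, and the tier contents here are hypothetical.

```python
# Simplified duration extraction from a phone tier of
# (start_sec, end_sec, label) intervals.

def segment_durations(phone_tier):
    """Return (label, duration_ms) for each labelled interval,
    skipping unlabelled (silent) stretches."""
    return [(label, round((end - start) * 1000, 1))
            for start, end, label in phone_tier
            if label.strip()]

tier = [(0.00, 0.12, "P"), (0.12, 0.31, "UH1"), (0.31, 0.40, "")]
print(segment_durations(tier))  # [('P', 120.0), ('UH1', 190.0)]
```

The per-segment rows produced this way are what the R post-processing script would then aggregate by phone, word, and alignment condition.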
Linear mixed effects models were used to compare different alignment algorithms and different conditions within P2FA, using the lmerTest package in R (Kuznetsova et al. 2017). Fixed effects for algorithm comparison were speaker gender and alignment model, with a random effect of word. We used the vowels package (Kendall and Thomas 2018) to process data on the vowel spaces.

3. Results.
How 'good' is forced alignment used in this way? The answer depends on what sort of measurements the researcher needs. Certain types of questions are more robust to the error these automatic aligners introduce than others. Prosodic measurements are highly robust to aligner error, while consonant alignment and durations are less so. This section considers the accuracy of all alignment conditions in detecting prosodic factors (pitch maximum, pitch peak), vowel measurements, and consonant durations.
3.1 PROSODY. Projects looking at prosody and stress are likely to require accurate F0 measurements. Figure 1 shows the peak F0 measurements from each of the alignment conditions. DARLA differed significantly from manual alignment, but P2FA and MFA were not significantly different from the gold standard. Peak F0 was within 1-2 Hz of the manual condition.
Figure 1 also shows the location of F0 peak measurements in the word, and again little difference is found across alignment conditions. DARLA results were significantly different from the manual results in this case; other conditions do not differ from manual. Average peak locations were within 0.6%-1% of manual alignment.
Overall, these results suggest that studies of prosody would benefit greatly from the use of forced alignment, with little to no loss of accuracy. Such findings are particularly encouraging for work on Australian languages, where work on prosody is still at an early stage.
3.2 VOWEL SPACE. One way to measure the accuracy of vowel segmentation is to consider how accurate the resulting vowel space is (i.e. first and second formants). Measurements of vowel formants are slightly less robust to aligner error than F0 measurements. Vowel means are shown in Figure 2; DARLA's results were quite far off from the manual results, while vowel means from P2FA and MFA are within 6 Hz of the manual F1 means and within 20 Hz of the manual F2 means. Vowels whose means deviate more from manual alignment are those which have the fewest tokens in the data set.
Research requiring accurate vowel space measurements may need more manual correction than is needed for prosody studies. However, preliminary forced alignment greatly speeds up the task.
3.3 CONSONANT DURATION. Figure 3 compares average duration results for each consonant across all alignment conditions. Automatic algorithms varied with respect to their accuracy in different groups of consonants. For example, the mean durations for the oral stop consonants show that DARLA and P2FA tended to pick out shorter stops than manual, while MFA trended long. For nasal stops, on the other hand, MFA trended short and P2FA long, and overall accuracy for all algorithms was improved. Glides and liquids showed greater variation, with DARLA giving long /l/ segments but being fairly accurate on other segments. MFA and P2FA trended long in their /y/ durations.
Projects requiring an analysis of consonants, VOT, lengthening, etc. require accurate and consistent consonant segmentation. Consonant duration measurements were least robust to aligner error, and showed wide variation across different phonemes in Yidiny. This sort of measurement is also likely to be the most variable cross-linguistically, as the accuracy of these automated models on consonant segmentation depends on the similarity of a language's consonant inventory to English. Therefore forced alignment should not be used without manual inspection and correction when this is the object of study.

4.1 WORKFLOW. If the input to the forced alignment is an ELAN transcript file which is already segmented at the utterance level, it is straightforward to extract those intervals for use with the forced aligner (either P2FA or MFA). However, if the input to the project is a paper transcript and a digitized audio file, unless the researcher wishes to manually align at the utterance level, P2FA affords an easier choice. Presegmentation is done from utterance-level segmentations (in ELAN or Praat). If in ELAN, the files should be exported to Praat TextGrids. A script is then used to extract the utterances with a boundary buffer of 80 ms; it also saves each TextGrid interval as a text file. The audio and transcript files need to have the same filenames as each other, and these filenames cannot contain punctuation, which means that many collections will require additional preprocessing.
In order to run both MFA and P2FA, a pronouncing dictionary is required. We created this by concatenating all text files into a large text file and deleting duplicates, using the free text editor BBEdit. We then created a version of the file which transliterated the orthographic conventions into ARPABET characters (which is read by both MFA and P2FA).
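The concatenate-and-deduplicate step can equally be scripted rather than done in a text editor. This is a minimal sketch; the directory layout, file pattern, and punctuation handling are assumptions for the example.

```python
# Build a sorted, unique wordlist from a directory of plain-text
# transcripts (the first step toward a pronouncing dictionary).
import glob

def build_wordlist(pattern):
    """Collect unique, lowercased words from all files matching
    the glob pattern, stripping common punctuation."""
    words = set()
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            for line in f:
                words.update(w.lower().strip(".,;:?!\"'") for w in line.split())
    return sorted(w for w in words if w)
```

Each entry of the resulting wordlist would then be paired with its ARPABET transliteration to form the dictionary read by MFA and P2FA.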
Finally, the models need to be run. We did this with a shell script which called the relevant Python scripts that run the aligner on each sound file. This allowed us to avoid having to enter each filename manually.

4.2 USE CASES. Our project is a particular use case for forced alignment; that is, testing forced alignment algorithms on archival data. In the course of completing this project, we had to make choices for our data that would be different if our aims were different. In this section we document some of the implications of our choices for different use cases.
Language documentation in progress. Our first use case is a fieldworker who is working on language documentation and needs word-level alignment of transcripts in progress. They are already transcribing manually at the utterance level but wish to use word-level alignment in the documentation. In this case, our procedures for segmenting audio files at the utterance level would not work well (at least, not as we did it), because the segmentation would be subsequently unalignable with the full transcripts. That is, the fieldworker would have to reimport or realign the utterance-level segments. Alternatively, they could align the full file using P2FA, or they could write a script for 'stitching' the utterance-based TextGrids back together into a single ELAN file. Given that our methods include the timestamp of the file in the filename, this would be straightforward. However, the fieldworker would probably want to modify the timestamp to add another decimal place or two (so the 'stitching' is more accurate).
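The 'stitching' idea can be sketched as follows: if each utterance chunk carries its start time in its filename, every interval can be shifted back into original-file time and the chunks merged into one tier. The filename convention (`name_SSS.SSS.TextGrid`) and interval format here are assumptions for the example.

```python
# Sketch of re-stitching utterance-level TextGrid chunks using a
# start-time offset encoded in each chunk's filename (assumed
# convention: name_SSS.SSS.TextGrid).
import re

def restitch(chunks):
    """chunks: {filename: [(start, end, label), ...]} in chunk-local time.
    Returns one interval list in original-file time, sorted by onset."""
    merged = []
    for fname, intervals in chunks.items():
        m = re.search(r"_(\d+\.\d+)\.TextGrid$", fname)
        offset = float(m.group(1))
        merged.extend((s + offset, e + offset, lab) for s, e, lab in intervals)
    return sorted(merged)

chunks = {
    "moses_12.500.TextGrid": [(0.0, 0.4, "ngayu")],
    "moses_3.250.TextGrid": [(0.0, 0.5, "gada")],
}
print(restitch(chunks))  # [(3.25, 3.75, 'gada'), (12.5, 12.9, 'ngayu')]
```

With three decimal places in the filename, offsets are accurate to the millisecond, which motivates the suggestion above of adding an extra decimal place or two.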
Because the fieldworker in this case only needs word-level alignment, not segment-level alignment, the methods here are probably accurate enough for most cases.
Another use case for language documentation in progress is a fieldworker with a small corpus who wishes to use force-aligned data to make preliminary observations about the phonetics of the language. For example, perhaps they are unsure about the phonemic status of some transcriptions and would like to plot vowel tokens to see the extent to which they cluster, or perhaps they are unsure whether secondary stress exists in the language, or whether stress ghosting (Fletcher and Butcher 2014) is a factor. (In our work so far, we have found that alignment errors can be compounded, affecting multiple phrases; on the other hand, there is, in broad terms, no correlation between position in the file and accuracy of alignment. Work is currently in progress to further evaluate the effects of file length on accuracy.) In that case, the most important thing to do would be to set up a workflow that automates the process of adding new data. Because the end result is analytical data that feeds back into the documentation project, but does not require direct links between the alignments and the original transcripts, it is possible to use utterance-level segmentations as are required for MFA.
This fieldworker should be aware that the results are likely to be accurate enough for impressionistic findings, but the alignments should be manually checked as much as possible. Note also that if the purpose of the phonetic research is to determine differences that crucially rely on segment length (like stress or gemination), alignments done automatically should be checked manually before relying on the results.
Archival research. Another use case is where the linguist uses forced alignment to create a segment-aligned corpus for phonetic research. Here the methods outlined in this paper will probably be sufficient (with subsequent manual checking if duration results are required). However, for recordings made in the field, outside of the lab, considerable preprocessing may be required. The work of Johnson et al. (2018) makes clear that there is much to gain from removing extraneous noise that interferes with forced alignment.
Community research. Another use case is when the fieldworker requires word-level segmentation to create talking dictionaries. This will work for words recorded in isolation, but probably not directly for words extracted from running speech.

4.3 FORCED ALIGNMENT WITH SPEECH TO TEXT. Finally, a note is warranted about speech to text. Forced alignment takes speech data and text data and aligns the two; speech-to-text models take speech data and create transcripts from it. We imagine an ultimate workflow where a preliminary speech-to-text model is trained on manually transcribed data; Persephone (Adams et al. 2018) is such a project. Data could then be automatically transcribed, corrected by the researcher, and then fed to a forced alignment algorithm for segment-level alignment (speech to text transcribes utterances but does not align them at the segment level). Such a workflow could provide more data for language projects where untranscribed audio recordings exist.
5. Conclusions. For many languages without available forced alignment algorithms, data for phonetic analysis exists but is underutilized due to the prohibitively time-intensive manual alignment process. For endangered languages, such as those of Australia, this situation is further complicated by the presence of small corpora of legacy data in the form of hand-transcribed audio tape recordings (Austin 2013). This creates both a need and an opportunity to leverage new technology for the documentation of these languages. One possible approach, determining the accuracy of English-trained forced alignment algorithms on non-English language data, has great potential for elevating the quality and rigor of phonetic work on low-resource and under-documented languages, especially those for which there are legacy recordings but no contemporary speakers available to provide training data for entirely new FA models. Our work with Yidiny shows promising results for these languages, implying that for at least some tasks, unsupervised alignment (either based on English or trained from a small corpus) is sufficiently reliable to be used on legacy data for low-resource languages, whether endangered or not. In particular, descriptive phonetic work on vowel inventories and prosody can be accurately captured by automatic alignment with minimal training data. The novel use of this technology on under-resourced languages raises new possibilities for more detailed language documentation and for including more languages and data in comparative work on phonetics, phonology, and sound change.

Figure 1: Density plots showing, across all alignment algorithms, the location of pitch peak in the word (left) and the maximum pitch values across all words (right). DARLA distributions are significantly different from manual on both counts, but P2FA and MFA (Kaldi) are not.

Figure 2: Space of vowel means for Yidiny speakers Tilly Fuller (left) and Dick Moses (right) across all alignment conditions. DARLA results were poor; P2FA and MFA (Kaldi) results were within 6 Hz for F1 and within 20 Hz for F2.