Learning Nonlocal Phonotactics in a Strictly Piecewise Probabilistic Phonotactic Model

Phonotactic learning is a crucial aspect of phonological acquisition and has figured significantly in computational and theoretical research in phonology. However, one persistent challenge for this line of research is inducing nonlocal co-occurrence patterns (Hayes & Wilson, 2008; Gouskova & Gallagher, 2020). Most previous phonotactic learners, especially the baseline Maximum Entropy (MaxEnt) learner (Hayes & Wilson, 2008), evaluate contiguous sequences of n items (local n-grams) as phonological constraints. As the length n increases, the search space grows so quickly that it becomes intractable; such a learner cannot efficiently detect co-occurrence patterns over arbitrary distances. For instance, instead of directly penalizing the nonlocal dependency between two sibilants, the learner can only inefficiently approximate *s…S by penalizing an enormous number of local n-grams, e.g. the trigram *soS, the 5-gram *sopoS, and so on. Most subsequent work on the MaxEnt learner generalizes nonlocal phonotactics by searching local n-grams over postulated tiers/projections (Wilson & Gallagher, 2018). Gouskova & Gallagher (2020) further offered a method for inducing tiers from placeholder trigrams; however, their learner is only shown to succeed on data in which the target phonotactics largely occur within local trigrams rather than as nonlocal dependencies at arbitrary distances. The current study challenges local n-grams as the presumed hypothesis space of phonotactic models in MaxEnt approaches, and develops a probabilistic phonotactic learner based on the Strictly Piecewise class of subregular languages (Heinz, 2010). The implemented learner successfully learns both segmental and featural representations of Quechua, and correctly predicts the acceptability of the nonce forms in Gouskova & Gallagher (2020).


Introduction
Phonotactic learning is a crucial aspect of phonological acquisition and has figured significantly in computational and theoretical research in phonology. However, one persistent challenge for this line of research is inducing nonlocal co-occurrence patterns (Hayes & Wilson, 2008; Gouskova & Gallagher, 2020). Most previous phonotactic learners, especially the baseline Maximum Entropy (MaxEnt) learner (Hayes & Wilson, 2008), evaluate contiguous sequences of n items (local n-grams) as phonological constraints. As the length n increases, the search space grows so quickly that it becomes intractable; such a learner cannot efficiently detect co-occurrence patterns over arbitrary distances. For instance, instead of directly penalizing the nonlocal dependency between two sibilants, the learner can only inefficiently approximate *s…S by penalizing an enormous number of local n-grams, e.g. the trigram *soS, the 5-gram *sopoS, and so on. Most subsequent work on the MaxEnt learner generalizes nonlocal phonotactics by searching local n-grams over postulated tiers/projections (Wilson & Gallagher, 2018). Gouskova & Gallagher (2020) further offered a method for inducing tiers from placeholder trigrams; however, their learner is only shown to succeed on data in which the target phonotactics largely occur within local trigrams rather than as nonlocal dependencies at arbitrary distances.
The current study challenges local n-grams as the presumed hypothesis space of phonotactic models in MaxEnt approaches, and develops a probabilistic phonotactic learner based on the Strictly Piecewise class of subregular languages (Heinz, 2010). The implemented learner successfully learns both segmental and featural representations of Quechua, and correctly predicts the acceptability of the nonce forms in Gouskova & Gallagher (2020).

Motivations
The current study is grounded in the "Subregular Hypothesis" (Rogers et al., 2013; Heinz, 2018), which argues that most phonological generalizations belong to a restrictive subregular region of the Chomsky Hierarchy. Each subregular language can be characterized by a corresponding finite set of constraints (the grammar) and by a deterministic finite-state automaton (de la Higuera, 2010), which can be efficiently implemented and computed. In particular, the Strictly Local (SL) and Strictly Piecewise (SP) languages correspond to the least expressive logic and the lowest computational complexity.
This computational characterization of phonological typology leads to the argument that phonotactic learning should search through this restrictive region, in keeping with finite human cognition, instead of an intractable infinite hypothesis space. Specifically, previous studies have shown that Strictly Piecewise languages provide a plausible hypothesis space for nonlocal phonotactics (Heinz, 2010; Rogers et al., 2009; Rogers & Pullum, 2011). There are two extremes of phonotactic models: the discrete, categorical/Boolean, qualitative model, which precludes any non-categoricality in grammar, and the continuous, probabilistic, quantitative model, which denies the value of any categorical generalization (Norvig, 2012; Manning, 2003; Bod et al., 2003; Chater & Manning, 2006). Given a string (or phonological word) in a language, a categorical/Boolean model predicts a binary value ("the string is/isn't in this language"), while a probabilistic model predicts a probability ("the string has 1% probability of occurrence as a randomly selected word"). A probabilistic model is better suited to noisy corpus data, because it assigns a lower probability to illegal forms in the corpus instead of categorically ruling them out.
Previous studies on subregular phonology focus on the computational characterization of phonological typology rather than on accounting for noisy corpus data, and usually work with non-probabilistic phonotactic models (Heinz et al., 2011; Jardine & Heinz, 2016; McMullin, 2016; Jardine & McMullin, 2017). However, it is incorrect to claim that the subregular approach is incompatible with probabilistic phonotactic models per se. Following Heinz & Rogers (2010), the current study starts to bridge the gap between the probabilistic approach and subregular phonotactic models.
The current study also incorporates feature-based representations into the proposed phonotactic learner. Previous work showed that feature-based phonotactic learners are capable of handling unattested data (Albright, 2009; Wilson & Gallagher, 2018; Mayer & Nelson, 2020a). The current study focuses on implementing a feature-based phonotactic model without the full representation of natural classes such as *[+nasal, +voice, …][−nasal, +voice, …], because the role of potential feature interactions is unknown (Heinz & Koirala, 2010).

Contributions
The current study offers insights into the computational modeling of phonotactic learning by formally restricting the parameter space of the phonotactic model. Previous work on MaxEnt models is rooted in the assumption that local n-grams provide the baseline hypothesis space (Hayes & Wilson, 2008). This assumption excludes alternative structures such as the Strictly Piecewise stringsets, which naturally describe nonlocal dependencies (Heinz, 2007, 2010; Heinz & Idsardi, 2017).
Moreover, studying and implementing an SP phonotactic model and learner bridges the gap between theoretical work in Formal Language Theory (FLT) and corpus data. Instead of accounting for noisy data from natural languages, the FLT research program has concentrated on the demarcation of linguistic typology with respect to computational complexity. FLT approaches to learning usually assume exceptionless categorical phonotactics and symbolic phonological representations, and are thus unable to handle noisy corpus data (Wilson & Gallagher, 2018; Gouskova & Gallagher, 2020).
The current study also provides tools for the future study of statistical learning over other subregular classes. Although the current study focuses on SP languages, the proposed phonotactic learner can be extended to other subregular classes, such as the Strictly Local and Tier-based Strictly Local languages (Heinz et al., 2011; Jardine & Heinz, 2016; McMullin, 2016; Jardine & McMullin, 2017). There is abundant room for further progress in determining the necessary structural assumptions of statistical phonotactic learning.
Furthermore, the SP phonotactic model is of great theoretical interest as a variant of the probabilistic finite-state automaton (PFA), which is formally equivalent to the Hidden Markov Model (HMM) (Vidal et al., 2005a,b). The SP phonotactic model is surprisingly similar to the Factorial Hidden Markov Model (FHMM) (Ghahramani & Jordan, 1996; Durrieu & Thiran, 2013; Nepal & Yates, 2013) in that both synchronize over multiple Markov chains, enabling them to make predictions based on global context. In terms of underlying structure, however, the SP phonotactic model is more restrictive than the FHMM because of its deterministic nature, and therefore closely aligns with the phonological typology argued for in the Subregular Hypothesis. In other words, the SP phonotactic model provides fertile ground for understanding nonlocal phonological generalizations with a sufficiently expressive yet restrictive underlying structure.
This article is organized as follows: Section 2 introduces the Strictly Piecewise phonotactic model; Section 3 addresses the learning and evaluation of the SP phonotactic model; Section 4 applies the SP phonotactic model and learner to a case study of the laryngeal co-occurrence pattern in Quechua. A checklist of the notation and terminology involved is provided below.

Nonlocal phonotactics and Strictly Piecewise phonotactic model
Nonlocal/long-distance phonotactics is the speakers' knowledge of possible and impossible non-adjacent sound sequences (Gorman, 2013), which often indicates harmony patterns in input-output mappings. The current study characterizes nonlocal phonotactics by incorporating the structure of Strictly Piecewise grammars in order to generalize nonlocal phonotactics from noisy corpus data.

SP grammar and language
An SP grammar evaluates subsequences rather than substrings, as n-gram models do. Given the string abcd, the substrings of length 2 are {ab, bc, cd}, while the subsequences of length 2 are {a…b, a…c, a…d, b…c, b…d, c…d}.
One may convert an SP grammar into a probabilistic grammar by mapping each subsequence to a real number instead of a Boolean value (Heinz, 2010). An illegal subsequence such as *s…tSʰ is associated with a lower probability, e.g. 0.01, while a legal subsequence such as s…s receives a higher probability, e.g. 0.99. The parameters of subsequences in a probabilistic SP grammar are similar to violable constraints in constraint-based grammars (Prince & Smolensky, 1993; Smolensky & Legendre, 2006). A probabilistic grammar generates a stochastic language, i.e. a probability distribution over all possible strings A*, and assigns a low probability, instead of False, to an illegal string. One issue in any probabilistic grammar is that long words always receive lower probabilities (Daland, 2015); therefore, word length must be controlled when comparing word likelihoods (see Section 4).
Gouskova & Gallagher (2020) referred to SP languages as "nonlocal n-grams" and claimed that it is impossible to implement a computationally efficient search through nonlocal n-grams. The current study proposes a solution by encoding an SP grammar into an SP phonotactic model, which is a set of weighted deterministic finite-state automata (WDFAs). Figure 1 shows the SP phonotactic model banning {*a…a, *b…b} ("no a following a, no b following b"). Each transition carries a weight; these weights form the parameters of the SP phonotactic model, and the weights on transitions from the second state can be interpreted as the weights of subsequences. Formally, W(M, q, σ) ∈ [0, 1] is the parameter weight given a factored machine M, a segment σ, and the state q reached by its prefix.
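The contrast between substrings and subsequences can be made concrete with a few lines of Python. This is only an illustrative sketch; the function names `substrings` and `subsequences` are chosen here and do not come from the study's implementation.

```python
from itertools import combinations


def substrings(word, k):
    """Contiguous factors of length k (the local k-grams of the word)."""
    return {word[i:i + k] for i in range(len(word) - k + 1)}


def subsequences(word, k):
    """Possibly non-contiguous factors of length k, with order preserved."""
    return {"".join(chars) for chars in combinations(word, k)}


print(sorted(substrings("abcd", 2)))    # ['ab', 'bc', 'cd']
print(sorted(subsequences("abcd", 2)))  # ['ab', 'ac', 'ad', 'bc', 'bd', 'cd']
```

Here the subsequence `ad` corresponds to a…d in the notation above: the two symbols occur in that order, at any distance.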

SP phonotactic model and weighted automata
Analysts can interpret the parameters on the second state q_1 of each factored machine as schematized nonlocal phonotactics, as illustrated in Table 2, where y is the symbol emitted by the automaton after the preceding symbol x. In computing the probability of a symbol σ_i in a word w, the parameters on the multiple WDFAs are synchronized by the co-emission probability. The co-emission probability that a symbol σ_i is emitted after the SP phonotactic model reads the prefix σ_1 σ_2 … σ_{i−1} combines the weights W(M_j, q, σ_i) of all factored machines, where, for each WDFA M_j, q is the state that the WDFA is in after reading the prefix. The likelihood of a word w of length N is the product of the co-emission probabilities of its symbols, given the parameters Θ_M of the factored automata M. Figure 2 shows the path of ababa and the calculation of the co-emission probability.
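The computation described above can be sketched in Python for the toy grammar {*a…a, *b…b}. Several things here are assumptions rather than details taken from the study: the weights in `W` are invented for illustration, the co-emission probability is taken to be the product of the machines' weights normalized over the alphabet (the form commonly used for factored automata), and the end-of-word marker is omitted for brevity.

```python
import math

ALPHABET = ["a", "b"]

# Hypothetical weights W[machine][state][symbol]. Machine "a" moves from
# state 0 to state 1 once it has read an "a"; machine "b" does the same for "b".
# Low weights in state 1 encode the dispreferred subsequences a...a and b...b.
W = {
    "a": {0: {"a": 0.5, "b": 0.5}, 1: {"a": 0.1, "b": 0.9}},
    "b": {0: {"a": 0.5, "b": 0.5}, 1: {"a": 0.9, "b": 0.1}},
}


def coemit(states, symbol):
    """Normalized product of the weights that all machines assign to `symbol`."""
    def score(s):
        return math.prod(W[m][q][s] for m, q in states.items())
    return score(symbol) / sum(score(s) for s in ALPHABET)


def likelihood(word):
    """Product of co-emission probabilities over the symbols of `word`."""
    states = {m: 0 for m in W}  # every machine starts in state 0
    p = 1.0
    for sym in word:
        p *= coemit(states, sym)
        states[sym] = 1  # the machine tracking `sym` has now seen it
    return p


print(likelihood("ab"))  # 0.5 * 0.9
print(likelihood("aa"))  # 0.5 * 0.1
```

With these assumed weights, `aa` receives a lower likelihood than `ab` of the same length, because the second `a` is emitted from the state in which machine "a" has already seen an `a`.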

Figure 2: The derivation of ababa in a segment-based SP model
For example, after reading the first segment a, M_1 moves from state q_0 to state q_1 while M_2 remains in state q_0; the co-emission probability of the second segment b in ababa is then computed from the weights of b in these two states. The likelihoods of ab and ababa⋉ are obtained in the same way, as products of the co-emission probabilities of their segments. The SP phonotactic model in Figure 1 disfavors ababa relative to ab by assigning a lower probability to ababa. The SP phonotactic model can be applied to any natural language. The model in Figure […]
Table 3
A feature-based SP model is the product of factored WDFAs M_j in which the target symbols on transitions are feature values instead of segments. Figure 4 shows the feature-based SP model with respect to the simple feature system in Table 3. Each WDFA has a corresponding feature value. The probability that a feature value V_{F_j}(·) of segment σ_i is emitted after the SP phonotactic model reads the prefix is computed by co-emission over the WDFAs for the corresponding feature values. The model keeps track of the position of each segment while computing the co-emission probability with respect to the feature values of each segment.
The calculation of word likelihood in the feature-based SP model is the same as in the segment-based SP model. The baseline feature-based model assumes that the probability of one feature does not depend on any other feature. However, WDFAs can express parameters with a certain degree of featural interaction, such as *[+F, −G]…[+F, +G]. This issue might also be resolved by enriching the representation with natural classes, which is not treated in the current paper (see Chandlee et al. (2019) for a solution based on the partially ordered structure of the feature system).

Statistical learning in SP language model
This section addresses the learning problem in SP language model.

Learning problem in SP language model
Assume that the structure of the WDFA M is known, and let S be a finite sample of words drawn from the observed probability distribution D. The learning problem is to estimate the optimal parameters Θ̂_M of M so that the generated stochastic language approaches D as closely as possible. The parameters Θ_M are the parameter weights on the WDFA M.
lhd(S | Θ_M) is the product of the probabilities of all words in the sample, which might cause numerical underflow in practice. Instead, we transform this learning problem into log space, i.e. minimizing the negative log-likelihood (nll) of the sample: Θ̂ = arg min_{Θ_M} nll(S | Θ_M). Shibata & Heinz (2019) demonstrate that this learning problem is a convex optimization problem (Boyd & Vandenberghe, 2004), in which any local optimum is also the global optimum, so gradient-based algorithms can approach it. I applied the Adam algorithm (Kingma & Ba, 2014) to solve the optimization problem. The loss function is the calculated nll, and the parameter weights are initialized to 1. I set the learning rate to 0.005 and train the model for 20 epochs with a randomized 60/40 training/validation split. In each epoch, the learner is trained on the training data (60%), and the resulting model is tested on the validation data (40%). The gradient is obtained with the autograd package in PyTorch, which provides automatic differentiation for all operations in the forward algorithm. Adam is applied instead of Stochastic Gradient Descent (SGD), since SGD might very quickly drive the parameters of unobserved subsequences to 0, which might cause a log 0 issue.
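For reference, the objective described above can be written out as follows. This is a sketch in the notation of this section (lhd, nll, Θ_M); it assumes, as the text states, that the sample likelihood is the product of the word likelihoods, so the negative log-likelihood becomes a sum over words.

```latex
\begin{align*}
\mathrm{lhd}(S \mid \Theta_M) &= \prod_{w \in S} \mathrm{lhd}(w \mid \Theta_M) \\
\mathrm{nll}(S \mid \Theta_M) &= -\log \mathrm{lhd}(S \mid \Theta_M)
  = -\sum_{w \in S} \log \mathrm{lhd}(w \mid \Theta_M) \\
\hat{\Theta} &= \operatorname*{arg\,min}_{\Theta_M} \, \mathrm{nll}(S \mid \Theta_M)
\end{align*}
```

Replacing the product by a sum of logarithms is what avoids underflow: each term stays in a numerically safe range even when the product of many small probabilities would not.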

Evaluation
The proposed learner targets an unsupervised learning problem, in which only unlabelled positive evidence is presented in the learning data. Therefore, the learned model cannot directly predict categorical acceptability in the testing data. The learned SP model is evaluated with respect to perplexity and clustering, instead of accuracy in a classification task. Perplexity ρ(x) is the exponentiated entropy (averaged nll) of all phonemes in a dataset (Mayer & Nelson, 2020b).
Perplexity reflects the distance between the distribution predicted by a model and the testing data. The lower bound of perplexity is 1: the closer the perplexity is to 1, the better the learned model. The upper bound of perplexity is the number of possible random events |x|: when each event receives an equal probability p(x_i) = 1/|x|, the derivation yields ρ(x) = |x|. In clustering, the model assigns an nll to each word in testing data in which acceptability is labelled. The Mann-Whitney U test (Mann & Whitney, 1947) is applied to test whether the distributions of legal and illegal words are distinct. The Mann-Whitney U test is a nonparametric statistical method which counts the number of observations from the first distribution that precede each observation from the second distribution in magnitude. A nonparametric test avoids assuming any specific shape, e.g. a normal distribution, for the distributions being compared. In the current study, the quantity compared in this test is the nll, which is interpreted as the grammaticality of each word. The test yields a p-value which decides whether legal words are more likely to have a lower nll, i.e. a higher likelihood, than illegal words.
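Both evaluation measures can be sketched in a few lines of Python. The function names are chosen here for illustration, and the U statistic below only performs the pair counting described above (with ties counted as one half, a common convention assumed here); obtaining a p-value would additionally require a statistics library, for example `scipy.stats.mannwhitneyu`.

```python
import math


def perplexity(symbol_probs):
    """Exponentiated average negative log-likelihood over all symbols."""
    nll = -sum(math.log(p) for p in symbol_probs)
    return math.exp(nll / len(symbol_probs))


def mann_whitney_u(first, second):
    """Count the pairs in which a value from `first` precedes one from `second`."""
    return sum(
        1.0 if x < y else 0.5 if x == y else 0.0
        for x in first
        for y in second
    )


# Four equiprobable events reach the upper bound |x| = 4 (up to rounding).
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# Hypothetical nll values: every "legal" word has a lower nll than every
# "illegal" word, so all 2 * 2 pairs are counted.
print(mann_whitney_u([1.0, 2.0], [3.0, 4.0]))  # 4.0
```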

Case study: Quechua
The SP phonotactic model is applied to laryngeal co-occurrence patterns in (South Bolivian) Quechua.

Previous work and consequence
In Quechua, nonlocal stop-ejective and stop-aspirate pairs are ill-formed ("stops" here include plain voiceless stops, ejectives, and aspirated stops).
In the segment-based SP model, the perplexity is minimized to 4.8. Based on the data in Gouskova & Gallagher (2020), the baseline MaxEnt learner achieved a perplexity of 9.5. Their tier-based learner, surprisingly, achieved a higher perplexity of 12.9. This might suggest that their learned model has not converged, and that, although their tier-based learner performs well in generalizing nonlocal phonotactics, the learned distribution is in fact further away from the target distribution than that of the baseline learner.
The distributions of legal and illegal words are significantly distinct according to the Mann-Whitney U test: the p-value is 2.945 × 10^−132 for illegal-ejective vs. legal and 2.046 × 10^−185 for illegal-aspirate vs. legal, as illustrated in Figure 5. The magnitude is the negative log-likelihood, and each plot includes two subplots based on syllabic structure. The distributions are clustered into three categories: illegal-aspirate, illegal-ejective, and legal. This promising result is replicated in the feature-based model, where the model converges to a perplexity of 5.37. In the clustering task, the p-value is 9.806 × 10^−37 for illegal-ejective vs. legal and 2.113 × 10^−39 for illegal-aspirate vs. legal, as illustrated in the accompanying boxplot. The overlap between the negative log-likelihoods of legal and illegal words is correlated with the number of parameters in a phonotactic model. This is why legal and illegal words seem less distinct in the feature-based model, which always has more parameters than the segment-based model. Nonetheless, the statistical test supports the argument that the learned SP model distinguishes the distributions of legal and illegal words.
To summarize, the SP phonotactic learner successfully learned a model which assigns lower probability to illegal words than to legal words in Quechua.

Comparison with MaxEnt approach
In learning, the current study and the MaxEnt approach both follow the method of Maximum Likelihood Estimation, and obtain the optimal parameter weights by maximizing the likelihood of the observed forms (Mohri et al., 2018; Berger et al., 1996; Hayes & Wilson, 2008). Moreover, the implementation of the MaxEnt model relies on finite-state automata as well. In Hayes & Wilson (2008), each constraint is represented as one weighted finite-state automaton.
The key issue in learning phonotactics lies in the structure of the grammar, which is the abstract knowledge about the hypothesis space in learning. A reasonable and falsifiable approach relies on understanding and discovering the necessary and sufficient structure for local and nonlocal interactions. The current study and the MaxEnt approaches diverge significantly in this matter. Hayes & Wilson's (2008) Maximum Entropy learner hypothesizes local n-grams as parameters/constraints, and cannot efficiently detect nonlocal restrictions without postulating tiers/projections. For instance, suppose a learner only observes one string abcd; the learner will hypothesize that the observed n-grams are legal and have higher probabilities than any other possible n-grams. Meanwhile, unigrams provide the alphabet of the language.
n    legal local n-grams    illegal local n-grams
1    a, b, c, d             *e, *f, …
2    ab, bc, cd             *aa, *ac, *ad, …
3    abc, bcd               *aaa, *bbb, …
The parameter space will explode if the learner exhaustively searches local n-grams to approximate nonlocal interactions. For instance, the baseline MaxEnt learner will memorize the local trigrams *abc, *acc, *adc, … to approximate the nonlocal constraint *a…c. When the nonlocal phonotactics hold at arbitrary distances, the hypothesis space of the baseline MaxEnt learner grows exponentially (*abba, *abbba, *abbbba, …), as shown in Gouskova & Gallagher (2020). Gouskova & Gallagher (2020) induce tiers from local trigrams, instead of storing all local trigrams to approximate nonlocal interactions as in the baseline MaxEnt learner. For instance, after observing *abc, *acc, *adc, …, the learner will hypothesize {a, c} as one tier. The learner will further discover tier-based local n-grams as its constraints, such as *a…c. Constrained by the nature of local n-grams, their learner cannot induce tiers from nonlocal interactions at arbitrary distances, because the learner would have to keep track of all local n-grams for any n. Gouskova & Gallagher (2020) instead proposed a heuristic that only searches local trigrams, as the nonlocal phonotactics in their datasets mostly exist in CVC structures. Their approach, however, cannot directly induce nonlocal interactions over a wider window. The learner will not learn *C_1…C_2 if the learning data only contains evidence that the constraint holds outside of a trigram window, e.g. C_1VCCVVC_2V.
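The growth of the local hypothesis space can be illustrated with a small enumeration. The sketch below assumes a hypothetical four-symbol alphabet {a, b, c, d} and lists every local n-gram that would have to be penalized to approximate the single nonlocal constraint *a…c up to a given window size; the function name is chosen here for illustration.

```python
from itertools import product


def local_ngrams_for(first, last, alphabet, n):
    """All n-grams that start with `first` and end with `last`."""
    return ["".join((first, *mid, last)) for mid in product(alphabet, repeat=n - 2)]


alphabet = "abcd"
for n in range(2, 7):
    # One constraint per n-gram: the count is len(alphabet) ** (n - 2).
    print(n, len(local_ngrams_for("a", "c", alphabet, n)))
```

Each additional position in the window multiplies the number of required n-grams by the alphabet size, whereas the SP grammar states the same restriction with the single subsequence a…c.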
In contrast, the structure of the SP phonotactic model entails nonlocal interactions by nature, as shown in the current study. This approach does not predict the unattested blocking effect and closely aligns with the proposal in Agreement by Correspondence (Hansson, 2010; Rose & Walker, 2004), in which subsequences, but not tier-based substrings, are the source of harmony patterns. Another crucial difference between the SP phonotactic model and the MaxEnt approach is in the computation of word likelihood. The MaxEnt approach assumes that word likelihood is associated with the Harmony Score, which is defined as the weighted sum of a word's constraint violations (Σ_i w_i · C_i), while the SP phonotactic model calculates the product of the co-emission probabilities of each segment. As mentioned above, the computation of the Harmony Score is implemented by the intersection of weighted finite-state automata. However, it is an open question whether the Harmony Score can be applied to the SP phonotactic model.
Besides learning and grammar, the current study also makes a different representational assumption compared to the natural class-based representation in Hayes & Wilson (2008). Chandlee et al. (2019) have shown some promising results in learning natural class-based representations based on Model Theory (Libkin, 2013), while incorporating natural class-based representations into the SP phonotactic model is left to future studies. However, these proposals also assume tier-based local n-grams (or Tier-based Strictly Local languages, TSL; Heinz et al., 2011) as the hypothesis space, which predicts blocking effects in nonlocal phonotactics (Heinz, 2010). For instance, if a tier-based bigram model penalizes *sS on the tier [+strident], soSoz is penalized, since s and S are adjacent on the tier. In contrast, sozoS is accepted because the blocker z intervenes between s and S. In a probabilistic model, the blocker eliminates the potentially illegal substring *sS on the tier, and sozoS receives a higher probability than soSoz. The blocking effect exists in feature-based representations as well (see Figure 8). In contrast, searching subsequences in the SP language model prevents the blocking effect. If the SP grammar is *s…S, sozoS and soSoz are both penalized. This is true even when featural representations are entertained.
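The difference in predictions can be checked with a categorical toy version of the two grammars. This is only a sketch: the tier is hard-coded as the hypothetical set {s, S, z} standing in for [+strident], and both functions return a Boolean violation rather than a probability.

```python
from itertools import combinations

STRIDENTS = frozenset("sSz")  # hypothetical [+strident] tier


def violates_tsl(word, banned="sS", tier=STRIDENTS):
    """Tier-based bigram: project the tier, then look for an adjacent banned pair."""
    projection = "".join(ch for ch in word if ch in tier)
    return banned in projection


def violates_sp(word, banned="sS"):
    """Strictly Piecewise: look for the banned pair as a subsequence."""
    return tuple(banned) in set(combinations(word, len(banned)))


for word in ("soSoz", "sozoS"):
    print(word, violates_tsl(word), violates_sp(word))
# soSoz True True
# sozoS False True
```

The tier projection of sozoS is szS, where z separates s from S, so the tier-based grammar lets the word through; the subsequence check ignores intervening material and flags both words.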

Tier-based n-grams vs. subsequences
The choice between tier-based n-grams and subsequences turns out to be a typological issue. Previous studies have shown that blocking effects are not compatible with most long-distance agreement patterns (Heinz, 2010; Rose & Walker, 2004; Hansson, 2010), with rare exceptions (McMullin, 2016). Specifically, blocking effects are not attested in the Quechua datasets. On the other hand, blocking effects are observed in some long-distance disagreement patterns, such as Latin liquid dissimilation (McMullin, 2016).
To summarize, the SP language model can capture long-distance agreement patterns without additional tier structure, and this appears to make the correct predictions for both the Quechua data and the typology of assimilatory harmony systems. Future research needs to examine other patterns, such as disharmony, more closely with both SP and TSL language models.

Conclusion
The current study has proposed a probabilistic SP phonotactic model and a learning algorithm. Through a case study of the Quechua laryngeal co-occurrence pattern, this paper shows that the SP phonotactic model precisely characterizes nonlocal phonotactics and that the proposed learner generalizes both segmental and featural representations from noisy corpus data. Inspired by the FHMM (Ghahramani & Jordan, 1996; Durrieu & Thiran, 2013; Nepal & Yates, 2013) and a state-of-the-art optimization algorithm (Kingma & Ba, 2014), the implementation of the SP phonotactic model and learner bridges the gap between the theoretical FLT approach and statistical learning, and can be further applied to other datasets from natural languages.
This paper also draws a comparison between the structural assumptions of local n-grams in MaxEnt approaches and nonlocal n-grams, or Strictly Piecewise languages, in SP grammars. The current study sheds light on the scientific study of the necessary and sufficient structure for learning both local and nonlocal phonotactics.
There are several future directions. First of all, it is an open question whether the SP phonotactic model can be incorporated into a constraint-based grammar by modifying the computation of word likelihood as the intersection of weighted finite-state automata (Hayes & Wilson, 2008). Conversely, future work is required in the MaxEnt approach to compute the harmony score by means of the co-emission probability, as in the SP phonotactic model, which naturally implicates nonlocal interaction. Another possible area of future research would be to extend the proposed phonotactic model and learner to other subregular languages, such as the Strictly Local and Multi-tier Based Strictly Local languages (Heinz, 2018; Lambert & Rogers, 2020). The learning problem for any subregular language is the same as for SP languages (Θ̂ = arg min_{Θ_M} nll(S | Θ_M)). Moreover, Shibata & Heinz (2019) have shown the convexity of this learning problem as long as the word likelihood of the specific subregular language is defined with respect to the co-emission probability. One can easily modify each factored WDFA in the SP phonotactic model to represent one tier, and model multi-tier interactions through the co-emission probability. Finally, the current study can be extended to modeling input-output phonological maps as Probabilistic Finite-state Transducers (Vidal et al., 2005b).