The Calibrated Error-Driven Ranking Algorithm as a Solution to Oscillation in Antagonistic Constraints: A Necessary Bias for Algorithmic Learning of Kihnu Estonian

This paper investigates the learning of Kihnu Estonian, a minority dialect of Estonian (Balto-Finnic). I propose a set of constraints to account for Kihnu Estonian vowel harmony patterns and show that they can be used to produce a restrictive grammar for Kihnu Estonian vowel harmony. With this constraint set, I model the acquisition of Kihnu Estonian vowel harmony via the Gradual Learning Algorithm (Boersma and Hayes, 2001). Antagonistic constraints in the set I adopt pose obstacles to successful learning of the vowel patterns attested in the learning data. These obstacles can be circumvented by using the update rule from the Calibrated Error-Driven Ranking Algorithm (Magri, 2012). This update rule has been argued to be detrimental to learning variation in stochastic OT. However, though it was originally proposed to address the Credit Problem (Dresher, 1999), I show that it is in fact an elegant solution to the learning problems caused by oscillating constraints when modeling the acquisition of Kihnu Estonian vowel harmony.

The front marked vowels pattern together in KE, as do the back unmarked vowels. Thus throughout the paper I will refer to the following sets:

(2) F m : the set of front marked vowels {ae, ø, y}
(3) B u : the set of back unmarked vowels {A, o, u}

Section 2.2 tackles the question of what kind of constraint-based framework is required to account for VH in KE. The Balto-Finnic languages and their various dialects, Estonian included, exhibit varying types and degrees of front/back VH. K&P present a set of constraints that aims to characterize the diversity of harmony behaviour across these related languages. However, their constraints fail to account for a particular type of disharmony (specifically, transparency) that is characteristic of KE. The constraint set I propose in Section 2.2.1 is adapted from K&P's, but includes some crucial additions to ensure that it can account for the VH patterns of KE.
Following this, Section 2.3 addresses the question of what assumptions and biases are necessary for a constraint-based learning algorithm to successfully learn a KE grammar. I focus on one bias in particular that proved crucial for success: the update rule used in the CEDRA. Section 2.3 also presents the results of learning simulations, after which Section 3 presents an in-depth discussion of the conditions under which CEDRA's update rule is a benefit or a hindrance to learning in a stochastic OT environment. Finally, Section 4 concludes. (4) through (9) below summarize K&P's constraints that are relevant to the grammar of KE.
(5) *7: if a vowel is nonlow and unrounded, it must be front.
KE VH is rooted in agreement of the feature [back]. I assume access to a vowel tier, and build all of the more specific harmony constraints that follow on this foundation:

(6) Agree(Back): adjacent segments on the vowel tier must have the same value of the feature [back]. Abbreviated as Agr(Bk).
However, Agr(Bk) on its own cannot describe the co-occurrences that surface in KE. For example, /i..A/ is a valid vowel sequence even though it is disharmonic, because /i/ is transparent. A harmony constraint such as the one in (7) permits sequences such as /i..A/ (disharmonic) and /i..ae/ (marked) but disallows those such as /ae..A/ (both marked and disharmonic). It is constructed using constraint conjunction:

(7) Marked Vowel Harmony for F m : Agr(Bk) & *F m . A word may not contain both a pair of disharmonic syllable-adjacent vowels and a vowel in F m . Abbreviated as VH(F m ).
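To make the conjunction concrete, the two constraints above can be sketched as simple violation counters over a word's vowel tier. This is a minimal illustration, not the paper's implementation: the word representation, symbol names, and the set of [+back] vowels here are my assumptions (the symbols follow the paper's X-SAMPA-style transcription).

```python
# Illustrative sketch of Agr(Bk) in (6) and the conjoined VH(F_m) in (7).
# A word is represented as a tuple of vowel symbols on the vowel tier.

FRONT_MARKED = {"ae", "ø", "y"}       # F_m
BACK = {"A", "o", "u", "7"}           # assumed [+back] set; /i/, /e/ are front

def is_back(v):
    return v in BACK

def agree_back(word):
    """Agr(Bk): one violation per adjacent vowel pair disagreeing in [back]."""
    return sum(1 for a, b in zip(word, word[1:]) if is_back(a) != is_back(b))

def vh_front_marked(word):
    """VH(F_m) = Agr(Bk) & *F_m: violated only when the word contains both
    a disharmonic adjacent pair and a front marked vowel."""
    disharmonic = agree_back(word) > 0
    has_marked = any(v in FRONT_MARKED for v in word)
    return 1 if (disharmonic and has_marked) else 0

# /i..A/ is disharmonic (violating low-ranked Agr(Bk)) but contains no F_m
# vowel, so VH(F_m) is satisfied: /i/ can behave transparently.
print(agree_back(("i", "A")))        # 1
print(vh_front_marked(("i", "A")))   # 0
print(vh_front_marked(("ae", "A")))  # 1
```

The conjunction is what lets /i..A/ escape the high-ranked harmony constraint while /ae..A/ does not.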
Faithfulness constraints ensure that harmony is progressive (via the privileged status of vowels in initial syllables) while also regulating the influence of markedness constraints on all vowels in general. I include both a position-specific as well as a general identity constraint for the feature [back].
(8) Ident-σ 1 (Back): an [α back] input segment in an initial syllable must not have a [-α back] output correspondent. Abbreviated as Id-σ 1 (Bk).

K&P's proposed constraints and rankings successfully describe (dis)harmony in various Estonian dialects as well as other Balto-Finnic languages. However, their analysis cannot account for the transparent behaviour of /i/ in KE without also inadvertently characterizing /e/ as transparent. See (10), in which harmony is enforced when vowels other than /e/ are involved, vs. (11), in which /e/'s behaviour is transparent rather than harmonic. This is because transparency under their account is a property of unmarked front vowels, a description that matches both /i/ and /e/. To address this issue, I propose several additional constraints below.

KE requires the ability to penalize /e/ in particular harmony-related contexts. Therefore, to facilitate the construction of harmony constraint (13) below, I adopt this third segmental markedness constraint:

(12) *e

In order to ensure that disharmony with /e/ is penalized, we need a VH constraint involving /e/. If such a constraint is not included in the set, then /e/ (as an unmarked front vowel) falls into the same patterns as transparent /i/, even though it does have a back counterpart and participates in harmony. I propose this additional marked harmony constraint for /e/:

(13) Marked Vowel Harmony for /e/: Agr(Bk) & *e. A word may not contain both a pair of disharmonic syllable-adjacent vowels and the vowel /e/. Abbreviated as VH(e).
The next constraint is not strictly necessary for building a grammar of KE; however, simulations show that successful gradual learning of Standard Estonian requires a counterbalance for *F m (Vesik, 2022); thus I adopt a fourth segmental markedness constraint:

(14) *B u (*A, *o, *u)

I propose the addition of a third and a fourth conjoined harmony constraint, which penalize disharmony in the domain of /7/ and B u respectively:

(15) Marked Vowel Harmony for /7/: Agr(Bk) & *7. A word may not contain both a pair of disharmonic syllable-adjacent vowels and the vowel /7/. Abbreviated as VH(7).

Ranking
To ensure that the constraints proposed are in fact capable of characterizing KE, I use the OTSoft (Hayes et al., 2013) implementation of the Low-Faithfulness Constraint Demotion algorithm (LFCD; Hayes, 2004) to demonstrate that a restrictive grammar for KE can be constructed from these constraints, given appropriate learning data. In order to mimic a human learner, I use only positive evidence (that is, underlying forms are assumed to be identical to surface forms) as learning data for the simulations. Once the learner has completed its batch learning process, I test it on a larger set of ungrammatical forms to assess what it has learned about the grammar of KE. See Table 1 for sample data.
[Table 1: learning from faithful mappings; testing on ungrammatical inputs.]

The LFCD does in fact succeed in producing a correct, restrictive grammar for KE. It installs markedness constraints in higher strata than faithfulness constraints wherever possible. This inherent bias, which Hayes (2004) refers to as Favour Activeness, implements a phonotactic learner in that, rather than interpreting the learning data as being entirely faithful, the algorithm prioritizes markedness constraints that are satisfied by the learning data. The LFCD first installs markedness constraints that are never violated; that is, the harmony constraints for F m and /e/. Following this, the faithfulness constraints that determine how to resolve harmony errors are installed, with specific faithfulness prioritized over general faithfulness (Hayes's (2004) Favour Specificity). At this point, all winning candidates have been identified, and the remaining markedness constraints are installed in the lowest stratum.
(17) LFCD-generated ranking for KE: The tableaux in (18) and (19) show how this ranking enforces harmony between participating vowels, while preserving the value of [back] for the vowel in the first syllable as well as treating /i/ transparently. In (18) we see that Id(Bk) must be ranked below all of the top three constraints. (18a) shows that VH(e) must outrank Id(Bk) in order to prevent the fully faithful candidate [e..7] (which fails to harmonize for /e/) from being preferred to one that violates faithfulness but obeys harmony for /e/; a parallel argument can be made for VH(F m ) outranking Id(Bk). (18b) shows that Id-σ 1 (Bk) must outrank Id(Bk) in order to prevent the candidate [B u ..B u ..B u ], which achieves harmony by changing only the first vowel's value of [back], from being selected as the optimal candidate over a candidate such as [F m ..F m ..F m ], which achieves harmony by violating Id(Bk) twice in non-initial vowels. In other words, Id-σ 1 (Bk) ≫ Id(Bk) is necessary in order to avoid Majority Rule harmony (Lombardi, 1999; Baković, 2000).
In (19) we see that these same rankings hold for inputs with a medial (or later) /i/ as well, because the VH constraints are defined over the entire word rather than being restricted to immediate neighbours. /i/ does not participate in harmony but is transparent. Subsequent non-/i/ vowels harmonize to the first, even if they are [+back] and therefore disagree with /i/.
Given the subset of constraints included in the tableaux of (18) and (19), it appears that the last candidate in each of (18a) and (19) might be harmonically bounded by the penultimate. However, the extended view of tableau (18a) shown in (20) demonstrates that this is not the case. Broadening our view to consider all constraints at once confirms that the segmental markedness constraints in the lowest stratum (in particular *7 and *e) have a role to play in avoiding harmonic bounding.

(20) [Extended tableau for (18a): candidates a. e..7, b. e..e, c. 7..7, d. 7..e, with violations across the full constraint set; fatal violations rule out all candidates except b. e..e.]

As mentioned in Section 2.1, the /e/-/7/ pair exhibits variable participation in KE harmony. Although this project assumes categorical behaviour on the part of this pair, it is crucial to note that this variability does exist, as it is what informs the work in Section 2.3. A batch learner such as the LFCD is capable of producing only discrete, ordered rankings and therefore of capturing only categorical patterns. Since my eventual goal is to investigate learning simulations that better parallel reality, in that they are both gradual and able to incorporate and reflect the variation in KE, this approach will not be suitable for the broader problem. Instead, I will take a step past the success of the LFCD and consider an algorithm that both learns from and is able to replicate noisy data; that is, the GLA (Boersma & Hayes, 2001). This is the algorithm that will be used to model acquisition of variable patterns in future work.

Learning simulations
This research was carried out via learning simulations focusing on acquisition of KE grammar. In order to simulate a human learner acquiring KE, I chose to use a phonotactic learning model, which is presented with only positive learning evidence (that is, it assumes that the underlying form is identical to the surface form). The LFCD is a batch learner and functionally assumes that absence of evidence is evidence of absence. This produces ranking (17) presented in Section 2.2.2.
As mentioned above, the GLA can produce a variable grammar based on input frequencies. It is an online learner, processing one piece of data at a time: it evaluates an input using the current constraint ranking and adjusts ranking values in response to errors. Since the GLA works gradually, without seeing the entire dataset at once, it cannot draw the same conclusion that the LFCD does and treat lack of evidence as negative evidence. This leaves unviolated constraints unable to rise, which I discuss further in Section 3.
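The error-driven trial just described can be sketched schematically. This is a minimal illustration of one GLA trial under the standard symmetric update (Boersma & Hayes, 2001), not the paper's implementation: the function name, candidate representation, and default parameter values are my assumptions.

```python
import random

def gla_trial(values, winner_viols, loser_viols, plasticity=2.0, noise=2.0, rng=random):
    """One schematic GLA trial.
    values: dict constraint -> ranking value.
    winner_viols / loser_viols: violation counts for the observed datum
    (winner) and a competing candidate (loser)."""
    # Sample a noisy evaluation value for each constraint, then rank by it.
    noisy = {c: v + rng.gauss(0, noise) for c, v in values.items()}
    order = sorted(values, key=lambda c: -noisy[c])
    # The highest-ranked constraint that distinguishes the candidates decides.
    for c in order:
        w, l = winner_viols.get(c, 0), loser_viols.get(c, 0)
        if w != l:
            learner_prefers_winner = w < l
            break
    else:
        return values  # candidates tie: nothing to learn
    if learner_prefers_winner:
        return values  # grammar already picks the winner: no error
    # Error: demote loser-preferring constraints, promote winner-preferring
    # ones, both by the full plasticity (the symmetric rule; the CEDRA rule
    # discussed below would scale the promotions down).
    updated = dict(values)
    for c in values:
        w, l = winner_viols.get(c, 0), loser_viols.get(c, 0)
        if w > l:
            updated[c] -= plasticity
        elif w < l:
            updated[c] += plasticity
    return updated
```

With noise set to 0 for illustration, `gla_trial({"Markedness": 100.0, "Faith": 90.0}, {"Markedness": 1}, {"Faith": 1})` detects an error and moves the pair to 98.0 and 92.0.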

Data
The learning data for these simulations are based on the KE subset of the Estonian Dialect Corpus (Lindström, 2013). Appendix A and especially Vesik (under review) contain more detailed information about the raw data, but here I will focus on how the data were adapted and presented as learning inputs to the GLA. Since I am idealizing the vowel patterns in both dialects to be categorical, the small amounts of variation attested in the corpus are ignored by omitting the disharmonic KE forms when determining the relative input frequencies of data for the learner.
Initial bigrams and trigrams of vowels were extracted from the KE subset and the frequency of each type calculated. I then determined frequencies (relative to 1000 inputs) of each grammatical n-gram using the pair of equations in (21), for KE bigrams and trigrams.

(21) x′_i = (x_i / n) × 1000; X_i = ⌈x′_i⌉

where x_i = number of instances of the i-th n-gram, n = total number of n-grams, x′_i = relative frequency of n-gram i proportional to a total of 1000, and X_i = number of times to use form i as an input to the GLA.
This ensured that even grammatical forms with very low representation in the corpus could inform the learning process (i.e., small counts were not rounded down to 0 out of 1000).

Table 3: Biases and the result of omitting each.
Bias | Result if omitted
Low initial faithfulness (Gnanadesikan, 1995) | Fully-faithful grammar
Specific over general faithfulness (Hayes, 2004) | VH present but initial-syllable vowel not preserved

In addition to the biases mentioned in Table 3, it was necessary to introduce a third bias into the GLA for it to be able to learn a correct grammar for KE: the calibrated re-ranking rule as per the CEDRA, introduced by Magri (2012).
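The scaling just described can be sketched as follows. Reading "not rounded down to 0" as rounding up is my assumption; the function name and toy counts are illustrative, not from the original.

```python
import math
from collections import Counter

def scaled_frequencies(ngram_counts, total=1000):
    """Scale each n-gram count to a share of `total` inputs, rounding up so
    that rare but grammatical forms still appear at least once. (Because of
    the ceiling, the scaled counts may sum to slightly more than `total`.)"""
    n = sum(ngram_counts.values())
    return {g: math.ceil(total * x / n) for g, x in ngram_counts.items()}

# Toy counts: a singleton n-gram still receives a nonzero number of trials.
counts = Counter({"A..A": 800, "e..e": 150, "y..y": 1})
freqs = scaled_frequencies(counts)
print(freqs["y..y"])   # 2, rather than rounding 1000*1/951 ≈ 1.05 down to 1 or 0
```

A plain round or floor here would silence the rarest grammatical forms entirely, which is the failure mode the ceiling avoids.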

Parameters and biases
In order to acquire a grammar equivalent to (17), the parameters of the learning algorithm must be set such that the markedness constraints other than VH(F m ) and VH(e) are demoted far enough, and quickly enough, that the faithfulness constraints can surpass them (without also rising above VH(F m ) and VH(e)), with enough room that noisy evaluation does not cause any unintended variation. Without a third bias ensuring that constraints are demoted more aggressively than they are promoted, the necessary movement cannot occur. Such a bias can be implemented by employing the update rule from Magri's (2012) CEDRA, in which demotions are always applied in full force, but the effect of each individual promotion depends on both the number of demotions and the number of promotions being applied:

(22) promotion amount = (number of constraints demoted) / (1 + number of constraints promoted) × plasticity

Since the majority of updates in the KE learning process involve at least as many promotions as demotions, the CEDRA update rule effectively reduces the amount by which constraints are promoted in most learning trials.

Results
With parameters set as specified in Section 2.3.2, simulations with 1000 or more trials per batch converged to a grammar that correctly describes the VH of KE; see constraint trajectories and final values in Figure 2. This result aligns with ranking (17) produced by the LFCD. The ordering of constraints whose values differ by 10 or more is almost never affected by evaluation noise; the resulting grammar can therefore be summarized as in ranking (23).

(23) GLA-learned ranking for KE, using CEDRA update rule: Id-σ 1 (Bk); VH(F m ); VH(e) ≫ Id(Bk) ≫ VH(7); *F m ; *e; *7; VH(B u ); *B u ; Agr(Bk)

Simulations with KE inputs fail when the CEDRA is not applied. The learning trajectories and results presented in Figure 3 show that the top-ranked markedness constraints are the correct ones, but they nevertheless end up below Id-σ 1 (Bk) and Id(Bk), resulting in a fully faithful grammar as in (24).
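The claim that a ranking-value gap of 10 or more is almost never reversed by evaluation noise can be checked with a quick simulation. Assuming a per-constraint evaluation noise with standard deviation 2.0 (a common GLA setting; the paper's actual parameter values are in its Table 2 and not reproduced here), a reversal requires the noise difference to exceed the gap, which happens on the order of once in several thousand evaluations.

```python
import random

def flip_rate(gap=10.0, noise=2.0, trials=200_000, seed=1):
    """Estimate how often two constraints whose ranking values differ by
    `gap` are reordered when each receives independent Gaussian noise."""
    rng = random.Random(seed)
    flips = sum(
        1
        for _ in range(trials)
        if rng.gauss(0, noise) > gap + rng.gauss(0, noise)
    )
    return flips / trials

print(flip_rate())   # a very small rate, on the 1e-4 scale under these assumptions
```

Analytically, the difference of two N(0, 2) draws is N(0, 2√2), so a flip needs a deviation of about 3.5 standard deviations, matching the simulated rarity.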

Constraint set

K&P frame their work in terms of the typology of VH patterns in many Balto-Finnic languages; however, their typology does not include all related languages. In particular, their constraint set is not able to accommodate KE. In light of this, it is clear that some additions and/or adaptations had to be made to their constraint set.

Figure 3: Changes in constraint values over 200,000 trials of KE learning data. The horizontal axis is truncated, as there was only a single learning error after trial 10,000.

In the rankings presented in this paper (LFCD-learned and GLA-learned), the constraints VH(7) and VH(B u ) are installed in the lowest strata, rendered inactive by the higher-ranked constraints. Nor are they referred to indirectly in the way that, for example, Agr(Bk) is via its inclusion in the definitions of the conjoined constraints. The justification for including these constraints appeals not to strategy but to symmetry. The need for VH(F m ) is clear from the work it does in identifying winning candidates. *e is motivated by the need for VH(e), in whose definition it is included. While *7 and *B u , both necessary for different reasons, do not require the existence of their corresponding VH constraints, I have included the harmony constraints for the sake of continuity and symmetry: if *F m with VH(F m ) and *e with VH(e) are members of the set (that is, if CON is permitted to conjoin Agr(Bk) with some segmental markedness constraints), then *B u along with VH(B u ) and *7 along with VH(7) should be as well. With that said, I have run simulations without these two VH constraints, and they succeed (as predicted, given their inactivity in the rankings presented). This is worth noting, as the additional constraints may add a much greater degree of complexity to this typology than is necessary or justified.
However, notwithstanding the potential problems associated with including these constraints, the constraint set presented here provides the opportunity to highlight that oscillation which is detrimental in some learning situations can be beneficial in others.

Utility of the CEDRA update rule
Due to the nearsightedness of the GLA and the use of positive evidence only, markedness constraints that are never violated by the learning data (e.g., VH(F m ) and VH(e) in KE) are highly unlikely to ever be violated by a generated output; this can happen only because of evaluation noise. They therefore have negligible opportunity to be promoted as a result of such an error. However, the symmetrical properties of /e/ and /7/ (due to their segmental markedness as well as their harmony constraints) keep *e and *7 relatively stable with respect to each other, and also fairly close to their initial values, since errors that promote one demote the other and vice versa; see (25). If the general faithfulness constraint is permitted to rise above the markedness constraints while their values oscillate, we run the risk of producing a strictly faithful grammar, which accounts for all of the learning data but for no potential unfaithful test data.
(25) Violation profile for a sample GLA learning error.

Magri's (2012) CEDRA is motivated by two goals. The first is to enable a learning algorithm to incorporate promotion as well as demotion, in order to permit faithfulness to start low, for example, and to allow rankings to be adjusted as new learning inputs are encountered. The second is to avoid full-fledged promotion of constraints when an Elementary Ranking Condition (ERC; Prince, 2002) contains two or more constraints that prefer the intended winner, so as to avoid overpromoting when it is not clear which of those constraints should be credited with preference of the winner (the Credit Problem; Dresher, 1999). The promotion amount is thus shared among the winner-preferring constraints. This effectively creates a learner whose demotions tend to be greater than its promotions, successfully "creating space" for faithfulness between the unviolated markedness constraints, VH(F m ) and VH(e), and the rest (Figure 4).

Magri & Storme (2020) revisit the CEDRA, arguing that it does not in fact live up to Magri's (2012) original claim of solving the GLA's convergence problem. In the Ilokano metathesis test case that they investigate, two constraints, Linearity and *P], must crucially have ranking values that are (a) equal to each other and (b) greater than that of MaxIO(P). However, due to the variation inherent to this grammar, Linearity and *P] end up oscillating during the learning simulation. Ideally, the overall trajectories of these oscillating constraints should be relatively stable over time; other algorithms such as the GLA and the Minimal GLA (Boersma, 1997, 1998) are able to produce such behaviour by valuing promotions and demotions equally. However, because of the CEDRA's calibrated promotion calculation, this pair of constraints oscillates continually downward instead of around an equilibrium value, and the learner is not able to converge on a grammar.
This type of downward oscillation is precisely what makes the CEDRA update rule so useful in the learning of KE VH, whereas it is the downfall of the CEDRA's attempt to learn Ilokano metathesis. The two cases are underlyingly different, however: the oscillation in KE springs from the violation profiles (that is, it is inherent to the constraints themselves), as in (25), whereas the oscillation in Ilokano is caused by conflicting ERCs (due to variation in the inputs), as in Figure 5. In the first case the downward oscillation helps to move antagonistic constraints "out of the way" of active ones, whereas in the second we would want the learning algorithm to recognize the oscillation as indicative of variation and stop updating the conflicting pair.
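The contrast between symmetric and calibrated updates on an antagonistic pair can be demonstrated with a toy simulation. This is purely illustrative (the alternating-error schedule, starting values, and plasticity are my assumptions, not measurements from the KE simulations): each error demotes one of the pair by the full plasticity and promotes the other; with one demotion and one promotion per update, the calibrated rule of (22) halves every promotion.

```python
def run(updates=100, plasticity=2.0, calibrated=True):
    """Alternate errors between an antagonistic pair (think *e and *7).
    Symmetric updates leave the pair circling its starting value; the
    calibrated rule makes both values drift steadily downward."""
    promo = (1 / (1 + 1)) * plasticity if calibrated else plasticity
    star_e, star_7 = 100.0, 100.0
    for t in range(updates):
        if t % 2 == 0:
            star_e, star_7 = star_e - plasticity, star_7 + promo
        else:
            star_e, star_7 = star_e + promo, star_7 - plasticity
    return star_e, star_7

print(run(calibrated=False))  # (100.0, 100.0): oscillation around equilibrium
print(run(calibrated=True))   # (50.0, 50.0): net downward drift
```

The downward drift is the "creating space" effect: both members of the pair sink below faithfulness while their relative order stays stable.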

Conclusion
The constraint set proposed by K&P to account for the VH patterns of Balto-Finnic languages is insufficient to describe all such languages, in particular KE. Adaptations to this set were necessary in order to be able to produce a correct grammar for KE. Once these adaptations were made, the constraint set presented in this paper was shown via the LFCD to successfully produce a ranking that corresponds to a correct and restrictive grammar for the VH of KE. An existing corpus (the Estonian Dialect Corpus) was processed and analyzed to inform the learning data and distributions for a gradual learner. When using the GLA (as implemented by Hayes et al. (2013) as well as by the author) to learn the vowel phenomena for either dialect, the CEDRA update rule must be used, even though Magri & Storme (2020) argue against it in later work. This finding shows that the oscillation in constraint values caused by contradictory learning updates springing from (a) ERCs (due to variation in learning noise) vs. (b) violation profiles (due to antagonistic constraints) can be, respectively, either detrimental or beneficial to converging on a correct grammar.

Table 7 demonstrates the conflicting ERCs (3 and 4) after some simplification.

A Corpus
The Estonian Dialect Corpus (Lindström, 2013) was used as the data source for the learning portion of this project. The corpus comprises a total of over 1.2 million words transcribed from spontaneous speech of native Estonian speakers recorded between 1938 and 1996. Speaker dialects are identified via dialect group and parish. The KE subset of the corpus was extracted by restricting entries to those from the Islands dialect group, Kihnu parish. These entries were analyzed for word type counts, identification of monophthongs, and adherence to expected harmony patterns; see Table 4.
Table 4: KE subset of the Estonian Dialect Corpus.
Total entries (word tokens): 21,599
Total entries (word types): 5052
Types with ≥ 2 monophthongs: 3375
…of which ≥ 3: 1200
Word-initial monophthong bigrams: 1973
Word-initial monophthong trigrams: 1173
Harmonic word-initial monophthong bigrams: 1935
Harmonic word-initial monophthong trigrams: 1052

Vowels in the sets {ae, ø, y} and {A, o, u} were collapsed and replaced with their representative symbols F m (front marked vowels) and B u (back unmarked vowels). Then, in order to inform the relative frequencies of learning data as described in Section 2.3, the number of instances of each word-initial vowel bigram or trigram was counted.
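The collapsing-and-counting step just described can be sketched as follows. The vowel inventory, cover-symbol names, and list-of-vowels word representation are simplified assumptions for illustration; tokenization and monophthong identification in the actual corpus processing are more involved.

```python
from collections import Counter

FRONT_MARKED = {"ae", "ø", "y"}     # collapsed to cover symbol Fm
BACK_UNMARKED = {"A", "o", "u"}     # collapsed to cover symbol Bu

def collapse(v):
    """Replace F_m and B_u vowels with their cover symbols; /i/, /e/, /7/
    remain distinct, since they pattern individually in KE."""
    if v in FRONT_MARKED:
        return "Fm"
    if v in BACK_UNMARKED:
        return "Bu"
    return v

def initial_bigram(vowel_tier):
    """First two monophthongs of a word, collapsed; None if too short."""
    if len(vowel_tier) < 2:
        return None
    return (collapse(vowel_tier[0]), collapse(vowel_tier[1]))

# Toy word list: count word-initial vowel bigram types across the corpus.
words = [["A", "u"], ["e", "7"], ["y", "ae"], ["i"]]
counts = Counter(bg for bg in map(initial_bigram, words) if bg is not None)
print(sorted(counts))   # [('Bu', 'Bu'), ('Fm', 'Fm'), ('e', '7')]
```

The resulting bigram (and, analogously, trigram) counts are what feed the frequency scaling of Section 2.3.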

B Python implementation of GLA
The implementation of the GLA provided by OTSoft (Hayes et al., 2013) offers the user several ways of modifying the values or processes used by the algorithm. Of particular relevance to this project are the parameters (Table 2) and biases (low faithfulness, specific over general faithfulness, and CEDRA update rule) described in Section 2.3.2. Most of these options can be selected and varied independently of, but in concert with, the others. However, when both a priori rankings (for biasing specific over general faithfulness) and the CEDRA update rule are selected, the OTSoft ranking history file shows that only the a priori ranking is maintained; the CEDRA update rule is not simultaneously employed. In order to address this gap, I wrote a Python (Van Rossum & Drake, 2009) script to run the GLA, ensuring that the specifics of the implementation would facilitate the simultaneous application of all three necessary biases.