Learning Stress with Feet and Grids

This paper investigates quantity-insensitive stress learning using the MaxEnt learner of Pater and Prickett (2022) and compares the performance of the learner equipped with three different constraint sets: a foot-based constraint set and two grid-based constraint sets, one drawn directly from Gordon (2002), and one that changes the formulation of the main stress constraint to match the foot-based learner. The learner equipped with the foot-based constraint set succeeds at learning all the languages from the Gordon (2002) typology that it can represent; the structural ambiguity of the foot-based representations is not a problem in this regard. The foot-based learner also learns the languages as quickly in terms of number of epochs as the faster of the grid-based learners, which is the one with the revised main stress constraint. We conclude that the foot-based learner and the grid-based learner fare similarly well in this initial comparison on a typologically grounded set of learning problems.


Constraint sets and learning problems
Like the constraint set in TS, our foot-based constraint set draws on the proposals in Prince & Smolensky (1993/2004) and McCarthy & Prince (1993). It differs from TS in three ways. First, it adopts the standard formulation of the constraint preferring initially stressed feet, which we call 'TROCHAIC'. As PP discuss, the TS constraint that prefers trochees, which also penalizes monosyllabic feet, does not allow for an analysis of Latin and similar quantity-sensitive languages. Second, our 'NONFINALITY' constraint assigns an extra penalty for final stress, following Prince & Smolensky (1993/2004). Third, we adopt a proposal from Kager (2005) to eliminate the constraints that Tesar and Smolensky call 'WORDFOOTR' and 'WORDFOOTL', which penalize words whose final or initial syllables, respectively, are not footed.
(1) Foot-based constraint set
FTBIN: Assign a violation for each foot that consists only of a light syllable.
PARSE: Assign a violation for each syllable that is not in a foot.
IAMBIC: Assign a violation for each initially stressed disyllabic foot.
TROCHAIC: Assign a violation for each finally stressed disyllabic foot.
NONFINALITY: Assign one violation if the final syllable of a word is footed, and a second violation if the final syllable is also stressed.
MAINR: Assign a violation for every syllable intervening between the syllable with main stress and the right edge of the word.
MAINL: Assign a violation for every syllable intervening between the syllable with main stress and the left edge of the word.
ALLFEETR: Assign a violation for every syllable intervening between the right edge of each foot and the right edge of the word.
ALLFEETL: Assign a violation for every syllable intervening between the left edge of each foot and the left edge of the word.
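The definitions in (1) can be made concrete with a small violation counter. The following is a minimal sketch, not the authors' implementation. It assumes a hypothetical string encoding in which each syllable is a digit (0 = unstressed, 1 = secondary stress, 2 = primary stress) and parentheses mark feet, and it treats every syllable as light, as in a quantity-insensitive setting. The function names are illustrative only.

```python
def parse(candidate):
    """Return (stresses, feet): one stress digit per syllable, and the
    (start, end) syllable indices of each foot."""
    stresses, feet, start = [], [], None
    for ch in candidate:
        if ch == "(":
            start = len(stresses)
        elif ch == ")":
            feet.append((start, len(stresses) - 1))
        else:
            stresses.append(int(ch))
    return stresses, feet


def foot_violations(candidate):
    """Violation counts for the foot-based constraints in (1).

    Assumption: all syllables are light, so every monosyllabic foot
    violates FTBIN."""
    s, feet = parse(candidate)
    n = len(s)
    main = s.index(2)  # position of the primary stress
    footed = {i for a, b in feet for i in range(a, b + 1)}
    final_footed = (n - 1) in footed
    return {
        "FTBIN": sum(1 for a, b in feet if a == b),
        "PARSE": n - len(footed),
        # IAMBIC penalizes initially stressed (trochaic) disyllabic feet.
        "IAMBIC": sum(1 for a, b in feet if b == a + 1 and s[a] > 0),
        # TROCHAIC penalizes finally stressed (iambic) disyllabic feet.
        "TROCHAIC": sum(1 for a, b in feet if b == a + 1 and s[b] > 0),
        "NONFINALITY": int(final_footed) + int(final_footed and s[-1] > 0),
        "MAINR": n - 1 - main,
        "MAINL": main,
        "ALLFEETR": sum(n - 1 - b for a, b in feet),
        "ALLFEETL": sum(a for a, b in feet),
    }


print(foot_violations("(02)"))
print(foot_violations("0(2)"))
```

Under this encoding, the two parses of a disyllable with final main stress differ on FTBIN, PARSE, TROCHAIC and ALLFEETL, while agreeing on NONFINALITY and the two MAIN constraints.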
These constraints can generate most, but not all, of the patterns in Gordon's typology of quantity-insensitive stress. One notable pattern that they can capture is one in which primary stress falls on the final syllable and secondary stress falls on the initial syllable, even in disyllabic words, instantiated as Québec French in the typology. Gordon (2011) cites this pattern as problematic for a foot-based system. It may not be obvious at first how this constraint set can account for it: if the system is iambic, as the final stress would seem to indicate, why should the initial syllable of a disyllable be stressed, as in the correct (σ̀)(σ́), rather than parsed as the weak member of an iambic foot, leading to the incorrect *(σσ́)? An analysis is available with trochaic feet, since the correct form can be generated with MAINR preferring final main stress in (σ̀)(σ́) over a single trochaic foot, as in the incorrect *(σ́σ).
One language in Gordon's typology that cannot be generated by this constraint set is the Indonesian stress pattern as provided in Hayes & Wilson (2008); the description is attributed to Cohn (1989) in Gordon (2002). This pattern, like that of Garawa, which we discuss in detail in Section 5, involves what is termed an initial dactyl. As we discuss in that section, Indonesian escapes analysis in our constraint set because of the elimination of WORDFOOTL. Four other languages that our constraint set fails to capture are the ones that involve ternary iteration: Cayuvava, Ioway-Oto, Pacific Yupik, and Winnebago. These are also not generated by the Tesar and Smolensky constraints, and the expansions of the constraint set needed to capture them are non-trivial and controversial (see recently Martínez-Paricio & Kager (2015)). Finally, two of the languages in the Hayes & Wilson (2008) instantiation of the Gordon typology have multiple stress patterns for a given string of syllables. As we follow TS in studying only deterministic stress placements, we also omit those two, Estonian and Walmatjari. This leaves 26 languages in our test set.
We have two grid-based constraint sets. The first is adopted directly from Gordon (2002). The constraints are as follows, with definitions rephrased so as to correspond to ones we give for the foot-based constraints. In the grid-based representation assumed here, syllables are mapped to level 0 grid marks. Level 1 grid marks indicate secondary stress, and level 2 grid marks indicate primary stress.
(2) Grid-based constraint set (Gordon (2002))
ALIGN(X₁, L, 0, PrWd) (ALIGN1LPRWD): Assign a violation for every grid mark at level 0 intervening between each grid mark at level 1 and the left edge of the word.
ALIGN(X₁, R, 0, PrWd) (ALIGN1RPRWD): Assign a violation for every grid mark at level 0 intervening between each grid mark at level 1 and the right edge of the word.
ALIGN(EDGES, level 0, PrWd, X₁) (ALIGNEDGES): Assign a violation for every grid mark on the edges at level 0 that does not have a grid mark at level 1 (maximum 2 violations per word).
NONFINALITY: Assign one violation if the final syllable has a grid mark at level 1.
*CLASH: Assign a violation for every pair of consecutive syllables that have a level 1 grid mark.
*LAPSE: Assign a violation for every pair of consecutive syllables lacking a level 1 grid mark.
*EXTENDEDLAPSE: Assign a violation for every triplet of consecutive syllables lacking a level 1 grid mark.
*LAPSERIGHT: Assign one violation mark if there is more than one level 0 grid mark (≥ 2) intervening between the rightmost level 1 grid mark and the right edge of the word.
*LAPSELEFT: Assign one violation mark if there is more than one level 0 grid mark (≥ 2) intervening between the leftmost level 1 grid mark and the left edge of the word.
*EXTENDEDLAPSERIGHT: Assign one violation mark if there are more than two level 0 grid marks (≥ 3) intervening between the rightmost level 1 grid mark and the right edge of the word.
ALIGN(X₂, L, 1, PrWd) (ALIGN2LPRWD): Assign a violation for every grid mark at level 1 intervening between each grid mark at level 2 and the left edge of the word.
ALIGN(X₂, R, 1, PrWd) (ALIGN2RPRWD): Assign a violation for every grid mark at level 1 intervening between each grid mark at level 2 and the right edge of the word.
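The grid-based definitions in (2) can likewise be sketched as a violation counter over Overt forms. This is an illustrative sketch under stated assumptions rather than Gordon's or the authors' code: Overt forms are encoded as hypothetical digit strings (0 = unstressed, 1 = secondary, 2 = primary), and a syllable with primary stress is assumed to carry a level 1 grid mark as well as a level 2 mark.

```python
def grid_violations(overt):
    """Violation counts for the grid-based constraints in (2).

    Assumption: any stressed syllable (1 or 2) has a level 1 grid mark."""
    s = [int(c) for c in overt]
    n = len(s)
    marks = [i for i, x in enumerate(s) if x >= 1]  # level 1 grid marks
    main = s.index(2)                               # level 2 grid mark
    left_gap = marks[0]            # syllables before the leftmost stress
    right_gap = n - 1 - marks[-1]  # syllables after the rightmost stress
    pairs = list(zip(s, s[1:]))
    triples = list(zip(s, s[1:], s[2:]))
    return {
        "ALIGN1LPRWD": sum(marks),
        "ALIGN1RPRWD": sum(n - 1 - i for i in marks),
        "ALIGNEDGES": int(s[0] == 0) + int(s[-1] == 0),
        "NONFINALITY": int(s[-1] >= 1),
        "*CLASH": sum(1 for a, b in pairs if a >= 1 and b >= 1),
        "*LAPSE": sum(1 for a, b in pairs if a == 0 and b == 0),
        "*EXTENDEDLAPSE": sum(1 for t in triples if t == (0, 0, 0)),
        "*LAPSERIGHT": int(right_gap >= 2),
        "*LAPSELEFT": int(left_gap >= 2),
        "*EXTENDEDLAPSERIGHT": int(right_gap >= 3),
        "ALIGN2LPRWD": sum(1 for i in marks if i < main),
        "ALIGN2RPRWD": sum(1 for i in marks if i > main),
    }


# A seven-syllable form with initial main stress and an initial dactyl.
print(grid_violations("2001010"))
```

Because each Overt form corresponds to exactly one grid, no hidden structure is involved here: the function takes the stress string directly.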
Our second grid-based constraint set is the same as (2) except that the two ALIGN(X₂, L/R, 1, PrWd) constraints are revised as in (3).
(3) ALIGN(X₂, L, 0, PrWd) (ALIGN2LPRWDσ): Assign a violation for every grid mark at level 0 intervening between each grid mark at level 2 and the left edge of the word (i.e., just count the syllables).
ALIGN(X₂, R, 0, PrWd) (ALIGN2RPRWDσ): Assign a violation for every grid mark at level 0 intervening between each grid mark at level 2 and the right edge of the word (i.e., just count the syllables).
These constraints do not follow the general formulation for ALIGN constraints in Gordon (2002), which requires constraints on level 2 (main stress) to refer only to level 1 (secondary stress). Though formulated in terms of grids rather than feet, they assign violation marks in the same way as the MAINL/R constraints in the foot-based constraint set, by counting syllables rather than secondary stresses between the main stress and the word edge.
The learning problems we investigate are similar in structure to those set up by TS. The learner is provided with a constraint set and must find a ranking (or weighting) that generates the correct form for each tableau. When there is hidden structure, correctness is defined in terms of what TS call the 'Overt' form. For the TS stress problems, and the ones we investigate here, the Overt form is a string of syllables with stress designations. In Table 1, we illustrate how the learning problem is set up for both a grid- and a foot-based learner. The 'LD' column (for Learning Data) indicates which of the Overt forms is correct; the example here specifies that the correct form has final main stress and no secondary stress. In the foot-based table on the right, there are two prosodifications that correspond to the correct Overt form, shown in the 'Full Rep' (Full Representation) column. For learning to be successful in TS, one of these full representations must be made optimal. In our probabilistic approach, we follow PP in requiring the sum of the probabilities of the correct full representations to be greater than 0.9.
For conciseness, we omit the grid-based full representations, since they map one-to-one to the Overt representations. The tables also provide an example of a constraint assigning a violation count to each candidate: *CLASH for the grid-based constraint set and FTBIN for the foot-based one.
Table 1: Left: Example of a two-syllable word for learning with grid-based constraints. Right: Example of a two-syllable word for learning with foot-based constraints.
The learning data for each language is a set of strings from 1 to 7 syllables in length, marked for degree of stress (primary, secondary, or none). These are the data used in Hayes and Wilson's (2008) MaxEnt phonotactic learning simulations for the Gordon (2002) typology, provided by Hayes (p.c.). Our approach differs from Hayes and Wilson's in using predefined, typologically grounded constraints, in allowing for hidden structure, and in mapping from inputs to outputs. For each of the seven word lengths, there is a tableau with competing candidate stress patterns, like those in Table 1. The Overt forms include every possible stress pattern in which one and only one syllable bears primary stress. For the foot-based candidate sets, every compatible prosodification with maximally disyllabic feet is provided for each Overt form.
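The construction of the candidate sets can be sketched as follows. This is a hedged illustration, not the procedure used to build the actual tableaux: it uses a hypothetical digit encoding (0 = unstressed, 1 = secondary, 2 = primary) and assumes that every stressed syllable heads its own foot, so that an unstressed syllable is either unparsed or the weak member of a disyllabic foot. Whether the authors' candidate sets include other structures (for example, stressless feet) is not stated in the text.

```python
from itertools import product


def overt_forms(n):
    """All stress strings of length n with exactly one primary stress."""
    return ["".join(p) for p in product("012", repeat=n) if p.count("2") == 1]


def foot_parses(overt):
    """Prosodifications of an Overt form with feet of at most two syllables.

    Assumption: each stressed syllable is the head of exactly one foot."""
    def expand(i):
        if i == len(overt):
            return [""]
        cur = overt[i]
        nxt = overt[i + 1] if i + 1 < len(overt) else None
        out = []
        if cur == "0":
            # Leave the unstressed syllable unparsed ...
            out += [cur + rest for rest in expand(i + 1)]
            # ... or make it the weak member of an iamb.
            if nxt is not None and nxt != "0":
                out += ["(" + cur + nxt + ")" + rest for rest in expand(i + 2)]
        else:
            # A stressed syllable forms a monosyllabic foot ...
            out += ["(" + cur + ")" + rest for rest in expand(i + 1)]
            # ... or heads a trochee with a following unstressed syllable.
            if nxt == "0":
                out += ["(" + cur + nxt + ")" + rest for rest in expand(i + 2)]
        return out
    return expand(0)


print(overt_forms(2))     # Overt candidates for a two-syllable word
print(foot_parses("02"))  # parses of final main stress, no secondary stress
```

For a disyllable with final main stress, this yields the two full representations mentioned in the discussion of Table 1: one with a single iambic foot and one with an unparsed initial syllable.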

Learning algorithm and test procedure
Like PP, we adopt Maximum Entropy Grammar, a probabilistic version of Prince & Smolensky's (1993/2004) Optimality Theory proposed by Goldwater & Johnson (2003). Constraints are given numerical weights rather than ranks, and the probability of a candidate is proportional to the exponentiated weighted sum of constraint violations. We also adopt PP's approach to hidden structure, which uses a version of Expectation Maximization first applied by Pater et al. (2012) to other hidden structure problems.
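The probability model just described can be written out as follows. The notation is ours and is only one common way of stating MaxEnt grammar: we assume that violation counts C_k(y) are non-negative and that weights w_k are non-negative, so the weighted sum enters the exponent with a negative sign (with violations recorded as negative numbers, the sign would be dropped). The second line states the assumption that the probability of an Overt form o is the sum over the full representations y compatible with it.

```latex
\begin{align*}
  P(y \mid x) &= \frac{\exp\bigl(-\sum_k w_k\, C_k(y)\bigr)}
                      {\sum_{z \in \mathrm{Cand}(x)} \exp\bigl(-\sum_k w_k\, C_k(z)\bigr)} \\
  P(o \mid x) &= \sum_{y \,:\, \mathrm{overt}(y) = o} P(y \mid x)
\end{align*}
```

On this formulation, the success criterion used here amounts to requiring that P(o | x) for the correct Overt form exceed the threshold in every tableau.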
Our learner differs from the PP one in that we use Gradient Descent to update constraint weights, rather than the L-BFGS-B optimization algorithm. In pilot work, L-BFGS-B was found to outperform Gradient Descent in learning the TS languages (Brandon Prickett, p.c.). Gradient Descent has the advantage of providing smoother, more interpretable learning trajectories. We use the batch version of Gradient Descent described in Pater & Staubs (2013) and Moreton et al. (2017), in which each update is over the entire dataset.
We used the same learning procedure for the three constraint sets. For each of the 26 languages, we did one run with the constraint weights initialized at 1, and 10 runs with each constraint weight sampled with replacement from a uniform distribution over 0-10. We track learning in terms of the number of updates, or epochs, that it takes the learner to reach our criterion for having a correct grammar, which, as noted above, is that the correct stress pattern in every tableau must be given at least 0.90 probability. In cases of hidden structure, we sum over the prosodifications that yield the correct stress pattern. For example, in the foot-based learning problem in Table 1, we check whether the probabilities assigned to '(σσ́)' and 'σ(σ́)', which are both compatible with the winner 'σσ́', sum to at least 0.90. Weights were kept non-negative by replacing any negative weight with zero (see Magri (2015)). The learning rate was set to 4 (a value found to work well in pilot work on the TS languages), and there was a maximum of 1000 epochs.
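The procedure can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the implementation used for the reported simulations: it uses non-negative violation counts, takes the gradient of the log-likelihood of the observed Overt forms (expected violations over all candidates minus expected violations over the candidates compatible with the observed form), sums that gradient over tableaux without averaging, and clips negative weights to zero. The function names and the toy tableau are hypothetical.

```python
import math


def probs(weights, rows):
    """MaxEnt probabilities; each row holds one candidate's violations."""
    scores = [-sum(w * v for w, v in zip(weights, row)) for row in rows]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def learn(tableaux, weights, rate=4.0, max_epochs=1000, criterion=0.9):
    """Batch gradient learning with hidden structure.

    Each tableau is (rows, correct): the violation rows of all full
    representations, and the indices of those that match the observed
    Overt form. Returns (updates made, final weights, success flag)."""
    weights = list(weights)
    for epoch in range(max_epochs + 1):
        grad = [0.0] * len(weights)
        done = True
        for rows, correct in tableaux:
            p = probs(weights, rows)
            p_correct = sum(p[i] for i in correct)
            done = done and p_correct >= criterion
            for k in range(len(weights)):
                exp_all = sum(p[i] * rows[i][k] for i in range(len(rows)))
                exp_obs = sum(p[i] * rows[i][k] for i in correct) / p_correct
                grad[k] += exp_all - exp_obs
        if done or epoch == max_epochs:
            return epoch, weights, done
        # One update over the whole data set; negative weights become zero.
        weights = [max(0.0, w + rate * g) for w, g in zip(weights, grad)]


# Toy tableau with two constraints (MAINL, MAINR) and four parses of a
# disyllable; the first two parses have the observed final main stress.
rows = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(learn([(rows, [0, 1])], [1.0, 1.0]))
```

In this toy problem a single update raises the weight of MAINR and zeroes out MAINL, which is enough to put more than 0.90 of the probability on the two correct parses. For the random-initialization runs, the starting weights could be drawn with `random.uniform(0, 10)`, again as an assumption about one reasonable way to realize the sampling described above.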

Overall results
When the weights were initialized at 1, the learner equipped with the foot-based constraint set (henceforth FOOT) found a correct set of weights for all 26 languages. For the 10 random initializations, the learner found a correct grammar except in a minority of runs for the three languages in (4). The number of failed runs is shown in parentheses. We discuss the failed runs for Garawa in the next section.
(4) Languages with failed runs for the FOOT learner
Garawa (3), Georgian (2), Southern Paiute (1)
PP classified TS languages as 'hard' if their learner failed with initialization at 1, or if it failed on all 10 random initializations. By that standard, none of the languages in the current study is hard. This success rate is particularly remarkable since we used Gradient Descent rather than L-BFGS-B.
As expected, our learner equipped with Gordon's grid-based constraints (henceforth GRID) succeeded on every run. There was similar uniform success for the version of the grid-based learner whose main stress constraint counts syllables rather than secondary stresses between the main stress and the word edge (henceforth GRID-MAIN-REV). This shows that the reformulation of that constraint does not change the constraint set's ability to capture the typology (it also succeeds on Indonesian and the ternary languages that were left out of our test set).
Table 2: Epochs to criterion with initial weights = 1
Table 2 provides summary statistics for the number of epochs required to reach the correctness criterion with initialization at 1. The learner using Gordon's constraint set (GRID in Table 2) takes on average many more epochs to find a correct analysis than does the foot-based learner. This cannot be taken as a general finding about learning with feet and grids, however, as shown by the rather dramatic drop in the number of epochs to criterion for GRID-MAIN-REV. This learner's mean and median number of epochs to criterion are in fact slightly lower than those of FOOT, though its maximum is nearly twice as high.
Table 3 shows that this general pattern holds up with random initialization (FOOT runs include only the successful ones). The Min and Max rows show the mean number of epochs to criterion for a specified language over ten runs. The difference in the maximum between GRID-MAIN-REV and FOOT is even greater here, since the FOOT maximum drops to 94 from 143.
There are likely two reasons that GRID-MAIN-REV is so much faster than GRID. First, there is the simple fact that a count of intervening syllables will usually be greater, and will never be smaller, than a count of intervening secondary stresses, which will result in a generally larger update of the GRID-MAIN-REV constraint. Second, as Gaja Jarosz (p.c.) points out, the extent to which the weight of the GRID main stress constraint is changed depends on how probable candidates with intervening secondary stress are; this does not affect the GRID-MAIN-REV constraint.
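The first of these points can be checked mechanically. The sketch below, which uses a hypothetical digit encoding of Overt forms (0 = unstressed, 1 = secondary, 2 = primary) and an illustrative function name, computes the right-edge main stress violations under both formulations and confirms that the syllable count is never smaller than the secondary stress count for any form of 1 to 7 syllables.

```python
from itertools import product


def align2_right(overt, level):
    """Marks between the primary stress and the right edge of the word.

    level=1 counts intervening stressed syllables (as in GRID);
    level=0 counts intervening syllables (as in GRID-MAIN-REV)."""
    main = overt.index("2")
    tail = overt[main + 1:]
    if level == 0:
        return len(tail)
    return sum(1 for s in tail if s != "0")


# Exhaustive check over all Overt forms with exactly one primary stress.
for n in range(1, 8):
    for p in product("012", repeat=n):
        if p.count("2") == 1:
            overt = "".join(p)
            assert align2_right(overt, 0) >= align2_right(overt, 1)

print(align2_right("2001010", 1), align2_right("2001010", 0))
```

For a seven-syllable form with initial main stress and two secondary stresses, the level 1 formulation assigns 2 violations while the level 0 formulation assigns 6, so the same difference in candidate probabilities translates into a larger weight change for the syllable-counting constraint.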

In sum, we found that the foot-based learner and the grid-based learner generally took a similar number of epochs to find a correct grammar when the main stress constraint was made to apply in the same way in the two systems. In terms of real time, though, the grid-based learner is considerably quicker. When we ran the learner implemented in Python (https://github.com/blprickett/Hidden-Structure-MaxEnt.git) on Google Colab, GRID-type learning took approximately 45 seconds to go through 1000 epochs while FOOT learning took 187 seconds, roughly four times as long. This is due to the greater complexity of calculations for the foot-based system brought about by the greater number of candidates. The sizes of the candidate sets are compared in Table 4.
Ultimately, rather than overall efficiency, the most useful comparison amongst the learners may well focus on the relative speed at which different languages are learned by each one. The graphs in Figure 1 provide an initial comparison, which shows that different constraint sets often have different relative speeds of learning the languages. For each constraint set, the number of epochs to criterion for each language (with initialization at 1) was log transformed and normalized. The graphs show the three pairwise comparisons amongst our three constraint sets. Each dot represents a language, and dots close to the line are ones whose relative difficulty is similar across the pair of constraint sets. We have labeled some of the languages. Atayal and Chitimacha have final and initial main stress respectively, with no secondary stress, and both are learned in 1 epoch with all three constraint sets. The comparison of the grid-based constraint sets in the rightmost graph has the dots closest to the line, indicating that the relative difficulty of the languages is closest for them. Lakota has peninitial main stress and no secondary stress, and both grid-based learners find it relatively difficult. That relative difficulty is much higher than for the foot-based learner, as shown by the position of Lakota in the leftmost and center graphs. Malak Malak, which has stress on even-numbered syllables counting from the right, and primary stress on the leftmost of those, is an example of a language with the reverse distinction, with relative difficulty being higher for the foot-based learner.
Arriving at a better understanding of the biases of the learners, and comparing them to typological frequencies or human learning data, is a clear potential direction for future research. Some initial comparison of grid- and foot-based learners along these lines can be found in Staubs (2014b) and Staubs (2014a). The difficulty that a foot-based learner has with the type of language illustrated by Malak Malak is the focus of Staubs' work, and it is interesting in that context that Malak Malak was relatively easy for the revised grid constraint set.
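The log transformation and normalization behind the relative-difficulty comparison can be sketched as below. The text does not say which normalization was used; this sketch assumes min-max scaling to the interval [0, 1], and the language labels and epoch counts in the example are hypothetical placeholders, not results from the simulations.

```python
import math


def normalized_log_epochs(epochs):
    """Log-transform epoch counts and rescale them to [0, 1].

    Assumption: min-max scaling; other normalizations (e.g. z-scores)
    would serve the same comparative purpose."""
    logs = {lang: math.log(e) for lang, e in epochs.items()}
    lo, hi = min(logs.values()), max(logs.values())
    if hi == lo:  # all languages equally fast: no spread to normalize
        return {lang: 0.0 for lang in logs}
    return {lang: (v - lo) / (hi - lo) for lang, v in logs.items()}


# Hypothetical epoch counts for three unnamed languages.
print(normalized_log_epochs({"lang_a": 1, "lang_b": 10, "lang_c": 100}))
```

With min-max scaling the base of the logarithm does not matter, and a language learned in a single epoch maps to 0, which makes the relative positions of languages comparable across constraint sets with very different absolute epoch counts.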

Results on Garawa
The language that FOOT failed on most often was Garawa, in 3 out of 10 random initializations. Interestingly, this language displays a type of ambiguity that is somewhat similar to that of the unattested and difficult-to-learn languages that are the focus of the PP study that we build on here. In Table 5, the stress patterns that are the target of learning are shown in the 'Observed' column. An analysis that uses only trochaic feet is shown in 'Trochaic Analysis'. The 'Mixed Analysis' uses a combination of iambic and trochaic feet. The words of 2 and 6 syllables in length are shown with two prosodifications. The probabilities shown in 'Mixed Probabilities' are the ones that are assigned to these outcomes by the grammar learned with one of the random initializations. In the Mixed Analysis, the final syllable is unparsed because NONFINALITY has a relatively high weight. It is this ambiguity between a final stressless syllable being unparsed or being parsed as the dependent of a trochaic foot that leads to failures of learning in the unattested languages discussed by PP, and as we will see, to the failed runs for Garawa here.

Table 5: Garawa learning data and analyses (columns: Number of σ, Observed, Trochaic Analysis, Mixed Analysis, Mixed Probabilities)
Table 6 shows the constraint weights for the trochaic analysis that was found by the learner with constraint weights initialized at 1. As mentioned above, Garawa is an example of what is called an initial dactyl language: in strings of 5 and 7 syllables in length, there is a two-syllable lapse separating the initial stress from the next one. The fixed stress on the initial syllable is due to the relatively high weight of MAINL. In particular, the correct (σσ)σ(σσ)(σσ) is preferred over *σ(σσ)(σσ)(σσ) because the weight of MAINL is greater than that of ALLFEETR. This ability of the syllable-counting MAINL/R constraints to pin the main stress at a fixed distance from an edge is the reason that Kager (2005) was able to eliminate constraints from McCarthy & Prince (1993) that align a single foot to the edge of the word, termed WORDFOOT by TS. Most of the time, in languages with alternating stress oriented to one edge and a fixed stress at the other, the fixed stress is the main stress. Indonesian, which has an initial dactyl and main stress on the rightmost of the stresses, is an exception. It is worth noting that the empirical facts for Indonesian stress are unclear; the initial dactyl pattern described by Cohn (1989) has not been instrumentally verified (see Goedemans & van Zanten (2007) and Athanasopoulou et al. (2021)).
Table 7 shows the initial and final weights for the mixed analysis from Table 5. As can be seen in the 'Initial weights' column, NONFINALITY started out with the highest weight of the constraints; IAMBIC was lower, but above TROCHAIC. This learner got to the correct analysis very quickly: after only 3 epochs.
Table 7: Weights for mixed analysis found with random initialization; reached criterion in 3 epochs
Table 8 shows a failed run, typical of all three failed runs.
Here again NONFINALITY is helping to keep stress off the final syllable, allowing iambs to be used as the non-initial feet in the highest-probability parses for the 5- and 7-syllable words, just as they are in the correct analysis. Those words do not reach the criterion of 0.90 correct because too much probability is being given to non-initial trochaic feet; the weight of IAMBIC is not sufficiently greater than that of TROCHAIC. In the 6-syllable word, the problem is reversed: the medial iamb places stress in the wrong position. The trap seems to be caused by high-weighted FTBIN: the correct analysis in Table 7 uses monosyllabic feet in the 6-syllable word. The failed runs all had initial weights with IAMBIC higher than TROCHAIC, and differed from the successful run in Table 7
Foot-based MaxEnt learning was remarkably successful in this initial investigation of the learning of a typologically based test set of languages. With initialization of the constraints at 1, a correct grammar was found for all of the 26 languages of the Gordon (2002) typology that the constraint set can represent, and the learner rarely failed to find a correct grammar when weights were randomly initialized. Foot-based learning was also as efficient as grid-based learning, at least when speed was measured in terms of the number of epochs to criterion. This leads us to tentatively conclude that the avoidance of hidden structure in the grid-based approach does not provide a learning argument for it, pace Gordon (2011).
Firmer conclusions await an expansion of the typology being covered by the constraint sets. As discussed above, the foot-based constraint set that we adopted cannot cover the entirety of the Gordon typology, and neither of these constraint sets is designed to cover the full range of typologically attested patterns, which also includes quantity-sensitive ones. In studying quantity-sensitive patterns, there is a further hidden structure problem of identifying which of the syllables are 'heavy' and which are 'light', which TS and others abstract away from. In building up to a fuller typology, there are also other learning options to explore, including Expectation Driven Learning as presented in Jarosz (2015) and Nazarov & Jarosz (2021), and theories that do not assume a prespecified constraint set, such as the MaxEnt constraint induction model of Hayes & Wilson (2008), and the neural network approach of . Finally, going beyond success and overall speed of learning, there is a need for further study of the biases of each set of grammar and learning assumptions, in terms of the relative ease of different language types, and how the learning space is navigated on the way to success.