A dynamic neural field model of leaky prosody: proof of concept

Recent work has shown that lexical items come to take on the phonetic characteristics of the prosodic environments in which they are typically produced, a phenomenon referred to as "leaky prosody". Focusing on pitch patterns in Mandarin, we show that leaky prosody can be derived from a flat (i.e., non-transformational, non-optimizing) model of speech production. In our model, formalized using Dynamic Field Theory, lexical, phonological, and prosodic inputs each exert forces on a Dynamic Neural Field representing pitch. Notably, the forces exerted by these inputs reflect surface distributions in a large corpus of spontaneous speech. Our simulations showed that the flat model derives the short timescale effect of prosodic prominence on pitch production as well as the longer timescale effect of leaky prosody. Because lexical items are updated based on surface phonetic form, words that are consistently produced in high/low prosodic prominence positions take on the phonetic characteristics of those environments.


Introduction
Sound patterns in human language are often insightfully described in terms of discrete symbolic units. Accordingly, many of our formal approaches privilege this level of description. However, some phenomena are more challenging for purely discrete models or just fall outside the scope of what such models can explain. These include patterns of incomplete neutralization (Port & Crawford, 1989; Warner, Jongman, Sereno, & Kemps, 2004), gradual sound change (Chen & Wang, 1975), and sub-phonemic changes in representations over the lifespan (Harrington, Palethorpe, & Watson, 2000; MacKenzie, 2017). Accordingly, formal approaches to these types of patterns have proposed or adopted some sort of continuous substrate (e.g., Braver, 2019; Bybee, 2002; Pierrehumbert, 2001; Roettger, Winter, Grawunder, Kirby, & Grice, 2014). This paper explores the potential of Dynamic Neural Fields (Schöner & Spencer, 2016) for providing an appropriate substrate to integrate discrete and continuous aspects of sound patterns.
To illustrate the approach, we focus on the empirical issue of "leaky prosody". Recent work has shown that lexical items come to take on the phonetic characteristics of the prosodic environments in which they are typically produced (Seyfarth, 2014; Sóskuthy & Hay, 2017; Tang & Shaw, 2021). Prosodic context often influences the duration, intensity, and pitch with which a word is realized. These phonetic characteristics of prosodic environments can be lexicalized in words that show a distributional skew to a particular type of prosodic environment. For example, in Mandarin Chinese, words that tend to attract a high degree of prosodic prominence are produced with a relatively high pitch (also greater intensity and longer duration), even in prosodically weak environments (Tang & Shaw, 2021). Similarly, words that tend to occur at phrasal boundaries, an environment that lengthens words, end up being longer in duration even in other positions, a result illustrated for New Zealand English (Sóskuthy & Hay, 2017). These effects are lexically specific and sub-phonemic synchronically but may provide the seeds for diachronic change which can be characterized in categorical terms, as in the loss of segments in frequent or informative words (Cohen-Priva, 2017; Cohen Priva, 2015; Piantadosi, Tily, & Gibson, 2011; Zipf, 1949) or the emergence of tone from phrasal prominence (Bang, Sonderegger, Kang, Clayards, & Yoon, 2018; Kang & Han, 2013).
In order to account for the leaky prosody facts in Mandarin, Tang & Shaw (2021) adopt a phonetically detailed lexicon, as in Exemplar Theory (Pierrehumbert, 2002), prosodic modulation based on language redundancy (Aylett & Turk, 2004; Turk & Shattuck-Hufnagel, 2020), and a feedback mechanism from phonetic output to lexical representation (e.g., Wedel, 2007). A schematic depiction of the proposal is provided in Figure 1. This is a transformational architecture in that a phonetically detailed representation of a word, stored in the lexicon, is modulated according to prosodic context, including local predictability, to yield a contextually appropriate phonetic target. The phonetic target then influences the long-term lexical representation through feedback. Lexical representations are updated by experiences with a word, possibly with some compensation for the effects of prosody, which may be incomplete (Kuzla & Ernestus, 2011; Kuzla, Ernestus, & Mitterer, 2010). The feedback mechanism, whereby context-specific phonetics (even with partial compensation for prosody) update the lexicon, offers a possible account of the leaky prosody facts.
In this paper, we propose an alternative architecture. Schematized in Figure 2, our alternative is a non-transformational flat model. Rather than having lexical targets transformed according to prosody, we let three forces, a phonetically detailed lexicon, phonological categories, and prosody, jointly influence the phonetic target. The feedback mechanism, whereby the lexical entry is updated based upon how words are produced in context, is retained in the flat model and remains a key part of the explanation for leaky prosody. One of the key advantages of the flat model comes from learnability, as each input to phonetic planning can be acquired through surface-based distributional learning, a point we demonstrate in this paper. This contrasts with the transformational approach, which treats speech production as a complex, and hitherto unsolved, optimization problem (Turk & Shattuck-Hufnagel, 2020).
The remainder of this paper is organized as follows. In Section 2, we introduce Dynamic Field Theory and provide an overview of the model architecture, as situated within this framework. Section 3 provides the formal details of the model. Section 4 presents simulations. We show how pitch planning evolves on a relatively short timescale in planning a single pitch target, and how feedback drives change in lexical representation over a longer timescale. Section 5 discusses some of the parameters that entered into the model, limitations, and directions for future research.

2 Dynamic Field Theory as a formal framework for the flat model

2.1 Background
We developed our flat model architecture within the framework of Dynamic Field Theory (Schöner & Spencer, 2016). In this framework, cognitive representations are continuous parameters governed by populations of neurons. In this paper, the continuous parameter of interest is pitch. Populations of neurons sensitive to linguistically relevant pitch modulation have been localized in left Superior Temporal Gyrus, near other phonetic feature representations (Mesgarani, Cheung, Johnson, & Chang, 2014; Yi, Leonard, & Chang, 2019). In DFT, the distribution of activation across a neural population is represented by a dynamic neural field (DNF). Within our pitch DNF, each field location represents a population of neurons sensitive to a particular pitch value. Activation at each field location evolves over time under the influence of inputs until the DNF stabilizes. A stable activation peak at some location in the field serves as the target for movement.
Stabilization of a pitch DNF over time is illustrated in Figure 3. The z-axis (vertical) represents activation; the x-axis represents the pitch field, where each neuron in the field is selectively tuned to a particular pitch value; the y-axis represents time. In this example, which shows 60 time steps, an activation peak stabilizes at 241 Hz, indicating a pitch target of this frequency.
(Stern, Chaturvedi, & Shaw, 2022; Stern & Shaw, 2022). Two recent examples come from DFT models of phonetic trace effects in speech errors (Stern et al., 2022) and contrastive hyperarticulation (Stern & Shaw, 2022). The phonetic trace effect arises when sounds that are categorically mis-produced in speech errors, e.g., [p] in place of [b], still retain some gradient influence of the intended phoneme. For example, the voice onset time (VOT) of [p] produced in error (when [b] was intended) is slightly shorter (closer to [b]) than the VOT of non-errorful [p]. Such sub-phonemic differences in VOT, found in both lab-induced errors from tongue twisters (Goldrick & Blumstein, 2006) and naturally occurring speech errors (Alderete, Baese-Berk, Leung, & Goldrick, 2021), have been modeled as multiple inputs to a DNF representing VOT. Under selection dynamics, strong input from a voiceless stop (long VOT) and weaker input from a voiced stop (short VOT) stabilize in a location that is slightly shifted towards the voiced stop, deriving the magnitude of empirically observed trace effects (Stern et al., 2022). A similar account derives contrastive hyperarticulation, the tendency for words with minimal pairs to be hyperarticulated away from minimal pair competitors. Like the phonetic trace effect, contrastive hyperarticulation occurs in both experimental settings (Baese-Berk & Goldrick, 2009) and spontaneous speech (Wedel, Nelson, & Sharp, 2018). The DFT account involves minimal pair competitors projecting inhibitory input into the field, which drives the location of stabilization away from the target (Stern & Shaw, 2022).
A second useful property of DFT is that it represents cognition, in this case speech production planning, as a time-varying process. This is particularly useful for leaky prosody because our account, at the conceptual level, involves multiple timescales. On a short timescale (the relatively fast process of speech production planning for a single pitch target), prosodic context influences production. On a longer timescale (the relatively slow process of lexical consolidation), the aggregate influences of prosody alter the long-term representation of words. Each of these timescales has been modelled within DFT. For example, Roon and Gafos (2016) and Harper (2021) develop DFT models of the millisecond timescale of single consonant production, while Gafos and Kirov (2010) capture gradual shift in phonological representations at the longer timescale (see also Tilsen, 2019). Modelling speech production as a cognitive process that unfolds in time distinguishes DFT from stochastic generative models (e.g., Shaw & Gafos, 2015; Shaw & Kawahara, 2018), Exemplar Theories (e.g., Pierrehumbert, 2001), and agent-based models (e.g., Harrington & Schiel, 2017), which treat speech production as a timeless process of statistical sampling.

Model overview
For the purposes of the simulations reported here, the time constant, τ, is held constant at 20. The noise term, ξ, is Gaussian-distributed, with strength q. For the simulations here, q is held constant at 1. When the terms on the right side of the equation sum to zero, activation across the field is stable, i.e., there is no change. In the absence of inputs to the field, activation will converge on h. For all of the simulations reported in this paper, we set h to -5. When inputs to the field at some location are greater than 5, the resting activation will be offset and activation will rise above zero. When this happens, the interaction kernel kicks in, functioning to stabilize a peak in activation. We discuss the formal mechanism of this function in greater detail below, after elaborating on the inputs.
(1) Equation governing change in activation at each field location over time
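
For reference, the description of (1) in the text matches the standard field equation of Dynamic Field Theory (Schöner & Spencer, 2016). The form below is a sketch under that assumption, not a confirmed copy of the equation used in the simulations: u(x, t) is activation at field location x, τ the time constant, h the resting level, s(x, t) the summed inputs, k the interaction kernel, g the sigmoidal gate, and ξ Gaussian noise of strength q.

```latex
\tau \, \dot{u}(x,t) = -u(x,t) + h + s(x,t)
  + \int k(x - x') \, g\big(u(x',t)\big) \, dx'
  + q \, \xi(x,t)
```

With no input, no supra-threshold activation, and no noise, the right side reduces to -u + h, so activation relaxes to h, as stated in the text.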

Inputs to the field
The equation for inputs to the field is given in (2). Inputs take the form of Gaussian distributions with three parameters: (i) the place, p, in the field where the distribution is centered, i.e., the mean; (ii) the width, w, of the distribution, i.e., the standard deviation; and (iii) the amplitude, a, of the distribution. To visualize the shape of the inputs, the equation is plotted below with a = 6. An input with this amplitude would drive the field above zero, causing stabilization, given our resting activation level of h = -5.
(2) Equation for inputs to the DNF
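
A minimal sketch of a Gaussian input with place p, width w, and amplitude a, as described for (2). The unnormalized form (peak height equal to a) is an assumption based on the statement that a = 6 is enough to overcome h = -5; the pitch axis of 0 to 499 Hz and the function name `gaussian_input` are hypothetical.

```python
import numpy as np

def gaussian_input(x, p, w, a):
    """Gaussian input centered at p with standard deviation w and peak height a."""
    return a * np.exp(-((x - p) ** 2) / (2.0 * w ** 2))

# Hypothetical pitch axis (Hz); the field range used in the paper is not stated here.
x = np.arange(0.0, 500.0)

# Phonological input values reported in the text: mean 238 Hz, SD 94 Hz, amplitude 6.
s_phon = gaussian_input(x, p=238.0, w=94.0, a=6.0)

h = -5.0  # resting level from the text
supra = x[s_phon + h > 0]  # locations where this input alone lifts activation above zero
print(supra.min(), supra.max())
```

Because the peak of the input equals a, any single input with a = 6 exceeds the resting level of -5 around its center.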

Selection dynamics
The equation for the interaction kernel is given in (3). There are three main components to the interaction kernel: (i) a local excitation component, which has a parameter for strength, c_exc, and scope, σ_exc, and dictates the spread of Gaussian-shaped excitation; (ii) a local inhibition component, which has corresponding parameters, c_inh and σ_inh; and (iii) a global inhibition component, which covers the entire field with uniform strength c_glob. We have set the values of these parameters to ensure selection dynamics. That is, the interaction kernel will promote local activation and inhibit more global activation. The key to deriving selection dynamics from the interaction kernel is to make local excitation stronger and narrower than local inhibition and stronger than global inhibition. This pattern of inequalities is exemplified in the table in (3). Plotting the interaction kernel with the values in the table gives the shape shown below (right), which Schöner and Spencer (2016) refer to as a "Mexican Hat".
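
The shape of such a kernel can be sketched as follows. This assumes the common DFT formulation with normalized Gaussians for local excitation and inhibition minus a constant for global inhibition; the parameter values are hypothetical and chosen only to satisfy the inequalities described above, not taken from the table in (3).

```python
import numpy as np

def interaction_kernel(d, c_exc, sigma_exc, c_inh, sigma_inh, c_glob):
    """Kernel value at distance d: local excitation - local inhibition - global inhibition."""
    exc = c_exc / (np.sqrt(2 * np.pi) * sigma_exc) * np.exp(-d ** 2 / (2 * sigma_exc ** 2))
    inh = c_inh / (np.sqrt(2 * np.pi) * sigma_inh) * np.exp(-d ** 2 / (2 * sigma_inh ** 2))
    return exc - inh - c_glob

# Hypothetical values: excitation stronger and narrower than local inhibition,
# and stronger than global inhibition.
params = dict(c_exc=15.0, sigma_exc=5.0, c_inh=5.0, sigma_inh=12.5, c_glob=0.05)

d = np.arange(-100.0, 101.0)  # distances between field locations
k = interaction_kernel(d, **params)

center = k[d == 0][0]    # crown of the hat: net excitation
brim = k[d == 15][0]     # surround: net inhibition
far = k[d == 100][0]     # far field: approximately -c_glob
print(center, brim, far)
```

Nearby locations reinforce each other, while locations at intermediate and large distances are suppressed, which is what lets a single peak win out.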

Sigmoidal gate
The interaction kernel is gated by g(u), a sigmoidal function, which is shown in (4). The gate prevents the interaction kernel from exerting much influence on field dynamics until activation, at some location in the field, crosses zero. When that happens, g(u) switches abruptly from zero to 1, essentially turning on the interaction kernel, which, given the dynamics in (3), functions to create a stable peak at that location in the field. The gate has one parameter, β, which was set to 4 for the simulations below.
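
As an illustration, a logistic gate with steepness 4 behaves as described. The logistic form and the names `gate` and `beta` are assumptions based on the standard DFT formulation; the equation in (4) may be written differently.

```python
import numpy as np

def gate(u, beta=4.0):
    """Sigmoidal gate: near 0 for negative activation, near 1 for positive activation."""
    return 1.0 / (1.0 + np.exp(-beta * u))

at_rest = gate(-5.0)      # field at resting level h = -5: gate effectively closed
at_threshold = gate(0.0)  # activation exactly at zero: gate half open
above = gate(5.0)         # activation well above zero: gate effectively open
print(at_rest, at_threshold, above)
```

At the resting level the gate output is negligible, so the interaction kernel contributes almost nothing until an input pushes activation across zero.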

Simulations
Using the model specified in the preceding section, we ran two types of simulations. The first simulated the effect of different prosodic positions on pitch. This serves to establish the validity of the flat model in deriving the effects of prosodic context on pitch targets. The second simulated lexical learning as a time-varying (longer timescale) process. Here, we sought to derive the leaky prosody facts from lexical updating. We focus on the high tone (T1) of Mandarin. Both simulations made use of the COSIVINA toolbox (Schneegans, 2021).

Input parameters
In keeping with the potential learnability advantage of a flat model, we set all input parameters to values extracted from a large corpus of spontaneous speech. For this purpose, we used the Tang & Shaw (2021) corpus of 1,655 Mandarin speakers. The input parameters for Sphon, the phonological input to the field, were based on the distribution of high tone pitch values found in the corpus. Across the ~41,000 instances of high tones, the average maximum pitch value was 238 Hz and the standard deviation was 94 Hz. To initialize the starting distribution of a lexical item for the Slex input, we sampled 1/500th of the total number of high tones (N = ~86) in the corpus and calculated the mean (241 Hz) and standard deviation (99 Hz) of the sample. For the Spros inputs, we divided all of the words in the corpus (~400,000) into 24 equally spaced bins based upon their local bigram predictability. The assumption, also taken up in Tang & Shaw (2021), is that bigram predictability is directly related to prosodic prominence. This is admittedly a very coarse-grained index of prosodic structure. However, even when more sophisticated linguistic factors are included in the analysis of prominence, it still seems that local predictability (as well as informativity) plays a reliable role in prominence (e.g., Anttila, Dozat, Galbraith, & Shapiro, 2020). Of the 24 equally spaced predictability bins, we chose two bins (the 4th and 12th), each with ~10,000 tokens, to represent high prominence (low predictability) and low prominence (high predictability) field inputs. Our high prominence bin had a mean of 233 Hz (SD = 100) and our low prominence bin had a mean of 226 Hz (SD = 92). We set the amplitude of all of the inputs to be 6, high enough to individually overcome the resting activation, h = -5.
To better visualize the effect of prominence, Figure 6 shows activation across the entire field at the last timestep in the simulations. The difference in activation peak, ~7 Hz, is on the order of magnitude reported in the literature (Tang & Shaw, 2021).
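
The binning step described under Input parameters can be sketched as follows. This is an illustration under stated assumptions, not the authors' corpus pipeline: the function name `bin_pitch_stats`, the array arguments, and the use of the sample standard deviation are hypothetical, and the demonstration arrays are placeholder values rather than corpus data.

```python
import numpy as np

def bin_pitch_stats(predictability, pitch, n_bins=24):
    """Split tokens into equally spaced predictability bins; return (mean, SD) of pitch per bin."""
    edges = np.linspace(predictability.min(), predictability.max(), n_bins + 1)
    # Interior edges only, so that bin indices run from 0 to n_bins - 1.
    idx = np.digitize(predictability, edges[1:-1])
    stats = {}
    for b in range(n_bins):
        vals = pitch[idx == b]
        if vals.size > 1:  # need at least two tokens for a sample SD
            stats[b] = (float(vals.mean()), float(vals.std(ddof=1)))
    return stats

# Placeholder inputs only: evenly spread predictability and a constant pitch value.
demo_predictability = np.linspace(0.0, 1.0, 240)
demo_pitch = np.full(240, 230.0)
demo_stats = bin_pitch_stats(demo_predictability, demo_pitch)
print(len(demo_stats))
```

The mean and SD of a chosen bin would then serve as the place p and width w of the corresponding Spros input.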

Discussion
To summarize, our flat DFT model derived the leaky prosody facts of Mandarin pitch. We demonstrated that a small influence of prosody, estimated solely from bigram predictability, could over time cause divergence between two lexical items of the same phonological category. The production inputs to the model were surface distributions calculated from a spontaneous speech sample (~400,000 tokens; 1,655 speakers). The phonological input was based on the complete distribution of maximum pitch values for high tone syllables in the corpus. The lexical input was initialized as a sample of the high tone category. The prosodic input was the distribution of pitch values at fixed levels of surprisal (bigram predictability). We allowed these three inputs to jointly condition the evolution of the pitch DNF. The dynamics of the field ensured stabilization at a fixed location in the field, which varies from trial to trial due to noise. By updating the lexical input based on the location of field stabilization, we showed that a small degree of lexical differentiation emerges over time.
While the results serve as a promising proof of concept, there are many limitations of the current study. We modelled just one tone (Mandarin high tone, T1), just two lexical items, and just one feature dimension, pitch. Moreover, we did not consider talker normalization or neurophysiologically plausible signal transformations (e.g., ERB, Mel). Additionally, we implemented assumptions about learning, i.e., that mental representations are faithful summaries of experience, which are likely overly simplistic (Olejarczuk, Kapatsinski, & Baayen, 2018). There are many directions in which this work can be expanded to represent more realistic scenarios.
The model has the potential to make interesting predictions for sound change. In the simulations reported here, we only updated lexical inputs based on the location of field stabilization. Of course, phonological and prosodic representations also have to be learned, so a more realistic model would update these as well. In the current simulations, since only the lexical input was updated, the phonological input (tone) functioned to work against lexical drift. That is, since the phonological input does not vary from run to run, it represents a constant force for stabilization at the same location in the field; this works against lexical drift. However, even if we updated the phonological representation on each run of the simulation, the anti-drift force of phonology would still persist to some degree in most realistic situations. The reason is that, typically, there will be more occurrences of a phonological category than of a lexical item that contains that category. For example, there will always be at least as many instances of the high tone category as there are instances of any particular word that contains a high tone. Thus, the phonological category itself will be more stable than any given lexical item. If, however, several words of the same phonological category all shift in the same direction, this could pull the entire phonological category, which would in turn pull the remaining lexical items along. These patterns of sound change are relatively straightforward predictions of the theory, although they require feedback to both the lexical and the phonological representations. The key components that lead to these predictions are (i) a flat model with lexical, phonological, and prosodic inputs to (ii) a DNF with selection dynamics and (iii) feedback to long-term representations at both the phonological and lexical level.
Another consideration in future work is the amplitude of the inputs. We set the amplitude of all three inputs to our pitch DNF to be a = 6 so that each one individually could drive the field to stabilize, given a resting activation of h = -5. The presumption is that a speaker could plan a pitch target on the basis of any one of these inputs without the others. This would mean, for example, being able to hum the pitch of a tone category or prosodic position without activating a lexical item. Having sufficiently strong inputs from each of these sources at once allows the field to stabilize faster than if there were only one input. This makes the prediction that speech planning is faster when all three of these sources, lexical, phonological, and prosodic, are engaged in tandem.

Conclusion
We showed that leaky prosody, as evidenced in Mandarin Chinese, can be derived from a flat model of speech production. Lexical, phonological, and prosodic inputs each exert forces on a Dynamic Neural Field representing pitch. Notably, the forces exerted by these inputs reflect surface distributions in a corpus of spontaneous speech. The model parameters are present in the ambient speech and can be acquired through naïve distributional learning. Our simulations showed that the flat model derives the short timescale effect of prosodic prominence on pitch production as well as the longer timescale effect of lexical drift. Pitch targets in words consistently produced in different prosodic environments gradually come to take on (lexicalize) the influence of those environments.

Figure 3. Illustration of activation peak stabilization over time in a pitch DNF

Figure 4 provides an overview of our flat model architecture. The pitch planning field (center) is a DNF parameterized for selection dynamics. It receives simultaneous input from a lexical pitch target (lexicon), a phonological pitch target (tone), and a prosodic pitch target (prosody). Over time, given the selection dynamics, the field will stabilize on a pitch target, under the influence of inputs. The stable pitch value serves as the target for a single speech production event (short timescale) and feeds back into the lexicon, nudging the long-term representation towards the recent behavior. In the following section, we elaborate on the formal expression of the model.

Figure 4. Flat model architecture in DFT

Input parameter values estimated from the Tang & Shaw (2021) corpus

4.2 Short time scale
For the first simulation, we demonstrate how the flat model, with surface-based input parameters, fared in capturing the effect of prosodic context on pitch. Figure 5 shows the evolution of the field with the same lexical (Slex) and phonological (Sphon) inputs but differing prosodic inputs. The left panel shows the high prominence condition, which stabilizes at 241 Hz; the right panel shows the low prominence condition, which stabilizes at 234 Hz.

Figure 5. DNF evolution for a high tone word produced with high (left) and low (right) prominence
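
The short time scale simulation can be approximated in plain NumPy as below. This is a rough sketch, not the COSIVINA implementation used in the paper: the field range (0 to 499 Hz), the kernel values, the unit Euler step, and the noise scaling are all assumptions, so the peak locations it produces should not be expected to match the reported 241 Hz and 234 Hz. Input means and SDs, amplitude 6, h = -5, τ = 20, and 60 time steps are taken from the text.

```python
import numpy as np

def gaussian_input(x, p, w, a=6.0):
    """Gaussian input with place p, width w (SD), and amplitude a."""
    return a * np.exp(-((x - p) ** 2) / (2.0 * w ** 2))

def interaction_matrix(x, c_exc=15.0, s_exc=5.0, c_inh=5.0, s_inh=12.5, c_glob=0.05):
    """Pairwise interaction weights; all kernel values here are hypothetical."""
    d = x[:, None] - x[None, :]
    exc = c_exc / (np.sqrt(2 * np.pi) * s_exc) * np.exp(-d ** 2 / (2 * s_exc ** 2))
    inh = c_inh / (np.sqrt(2 * np.pi) * s_inh) * np.exp(-d ** 2 / (2 * s_inh ** 2))
    return exc - inh - c_glob

def simulate(inputs, n_steps=60, tau=20.0, h=-5.0, beta=4.0, q=1.0, seed=0):
    """Euler-integrate the field under the summed inputs; return locations and final activation."""
    x = np.arange(0.0, 500.0)                      # assumed pitch axis in Hz
    s = sum(gaussian_input(x, p, w) for p, w in inputs)
    k = interaction_matrix(x)
    rng = np.random.default_rng(seed)
    u = np.full(x.size, h)                         # start at the resting level
    for _ in range(n_steps):
        g = 1.0 / (1.0 + np.exp(-beta * u))        # sigmoidal gate
        noise = q * rng.standard_normal(x.size)    # simplified noise scaling (assumption)
        u = u + (-u + h + s + k @ g + noise) / tau
    return x, u

# (mean, SD) pairs reported under Input parameters.
S_LEX, S_PHON = (241.0, 99.0), (238.0, 94.0)
S_PROS_HIGH, S_PROS_LOW = (233.0, 100.0), (226.0, 92.0)

# Noise switched off (q = 0) so the comparison between conditions is deterministic.
x, u_high = simulate([S_LEX, S_PHON, S_PROS_HIGH], q=0.0)
_, u_low = simulate([S_LEX, S_PHON, S_PROS_LOW], q=0.0)
peak_high = x[np.argmax(u_high)]
peak_low = x[np.argmax(u_low)]
print(peak_high, peak_low)
```

The only difference between the two runs is the prosodic input, so any difference in peak location reflects prosody alone; with noise enabled (q = 1), the peak location varies from run to run.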

Figure 6. Activation across the pitch field for a high tone word produced with high and low prosodic prominence

4.2 Long time scale
Having established an effect of prosodic structure, operationalized as local predictability, on single word production (short time scale), we now consider a longer time scale. We simulated two words, 500 times each. The words start out with the same lexical representation, Slex. One word is produced systematically in a low prominence position and the other in a high prominence position. At the end of each production of a word, i.e., a short time scale simulation, we updated the lexical representation, i.e., the Slex input, for each word with the new pitch value (based on the location of the stable activation peak in the pitch field). To update, we sampled 86 tokens (the same number used to initialize the distribution) from each Slex and replaced one sampled value (selected at random) with the stable pitch value from the simulation. We then recalculated the Slex parameters, p and w, based on the new distribution. The new parameters of Slex then served as input to the next production cycle. This feedback loop allows each token to nudge the underlying Slex distributions in the direction of the stable pitch target. Since prosody leaves an impact on the stabilization process, it can come to influence the lexical representation through feedback over many productions.
The simulation results are shown in Figure 7. The left side of the figure shows where the field stabilizes on each short time scale simulation run. There is variation, within a 20 Hz range, from trial to trial in where the pitch DNF stabilizes. The tendency is for the word in a high prominence position to stabilize at a higher pitch value, but this is not absolute. On some trials, through the influence of noise, the low prominence word ends up with a higher pitch target. This happens more often in earlier simulations than in later simulations. The reason for this is that the lexical representations start to diverge over time. This is shown in the right panel of Figure 7. The high and low prominence lines gradually diverge, showing consistent separation from about the 130th run of the simulation. This is the leaky prosody effect. A small local effect of prosody, if consistently applied, can drive lexical separation between words that started completely homophonous.
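
The lexical update described above can be sketched as follows. It assumes that the 86 tokens are drawn from a Gaussian with the current Slex parameters and that w is recomputed as the sample standard deviation; the function names `replace_one` and `update_slex` and the example stable pitch value of 245 Hz are hypothetical.

```python
import numpy as np

def replace_one(tokens, stable_pitch, rng):
    """Return a copy of tokens with one randomly chosen value replaced by stable_pitch."""
    out = tokens.copy()
    out[rng.integers(out.size)] = stable_pitch
    return out

def update_slex(p, w, stable_pitch, rng, n_tokens=86):
    """Sample tokens from the current Slex, swap in the new production, recompute p and w."""
    tokens = rng.normal(p, w, size=n_tokens)       # assumed Gaussian sampling from Slex
    updated = replace_one(tokens, stable_pitch, rng)
    return float(updated.mean()), float(updated.std(ddof=1))

rng = np.random.default_rng(0)

# Deterministic check of the replacement step on a trivial token set.
zeros = np.zeros(86)
swapped = replace_one(zeros, 241.0, rng)

# One feedback cycle starting from the initial Slex reported in the text (241 Hz, SD 99 Hz).
p_new, w_new = update_slex(241.0, 99.0, stable_pitch=245.0, rng=rng)
print(p_new, w_new)
```

Repeating `update_slex` after every production, with the returned values fed into the next cycle, gives the feedback loop; each cycle changes only one of 86 tokens, so any single production moves Slex only slightly.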

Figure 7.The stable pitch target on each of 500 simulations (left); the value of Slex after updating (right).