Coarticulation with alveopalatal sibilants in Mandarin and Polish: Phonetics or phonology?

Previous work has shown that vowels following alveopalatal sibilants typically exhibit raised second formant (F2) values, which are usually attributed to coarticulatory vowel fronting (e.g. Stevens, 2004 in Mandarin; Bukmaier & Harrington, 2016 in Polish). This paper re-examines the palatalizing coarticulatory effects of the alveopalatal sibilant in Mandarin and Polish. While previous studies have focused on differences in F2 transitions or values at vowel onset, I find that the raised F2 values following alveopalatal sibilants frequently persist through the entire duration of following vowels in Mandarin. This raises the question of whether the raising is a phonetic coarticulation effect or a phonological assimilation effect. I review diagnostics for such a distinction and provide evidence from speech rate which suggests that the raised F2 effect should be analyzed as phonological assimilation in Mandarin, but phonetic coarticulation in Polish. These results have implications for phonological representations and perception in both languages.


Introduction
1.1 Phonetics vs. phonology: Common diagnostics
While the distinction between phonetics and phonology is tenuous, multiple diagnostics have been used in the literature. The three main diagnostics considered here are summarized below. This is not intended to be a complete list of diagnostics; there is a wealth of literature on the classification of "phonetic" vs. "phonological" effects (see e.g. Cohn (2007) for a review). I only intend to highlight these as commonly invoked diagnostics.
1. Gradience vs. categoricity (e.g. Chomsky & Halle, 1968): Phonetic effects are expected to be gradient/continuous while phonological effects are expected to be categorical/discrete. Cohn (2007) argues that while this correlation is important, it is not as strong as is commonly assumed.
2. Extent of segmental effect (e.g. Keating, 1990): Phonetic effects may affect only part of the segment while phonological effects are expected to affect the entire segment. This can be seen as a particular instance of the first diagnostic: phonological effects should categorically affect the whole segment, creating discrete categories.
3. Variation with speech rate (e.g. Solé, 2007): Phonetic (mechanical) effects should have fixed temporal extensions and therefore should not vary with speech rate. Phonological effects should have temporal extensions which do vary with speech rate.
1.2 Mandarin sibilants
Many dialects of Mandarin contrast three voiceless sibilants. The non-alveopalatal sibilants have been characterized as having a variety of different places of articulation: dental, denti-alveolar, retroflex, laminal post-alveolar, and apical post-alveolar. Chang & Shih (2015) provide a review of these claims; some of this variation may be attributable to data collection in different regions. In this study, I use the terms alveolar, retroflex, and alveopalatal (following e.g. Ladefoged & Wu, 1984; Duanmu, 2007; Chang & Shih, 2015). The consonant inventory of Mandarin is given in Table 1 (Duanmu, 2007).
Acoustically, the three sibilants have sometimes been described as having a three-way contrast in spectral center of gravity (COG; Lee, 1999; Lee-Kim, 2011; Kallay & Holliday, 2012). Other studies have reported a two-way COG contrast between the alveolar and the other two sibilants and an F2 onset contrast distinguishing the alveopalatal from the retroflex (Stevens et al., 2004). In perception, several studies have found the primary cue for the retroflex-alveolar contrast to be COG or the position of the lowest spectral prominence (Wu & Lin, 1989; Li, 2008; Chang, 2013). Li (2008) argues that COG is not sufficient to distinguish the alveopalatal from the other two sibilants and that the primary cue distinguishing /ɕ/ is instead the F2 onset of the following vowel.
There is an allophonic restriction on sibilants requiring [ɕ] before high front vowels (Duanmu, 2007; Lin, 2014). Because of this positional neutralization, some have argued that the alveopalatals can be represented as underlying velars which become palatalized before high vowels (Wu, 1994), or that /s/ palatalizes to [ɕ] before [i] (Duanmu, 2007). However, all three sibilants contrast preceding the vowels [a] and [əu u] (Duanmu, 2007; Li, 2008; Lin, 2014). Therefore, all three sibilants are considered to be independent phonemes in many synchronic analyses (e.g. Li, 1999; Cheng, 2011) and are assumed to be distinct phonological categories in this study.
Some dialects of Mandarin (Taiwan Mandarin and other southern varieties) merge the alveolar and retroflex sibilants to alveolar place. This merger is often partial: vowel context, contrastive focus, and social context all influence production of the contrast in speakers with the merger (Chang, 2013). None of our speakers appeared to exhibit the merger. Even if they did, neither the alveopalatal sibilant nor F2 of the following vowel is expected to be affected by the merger, so it is unlikely that the present results would be impacted.

1.3 Polish sibilants
Many varieties of Polish contrast voiced and voiceless sibilants at three places of articulation: alveolar, alveopalatal, and retroflex (Dogil, 1990). The consonant inventory of Polish is shown in Table 2 (Padgett & Żygis, 2007). Multiple studies (Nowak, 2006; Bukmaier et al., 2014) report a spectral center of gravity (COG) contrast between the dental fricative and the other two fricatives in production. Several studies have reported little difference between the alveopalatal and retroflex fricatives in spectral center of gravity and other spectral measures of the frication noise (Żygis & Hamann, 2003; Bukmaier & Harrington, 2016; Lee-Kim, 2011). Instead, the alveopalatal and retroflex fricatives have been described as being distinguished by the transition of the second formant (F2) into the following vowel. When reporting on differences in vowel transitions, some studies quantify the transition by the F2 difference between two vowel timepoints (Nowak, 2006; Chiu, 2009), and some by the F2 value at vowel onset (Halle & Stevens, 1997; Kudela, 1968). Bukmaier & Harrington (2016) analyze the onset of vowel transitions and show higher F2 values in the vowels following the alveopalatal, though these values showed some overlap with vowels following the retroflex. They report F2 trajectories between the onset and midpoint of the following vowel and argue that the raised F2 values following the alveopalatal are evidence for a coarticulatory palatalizing influence.
In perception, Nowak (2006) showed that frication noise alone is sufficient for native speakers of Polish to categorize isolated fricatives. However, it is possible that these listeners are not interpreting the isolated fricatives as speech and therefore have enhanced discrimination. When the Polish fricatives were placed in a VCV context, removal of the formant transitions into the following vowel made the alveopalatal and retroflex fricatives confusable, indicating the primacy of F2 over COG as a cue to the contrast between /ɕ/ and /s ʂ/.
Sibilant mergers have also been observed in Polish, and it has been argued that the Polish three-way sibilant contrast is diachronically unstable. These arguments tend to center on the retroflex as being particularly unstable and predict that it will merge with either the alveopalatal or the dental (Bukmaier et al., 2014; Żygis et al., 2012). Mergers of both types have been reported in some nonstandard dialects of Polish (Żygis et al., 2012; Nowak, 2006; Bukmaier et al., 2014). None of the speakers analyzed here appeared to exhibit either merger.
2 Methods
2.1 Mandarin participants and stimuli
Eleven Mandarin speakers were recorded. All speakers acquired Mandarin natively in China and relocated to the United States for college or high school. All speakers were between 18 and 30 years old, and most were undergraduate students. All recruitment materials (emails, sign-up info, etc.) were distributed in Mandarin orthography. All parts of the experiment were conducted in Mandarin by native speaker research assistants.
The stimuli in Mandarin were CV words and rare words which were intended to function as non-words. Because the Mandarin writing system is logosyllabic, developing new symbols for non-words presents several problems for participant reading. Instead of attempting to design new and orthographically natural characters, we used rare words with existing characters as "non-words". Each stimulus was presented with the simplified Mandarin orthographic character and the pinyin script, a romanized quasi-phonemic orthographic system. With the pinyin presented alongside the logosyllabic characters, the participants were able to pronounce the intended stimulus even if they were unfamiliar with the word or Mandarin character. No participants self-reported trouble reading either orthographic system. The stimuli were read in the carrier phrase "wǒ bǎ X dú yī biàn" ('I read X once').
Mandarin stimuli were crossed according to the following factors: sibilant (3 levels: s ʂ ɕ) × vowel (3 levels: i a u) × word status (3 levels: high frequency/low frequency/non-word) × number of syllables (2 levels) × tone (4 levels). Not all factors could be fully crossed: there is a phonotactic restriction that requires the alveopalatal sibilant before [i], so the three sibilants are only fully crossed in the [a] and [u] contexts. Due to limitations of the lexicon, some of the tones are not fully crossed with all other factors. There were a total of 137 distinct sibilant stimuli. Additional stimuli with word-initial stops and affricates were included as fillers and are not analyzed here. Word-initial non-sibilant fricatives were not included in the task.

2.2 Polish participants and stimuli
The Mandarin data are compared with Polish data which come from a separate study of 3 native Polish speakers, all undergraduate students at the University of Massachusetts Amherst who had acquired Polish natively in Poland.
The stimuli were words and non-words in which the onset consonant was a sibilant and the following vowel was one of [ɛ a ɔ]. Voiced and voiceless sibilants were elicited, but only the voiceless sibilants are analyzed here (Mandarin does not have a corresponding voiced series for comparison). Due to the rich inflectional system in Polish, the lexicon does not include many monosyllabic words. Therefore, the Polish stimuli included mono-, di-, and trisyllabic words. As in the Mandarin experiment, the stimuli were cross-classified according to word frequency. The words were classified into two frequency categories, high and low. We used the Polimorf lexicon (Wolinski et al., 2012) for selecting stimuli and verified native speaker frequency intuitions with orthographic frequency data. 4 All the stimuli were recorded in the carrier phrase "Powiedziała X od razu" ('She said X right away'). Stimuli were crossed according to: sibilant (6 levels) × vowel context (3 levels) × word status (3 levels: high frequency/low frequency/non-word) × number of syllables (3 levels). Due to gaps in the Polish lexicon, not all factors could be fully crossed, 5 and the full stimulus set included 126 stimuli.

2.3 Recording and data processing
The participants were all recorded in a sound-attenuated booth using Audacity software (Audacity Team, 1999-2014) with an M-Audio Fast Track Pro Mobile Audio Interface and a Shure SM10A head-worn microphone. The recordings were sampled at a rate of 44.1 kHz with a bit depth of 16 bits.
The participants were presented with stimuli in the relevant orthography on a laptop computer inside the booth. They were asked to produce the phrases as naturally as possible. The research assistants were trained to give feedback which encouraged natural production. 6 The stimuli were recorded in four separate blocks, each with a different random order, totaling four repetitions of each stimulus for analysis.
After recording, the participants completed a word frequency judgment task with the stimuli. This was to ensure that the frequency data matched the intuitions of the participants. The judgment task consisted of filling out a survey indicating degree of familiarity with each word and took between 2 and 5 minutes to complete.

Data processing and analysis
The recordings were first scanned by the author and research assistants for speech errors. The recordings were then force-aligned using the Montreal Forced Aligner (McAuliffe et al., 2017), which creates Praat (Boersma et al., 2001) textgrids marking boundaries at the word and segment level. The Mandarin data were aligned using a pretrained Mandarin model. 7 A new model was trained to align the Polish data.
A Praat script based on DiCanio (2013) was used to extract spectral moments of the fricatives and formant values of the following vowels. The spectral moments are not analyzed in this paper. The formants were estimated using the Burg method and extracted at 10 ms intervals throughout the duration of the vowel. Formant excursions greater than 1000 Hz over 10 ms were assumed to be tracking errors and were excluded.
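The exclusion criterion for tracking errors can be illustrated with a minimal Python sketch. It is not the Praat script used in the study: the function name `drop_tracking_errors` is hypothetical, and the choice to compare each 10 ms sample against the last retained sample (and to trust the first sample) is an assumption, since the text only specifies the 1000 Hz per 10 ms threshold.

```python
def drop_tracking_errors(f2_track, max_jump_hz=1000.0):
    """Filter a list of F2 values (Hz) sampled at consecutive 10 ms steps.

    Assumption: a sample counts as a tracking error when it differs from the
    most recently kept sample by more than max_jump_hz. The first sample is
    always kept.
    """
    kept = []
    for value in f2_track:
        if kept and abs(value - kept[-1]) > max_jump_hz:
            continue  # excursion larger than the threshold: exclude this sample
        kept.append(value)
    return kept


# Illustrative values only (not data from the study).
cleaned = drop_tracking_errors([1500.0, 1520.0, 2900.0, 1540.0])
print(cleaned)
```

Comparing against the last retained sample, rather than the immediately preceding raw sample, keeps a single spurious spike from also removing the valid sample that follows it.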

Mandarin
As expected based on previous literature, we found consistent differences in onset F2 following the alveopalatal sibilant relative to the other sibilants. Somewhat less expected, we found that these differences frequently persisted throughout the entire duration of the vowel. For /u/, all speakers consistently used higher F2 following the alveopalatal sibilant throughout the vowel duration. For /a/, there was more between-speaker variation in the extent of F2 raising: some speakers extended raised F2 throughout the entire duration of the vowel, while others extended raised F2 through 50-75% of the vowel. There were also vowel differences in within-speaker variation, with speakers generally exhibiting more within-category variation in F2 for sibilants preceding /u/. Data from representative speakers are shown in Figure 1. All figures were created using the ggplot2 package (Wickham, 2009) in R (R Core Team, 2013). These graphs show average F2 trajectories with Loess smoothing over normalized time.
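Plotting trajectories over normalized time requires placing vowels of different durations on a common 0-to-1 scale. The sketch below shows one simple way to do this by linear interpolation. It is an illustration only: the function name `normalize_time` and the default of 11 points are hypothetical, and the figures in this paper were produced with Loess smoothing in ggplot2, which this sketch does not reproduce.

```python
def normalize_time(track, n_points=11):
    """Resample a formant track onto n_points equally spaced proportions of
    the vowel (0.0 = vowel onset, 1.0 = vowel offset) by linear interpolation.
    """
    if len(track) < 2 or n_points < 2:
        raise ValueError("need at least two samples and two output points")
    last = len(track) - 1
    resampled = []
    for k in range(n_points):
        pos = k * last / (n_points - 1)  # fractional index into the track
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        resampled.append(track[lo] * (1 - frac) + track[hi] * frac)
    return resampled


# Illustrative values only: a five-sample track reduced to onset, midpoint, offset.
print(normalize_time([1800.0, 1700.0, 1600.0, 1500.0, 1400.0], n_points=3))
```

After this step, tracks from short and long vowels have the same number of points and can be averaged pointwise within each sibilant and vowel context.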
A mixed-effects linear regression was performed predicting F2 at vowel offset (as we are interested in whether or not the raising effect persists through the entire vowel). Fixed effects were preceding sibilant (C), vowel (V), and vowel duration (as an indicator of speech rate). Interactions between C×V and C×vowel duration were also included, along with random intercepts for speaker and word. The fixed effects are summarized in Table 3.
There is a significant effect of preceding C on F2 offset, with F2 higher after the alveopalatal [ɕ] relative to both [s ʂ]. There is also no significant effect of the vowel [a], and no significant interactions between [a] and the other sibilant contexts. Together, these results indicate that F2 values at vowel offset are significantly higher following alveopalatal [ɕ] in both vowel contexts.
There is also a significant effect of vowel duration for the intercept [ɕu] with a positive estimate. This indicates that F2 at vowel offset increases with vowel duration. The significant interactions between [s ʂ] and vowel duration indicate that this relationship also holds in the other sibilant contexts. Together, these results indicate that F2 values at vowel offset are significantly higher in longer vowels following all sibilants.
4 Thanks to Gaja Jarosz for providing this data and offering helpful commentary on the experimental design.
5 Non-words were used; however, word status is still not fully crossed with all the other factors if there are lexical gaps.
6 This included things like suggesting the participant speak as if they were talking to a friend and not giving a presentation, suggesting they say the phrase "in one breath" to discourage pausing before the stimulus, etc.
7 Available at https://montreal-forced-aligner.readthedocs.io/en/latest/pretrained_models.html.
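The regression described in this section, predicting F2 at vowel offset, can be written out schematically as follows. The notation is an assumption introduced here for illustration and does not appear in the original analysis: $c(i)$ and $v(i)$ are the sibilant and vowel of token $i$, $d_i$ is its vowel duration, and $u_j$ and $w_k$ are the random intercepts for speaker $j$ and word $k$. Treatment coding with [ɕu] as the reference level is inferred from the description of the intercept.

```latex
\begin{align*}
F2^{\mathrm{offset}}_{i} &= \beta_0
  + \beta^{C}_{c(i)}
  + \beta^{V}_{v(i)}
  + \beta^{D} d_i
  + \beta^{CV}_{c(i),v(i)}
  + \beta^{CD}_{c(i)}\, d_i
  + u_{j(i)} + w_{k(i)} + \varepsilon_i, \\
u_j &\sim \mathcal{N}(0, \sigma^2_u), \quad
w_k \sim \mathcal{N}(0, \sigma^2_w), \quad
\varepsilon_i \sim \mathcal{N}(0, \sigma^2).
\end{align*}
```

Under this coding, $\beta^{D}$ is the duration slope in the reference context [ɕu], and $\beta^{CD}_{c}$ is the adjustment to that slope for each of the other sibilants.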

Polish
As expected based on previous literature, the Polish speakers also produced differences in F2 transitions. While the speakers sometimes produced F2 differences that persisted through the entire duration of the vowel, the results are much less consistent (across speakers and tokens) relative to the Mandarin data.
Data for [a] from a representative Polish speaker are shown in Figure 2. For this speaker (as with the other Polish speakers), the raised F2 of [a] following [ɕ] typically does not persist through the entire vowel. This is somewhat similar to the Mandarin [a] data, where some speakers also did not extend the raised F2 throughout the vowel duration. The Polish data differ from the Mandarin data in that there were no vowel contexts where speakers consistently raised F2 for the entire vowel duration.
A mixed-effects linear regression was performed on the Polish data with the same effect structure used for the Mandarin data (described in §3.1). There were no significant effects of vowel quality, preceding

Discussion
In Mandarin, our results show that the influence of the alveopalatal sibilant on the following vowel is not merely in the transitions to the vowel; vowels following the sibilant have a higher F2 throughout the duration of the vowel, even at vowel offset. I argue that these data suggest a phonological analysis of vowel fronting following alveopalatal sibilants in Mandarin. Solé (2007) proposes that phonetic or "mechanical" effects should have temporal extensions which are fixed and independent of speech rate. Therefore, if raised F2 were a purely mechanical effect, we would see raised F2 for a fixed period of time across all vowel durations. In longer vowels, the offset would then lie further beyond that fixed period, so we would expect a significant negative effect of vowel duration on F2 offset. However, we found a significant positive effect of vowel duration across all sibilant contexts, including [ɕ]. F2 values following [ɕ] are higher for longer vowels at vowel offset. This is the opposite of what would be predicted if F2 raising were a purely mechanical effect.
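The prediction of the mechanical account can be made concrete with a toy calculation. The sketch below is a hypothetical illustration, not a model fitted to the data: the function name, the linear decay, and all numeric values (vowel target, size of the raising, and the fixed 80 ms extent) are assumptions chosen only to show the direction of the predicted effect.

```python
def offset_f2_fixed_extent(duration_ms, target_hz=1200.0, boost_hz=400.0, extent_ms=80.0):
    """Predicted F2 at vowel offset if raising has a FIXED temporal extent.

    Assumption: F2 starts boost_hz above the vowel target at vowel onset and
    decays linearly to the target over extent_ms, regardless of vowel duration.
    """
    remaining = max(0.0, 1.0 - duration_ms / extent_ms)  # share of the boost left at offset
    return target_hz + boost_hz * remaining


# Illustrative durations only: the raising left at offset shrinks as vowels lengthen.
for duration in (40.0, 80.0, 160.0):
    print(duration, offset_f2_fixed_extent(duration))
```

With a fixed extent, F2 at offset can only fall or stay flat as duration grows, which corresponds to a negative (or null) duration effect. The positive duration effect observed in Mandarin is not compatible with this pattern.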
The fact that F2 offset is higher in longer vowels for all sibilant contexts is likely due to increased anticipatory coarticulation with the following segment [d]. Despite this anticipatory raising in the other sibilant contexts, F2 at vowel offset is still significantly higher following [ɕ].
These results also bear on the analysis, noted in §1.2, in which /s/ palatalizes to [ɕ] before an underlying /i/ (Duanmu, 2007). It is difficult to maintain this analysis in light of the present results, which suggest the vowel following the alveopalatal sibilant is also [+front]. This would require the underlying /i/ to trigger bidirectional assimilation to both the preceding /s/ and the following /a/. If we instead assume an underlying [ɕa] sequence, we can easily analyze the fronting as assimilation where [+front] spreads from the sibilant to the following vowel.
In Polish, our results show that the coarticulatory vowel fronting following alveopalatal sibilants does not have a significant effect on F2 at vowel offset; the effect does not extend throughout the vowel duration. This, combined with the lack of any speech rate effects, suggests a phonetic analysis of alveopalatal coarticulation in Polish. In our data, Polish speakers typically only exhibit raised F2 for a portion of the vowels following [ɕ].
Previous work on perception has found that Polish speakers were only able to correctly identify sibilants when the appropriate following vowel transitions were present (Nowak, 2006). The present results suggest the potential for a stronger perceptual effect in Mandarin. Because F2 differences following alveopalatal sibilants typically extend throughout the entire vowel duration, Mandarin speakers may be able to identify the preceding fricative from only hearing the following vowel, or even only the offset of the following vowel. Further perceptual work will need to be done to determine exactly how listeners use the vocalic information for sibilant identification.

Conclusion
In Mandarin and Polish, vowels following alveopalatal sibilants exhibit raised F2, though the patterns differ between the two languages. I argue that F2 raising is due to phonetic coarticulation in Polish: raising only affects the vowel transitions and not the whole segment, and the extent of raising does not appear to vary with vowel duration. In Mandarin, however, I argue that F2 raising is due to phonological assimilation: raising affects the whole segment, regardless of vowel duration. Further work will need to be done to determine the effects of raised F2 on perception, specifically whether listeners might use the entirety of vocalic information for sibilant identification.