Toward processing of prosody in spontaneous Japanese

This paper considers how prosody in spontaneous Japanese is processed. We conducted Rapid Prosody Transcription (RPT) perception experiments on the Corpus of Spontaneous Japanese (CSJ) and investigated how boundaries and prominences are perceived. We recruited three groups of participants from different Japanese dialects and found that (i) F0 is not a strong prominence cue in Japanese, contra the Japanese literature on focus prominence (Pierrehumbert & Beckman (P&B) 1988; Kori 1989; Ishihara 2016), and (ii) Japanese allows multi-headed and headless intonation phrases, so that P&B’s reset theory, i.e. that focal prominence resets boundary phrases, faces empirical difficulties. We also found that (iii) both content words and function morphemes get highlighted in Japanese, and (iv) perception strategies vary cross-dialectally, with listeners from different dialects perceiving boundaries and prominences differently.

2.1. RAPID PROSODY TRANSCRIPTION. RPT was developed by Cole and her colleagues and aims to broaden the research target to cover spontaneous speech. They consider that listeners differ in how they perceive prosody for the same utterance, and they use prosodic transcription as a tool for prosody research. Listeners identify and mark prominences and boundaries based on their auditory impression of an utterance. Transcribers are given minimal instructions such as "mark prominent words that the speaker has highlighted for the listener, to make them stand out" and "mark boundaries between words that belong to different chunks that serve to group words in a way that helps listeners interpret the utterance" (cf. Cole et al. 2010). They are given no example transcriptions and no feedback on their transcription. A boundary (b-) score and a prominence (p-) score, each ranging from 0 to 1, are calculated for each word (cf. Table 3). They indicate the proportion of participants who marked the respective word. Higher values indicate stronger perceptual salience of the prosodic element.
The main findings of the previous RPT studies are that (i) boundaries are cued by higher syntactic categories, i.e. S & SBAR and Conjunction, and also by the non-syntactic categories of Discourse Markers (DM) and Disfluencies (DISF) in English (cf. Cole et al. 2010), (ii) prominences are cued by the acoustic features of F0, duration, intensity, and pitch movement LH in German, and by the final position of a boundary phrase in English (cf. Baumann & Winter 2018; Baumann & Schumacher 2020; Cole et al. 2019), and (iii) the cross-linguistic difference in prominence perception is statistically significant among English, French, and Spanish (cf. Cole et al. 2019).
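The score calculation described above can be sketched as follows. This is a minimal illustration, not the scoring code used in the RPT studies: the function name `rpt_scores` and the toy marking matrix are hypothetical, and we assume each listener's response is stored as one row of 0/1 values, one per word.

```python
from typing import List


def rpt_scores(markings: List[List[int]]) -> List[float]:
    """Return one score per word: the proportion of listeners who marked it.

    markings[l][w] is 1 if listener l marked word w, and 0 otherwise.
    The same function serves for b-scores (boundary marks) and
    p-scores (prominence marks).
    """
    if not markings:
        return []
    n_listeners = len(markings)
    n_words = len(markings[0])
    if any(len(row) != n_words for row in markings):
        raise ValueError("every listener must label the same number of words")
    return [sum(row[w] for row in markings) / n_listeners for w in range(n_words)]


# Hypothetical toy data (not from the experiment): 4 listeners, 5 words.
boundary_marks = [
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1],
]
b_scores = rpt_scores(boundary_marks)
print(b_scores)  # [0.0, 0.25, 0.75, 0.0, 1.0]
```

A score of 1.0 means that every listener marked the word, and a score of 0.0 means that none did.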

OUR RPT EXPERIMENT.
2.2.1. METHODOLOGY. We conducted RPT experiments using 13 excerpts from the Corpus of Spontaneous Japanese (CSJ), released by the National Institute for Japanese Language and Linguistics (NINJAL). CSJ is a corpus of monologues and dialogues by more than 1,400 native speakers of Tokyo Japanese, with a total recorded time of about 660 hours. We used 6 lecture-type monologues and 7 pseudo-lecture-type monologues. The excerpts used in our experiments are 16 to 41 seconds long. Since Japanese is an agglutinative language, we segmented our data set at the morpheme level (cf. (1)); our materials contain 490 content words and 490 function morphemes.
We recruited three groups of Japanese listeners: 35 Tokyo Japanese (TJ) listeners (mean age 24.8, SD=0.5), 27 Osaka Japanese (OJ) listeners (mean age 25.3, SD=3), and 11 Northern Kanto (NK) listeners (mean age 23.2, SD=4). Prosody varies among dialects in Japanese. On the word (ω) level, TJ and OJ have accented (H*L) and unaccented (LH) words, while NK has only unaccented words. On the phrase (φ) level, TJ and NK allow dephrasing, i.e., the process of deleting a φ-phrase when syntactic words form a prosodic phrase, while OJ does not (cf. Igarashi 2014). On the intonation phrase (ι) level, TJ is downward (L%), NK is upward (H%), and OJ is both downward and upward. We aim to examine whether listeners from different dialects have the same perception tendencies or not.
We conducted online perception experiments via the Yahoo! Crowd-Sourcing Service. After a practice session, our participants listened to each excerpt twice on a PC, while marking boundaries and prominences on the text by clicking a mouse. Their responses were saved on the computer via LMEDS, developed by Mahrt (2016). It took about 30 minutes for our participants to complete the task. Our participants were paid in Yahoo! points. Our experiments were approved by the Ethical Committee of NINJAL.
For the syntactic analysis of boundary-marking, syntactic categories of S, S-bar, Conjunction, XP, Particle, and non-syntactic categories of Disfluency and Discourse Marker are assigned at the left edge ('[X') and the right edge (']Y') of each morpheme following Cole et al. 2010, as in (1).
, what I like is a dog.' (N.B. In the experiment, the text is written in Japanese Hiragana, i.e. Japanese cursive characters, without punctuation.)
We predicted that, due to the downtrend in production and the F0 normalization in perception (cf. Pierrehumbert 1979), Tokyo Japanese does not employ F0 as a prominence cue. Another prediction is that, if maximal F0 is not a crucial prominence cue, prominence is not expected to appear at the initial position of a boundary phrase. Prominence, therefore, does not reset a prosodic phrase.
2.2.2. RESULTS. We list four main results of our experiments.
(i) The overall inter-listener agreement is κ = 0.638 for the b(oundary)-score and κ = 0.359 for the p(rominence)-score (Fleiss's kappa). As in stress languages, listeners agree more readily on boundaries than on prominences in the perception of Japanese. (ii) The correlation between the b-score and the p-score is weak (Pearson's r = 0.12). (iii) The numbers of markings vary cross-dialectally, as shown in Table 1, which gives the numbers of b-/p-markings with scores larger than 0.2. (iv) Boundaries are cued by the syntactic categories of S, ConjP, NP, and VP and the non-syntactic categories of TopP and DM (cf. Figure 1). A one-way ANOVA on the b-score across the three dialects gave F(2, 2940) = 0.48, p = 0.01, and the Bonferroni post-hoc test is significant only between TJ and OJ (p = 0.01). The story is different in prominence perception. A one-way ANOVA on the p-score across the three dialects gave F(2, 2940) = 1.55, p < 0.001, and the Bonferroni post-hoc test is significant between TJ and OJ (p = 0.001), TJ and OJ as well as OJ and NK (p < 0.001, for both).
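For readers unfamiliar with the agreement measure in (i), the following is a minimal sketch of Fleiss's kappa for binary RPT markings. It is not the script used in our analysis: the helper names `to_counts` and `fleiss_kappa` and the toy data are hypothetical, and we assume that every listener labels every word as either marked or unmarked.

```python
from typing import List


def to_counts(markings: List[List[int]]) -> List[List[int]]:
    """Convert a listener-by-word 0/1 matrix into per-word [marked, unmarked] counts."""
    n_listeners = len(markings)
    n_words = len(markings[0])
    marked = [sum(row[w] for row in markings) for w in range(n_words)]
    return [[m, n_listeners - m] for m in marked]


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss's kappa; counts[i][j] = number of raters assigning item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])
    # Observed agreement: mean proportion of agreeing rater pairs per item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cats = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    p_e = sum(p * p for p in p_cats)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy data (not from the experiment): 4 listeners, 5 words.
toy_marks = [
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1],
]
kappa = fleiss_kappa(to_counts(toy_marks))
print(round(kappa, 3))  # 0.583
```

In the toy data the observed agreement is 0.8 and the chance agreement is 0.52, which gives κ = 0.28 / 0.48.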
Table 2 shows the results of regression analyses between b-/p-scores and acoustic features. Neither range F0 nor duration predicts b-scores. P-scores are predicted to some extent by range F0, but not by Max(imal) F0, duration, or intensity.
3.1. ANSWERS TO RQS. We had two research questions when we started this study. RQ1 was 'What are boundary and prominence cues in Japanese spontaneous speech? More concretely, is F0 a prominence cue in Japanese, as reported in the previous studies?' RQ2 was 'Does prominence reset a phrase boundary, as the Japanese focus literature suggested?'
Our answer to RQ1 is that boundaries are not cued acoustically (cf. Table 2) but are cued by syntactic and non-syntactic categories (cf. Figure 1). Cole et al. (2010) claim that English uses the higher syntactic categories of S & SBAR and Conjunction as boundary cues (cf. Figure 2). Comparing Figure 1 with Figure 2, we see that Japanese uses the lower syntactic categories of NP and VP as boundary cues in addition to higher syntactic categories. We also observe that the non-syntactic categories of Disfluencies (DISF) and Discourse Markers (DM) are less effective boundary cues in Japanese than in English.
As for prominence, Table 2 shows that prominences are not predicted by Max F0, duration, or intensity in Japanese. Note here that the perception of prosody in a pitch language requires some caution. In a pitch language, pitch gradually lowers as the utterance continues, so listeners have to normalize F0 when they perceive it. Pierrehumbert (1979) claims, based on perception experiments, that when two stressed syllables sound equal in pitch, the second one is lower in F0 due to declination. Figure 3 (right), for example, illustrates two focal prominences on chocho 'major key' and tancho 'minor key' marked by the participants of our RPT experiments. We can see that the F0 peak of the second prominent word tancho is lower than that of the first prominent word chocho, while they sound equally prominent. This is in line with Pierrehumbert's (1979) claim that F0 is normalized when we perceive it. Table 2 shows that range F0 is a better predictor of prominence than Max F0 in Japanese. Compare Figure 3 (left) with Figure 3 (right), for example; in the former, no prominence is marked, while in the latter, with expanded pitch ranges, three prominences are marked by our participants. Thus, our answer to RQ1, 'Is F0 a prominence cue in Japanese?', is 'no'.
RQ2, 'Does prominence reset a phrase boundary, as claimed in the Japanese focus literature?', is hard to answer unless we know what a prosodic phrase is. The Japanese literature has a long history of studies on lexical accents and prosodic phrasing (cf. McCawley 1968; Poser 1984; P&B 1988; Selkirk & Tateishi (S&T) 1988; Kubozono 1988, 2007; Ito & Mester 2012, among many others), but the prosody above the φ-phrase level (cf. (2)) is understudied.
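The contrast between Max F0 and range F0 under declination can be illustrated with a small sketch. It rests on assumptions that are not part of our analysis: the F0 tracks are invented values in Hz, unvoiced frames are coded as 0, range is measured in semitones relative to a hypothetical 100 Hz reference, and the names `hz_to_semitones` and `f0_features` are ours for illustration only.

```python
import math
from typing import Dict, List


def hz_to_semitones(f0_hz: float, ref_hz: float = 100.0) -> float:
    """Convert an F0 value in Hz to semitones relative to ref_hz (assumed reference)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def f0_features(f0_track_hz: List[float]) -> Dict[str, float]:
    """Return Max F0 and range F0 (max minus min) in semitones for one word.

    Frames with a value of 0 are treated as unvoiced and skipped (an assumption).
    """
    voiced = [f for f in f0_track_hz if f > 0]
    if not voiced:
        raise ValueError("no voiced frames in this word")
    semitones = [hz_to_semitones(f) for f in voiced]
    return {
        "max_f0_st": max(semitones),
        "range_f0_st": max(semitones) - min(semitones),
    }


# Invented F0 tracks: the second word occurs later in the utterance, so its
# whole contour is lowered by declination, but its pitch excursion is as wide.
early_word = [200.0, 0.0, 250.0, 220.0]
late_word = [160.0, 200.0, 0.0, 176.0]

early = f0_features(early_word)
late = f0_features(late_word)
print(early["max_f0_st"] > late["max_f0_st"])                   # True
print(abs(early["range_f0_st"] - late["range_f0_st"]) < 1e-9)   # True
```

In this toy case the later word has a lower F0 peak but the same pitch range, which is the kind of configuration in which a range-based measure and a peak-based measure diverge.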
(2) Prosodic Hierarchy (Féry 2017:36)
υ utterance (corresponds roughly to a paragraph or more)
ι-phrase intonation phrase (corresponds roughly to a clause)
φ-phrase prosodic phrase (corresponds roughly to a syntactic phrase)
ω-word prosodic word (corresponds roughly to a grammatical word)
F Foot (metrical unit)
σ syllable (strings of segments)
μ Mora (unit of syllable weight)
Japanese prosody is complex and is formed compositionally by lexical pitch accents, phrasal tones, and boundary tones (Féry 2017: 248). Unlike stress languages like English, Japanese assigns its lexical accents by bi-tonal H*L. Japanese accent is lexically determined, and its placement is often unpredictable. Tokyo Japanese is a moraic language in which accent is realized on a mora. Figure 4 shows the pitch contours of the declarative sentence Naoya-mo oyoida 'Naoya also swam' from Venditti et al. (2008:488). It is a good example to show that the same utterance is assigned a variety of pitch movements in Japanese.
Figure 4. Pitch movements of Naoya-mo oy'oida 'Naoya also swam' (Venditti et al. 2008: 488)
McCawley (1968) posits the levels of the Minor Phrase (MinP), the Major Phrase (MP), and the utterance for Japanese prosody. McCawley (1968) and Poser (1984) claim that the MP is the domain of 'catathesis', i.e., a lowering of the F0 contour induced by the H*L pitch accent. P&B (1988) assume the Accentual Phrase (AP), the Intermediate Phrase, and the utterance. P&B (1988:16) define the AP as a phrase bearing at most one pitch accent whose periphery is marked with an %H at the beginning and an L% at the end. (X)J_ToBI (cf. Venditti 1995; Maekawa 2011, among others) posits the levels of the AP, the Intonational Phrase (IP), and the utterance. Venditti et al. (2008) claim that L%, H%, LH%, and LHL% are phrase-final tones and that they are assigned at the AP in their framework. The treatment of intonational tones is also inconsistent in the Japanese literature; Poser (1984) claims that the intonational H% is a final tone inserted at the MP, but P&B (1988) assume that the intonational H% is attached at the utterance level. It is not yet clear with what phrase these tones are aligned. Ito & Mester (2012) claim that there is no need to distinguish a MinP from an MP and assume φ- and ι-phrases that are recursive. Their framework does not assume a phrase that is unique to Japanese, like the AP or the MP, but fits in the general prosodic hierarchy (2) of the prosodic literature. It is generally accepted in the Japanese literature that lexical tones play some part in a larger organization of phrases or intonation patterns, but no standard view of phrasal patterns above the word level has been established (cf. P&B 1988:9). To avoid irrelevant confusion in terminology, we will assume the universal prosodic hierarchy (2) in our study (for further discussion, see Mizuguchi & Tateishi (M&T) in press).
Back to our RQ2, P&B (1988) claim that focus appears at the leftmost position of an ι-phrase (i.e. the intermediate phrase in their terms) and resets a prosodic phrase. In their framework, a boundary is inserted after the focus. Their theory predicts that every ι-phrase has a focus at its leftmost position. However, Shinya (1999) and Kubozono (2007) empirically refuted P&B's reset theory at the sentence level. Our RPT experiments found that the correlation between boundary and prominence perception is very weak (Pearson's r = 0.12), and also show that focal prominence appears in phrase-initial, phrase-mid, and phrase-final positions in spontaneous speech (cf. Table 1). We also found that some phrases do not mark focal prominence at all (cf. Figure 3 (left)). We consider these to be serious problems for the reset theory, and our answer to RQ2 is 'no'.
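How the position of perceived prominence within a boundary phrase can be checked against the reset theory is sketched below. This is an illustration under stated assumptions rather than our actual procedure: we assume that a boundary follows a unit whose b-score exceeds 0.2 and that a unit counts as prominent when its p-score exceeds 0.2 (the threshold mentioned for Table 1), and the function names and scores are hypothetical.

```python
from typing import List


def split_into_phrases(b_scores: List[float], threshold: float = 0.2) -> List[List[int]]:
    """Group unit indices into phrases; a boundary follows unit i if b_scores[i] > threshold."""
    phrases: List[List[int]] = []
    current: List[int] = []
    for i, b in enumerate(b_scores):
        current.append(i)
        if b > threshold:
            phrases.append(current)
            current = []
    if current:  # trailing units without a final boundary mark
        phrases.append(current)
    return phrases


def prominence_positions(
    phrase: List[int], p_scores: List[float], threshold: float = 0.2
) -> List[str]:
    """Label each prominent unit in a phrase as 'initial', 'mid', or 'final'.

    A one-unit phrase is labeled 'initial' (a simplifying assumption).
    An empty result means the phrase is headless; two or more labels
    mean it is multi-headed.
    """
    labels: List[str] = []
    for k, i in enumerate(phrase):
        if p_scores[i] > threshold:
            if k == 0:
                labels.append("initial")
            elif k == len(phrase) - 1:
                labels.append("final")
            else:
                labels.append("mid")
    return labels


# Invented scores for nine units (not from the experiment).
toy_b = [0.0, 0.1, 0.6, 0.0, 0.0, 0.7, 0.1, 0.0, 0.9]
toy_p = [0.5, 0.0, 0.4, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0]

phrases = split_into_phrases(toy_b)
positions = [prominence_positions(ph, toy_p) for ph in phrases]
print(positions)  # [['initial', 'final'], [], ['mid']]
```

The reset theory would predict exactly one `'initial'` label per phrase; the toy output instead shows a multi-headed phrase, a headless phrase, and a phrase with mid prominence, which are the three patterns discussed above.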

NEW FINDINGS.
In the course of our RPT experiments, we made two new findings. One is that not only content words (73% - 84%) but also function morphemes (16% - 27%) are perceived as prominent in Japanese spontaneous speech (cf. Table 1). The RPT studies on stress languages report that prominence is perceived on content words only (cf. Baumann & Winter 2018 for German; Cole et al. 2019 for English, French, and Spanish; Bishop et al. 2020 for English). Japanese is an agglutinative language, and the Japanese literature has long observed that function morphemes are also subject to highlighting (cf. Oishi 1959; Kawakami 1963; Kuroda 1965; Kuno 1973; Taniguchi & Maruyama 2001; Hara 2006; Tomioka 2009, among others). (3) is an example from Tomioka (2009), where prominence is marked either on the subject KEN or on the contrastive particle WA in (3A).
Ken-TOP pass-PAST '(At least) Ken passed.' (Tomioka 2009: 119)
The prominence of the subject marks information-new focus, and the prominence of the particle marks contrastive focus in (3A) (for the kinds of Japanese focus, see M&T in press). The former is realized by a local F0-boost, and the latter is realized by a boundary tone H%/HL% in (3A). Tomioka (2009) claims that the effect of the prominence on particles is pragmatic in nature; in (3A), the contrastive reading of wa induces an alternative set {Ken passed, Mary passed, John passed…}, in the sense of Rooth (1986, 1992, 2016), and implies that other people possibly did not pass. Without the prosodic prominence of the topic marker wa, (3A) has the 'thematic' reading in the sense of Kuno (1973), and there is then no implication about other people. Following Tomioka (2009), we claim that the prominence of particles induces an alternative set and that it provokes a pragmatic implicature.
We observed that function morphemes such as particles, tense markers, and complementizers are perceived as prominent in our experiments (cf. the numbers in parentheses in Table 1). What is noteworthy is that it is not the function morpheme itself that is contrastive. In (3A), for example, the prominent topic marker wa is not in contrast with other particles like ga (SUBJ) or no (GEN), and the domain of focal prominence is wider than the particle itself. In the framework of alternative semantics, the domain of prominence is determined syntactically. When the head is prominent, the domain of prominence covers the whole syntactic phrase, and the semantic type of an alternative set is subsequently determined (cf. Rooth 1986, among others). The syntactic structure of Japanese is much disputed in the literature and is not yet settled (cf. Kayne 1994; Fukui 1995; Saito & Fukui 1998; Whitman 2001, among many others). It is often claimed that Japanese lacks D, and Fukui (1995) proposes the category K(ase) as a near-equivalent of D, which is the head of the Kase Phrase (KP). On this assumption, the topic marker wa, for example, can be assumed to be the head of the KP in (4a). FoC covers the whole syntactic phrase KP in (4b), and the semantic type is the type of KP, that is, <e,t>. The alternative set is contextually determined, as in (4c), and prominence is realized as a pitch movement, as in (4d); H% or HL% is assigned to the head of a topic phrase and marks prominence. Notice here that the topic marker wa is not lexically accented. H%/HL% is a boundary tone.
The story is different where content words are highlighted in Japanese. Pitches of content words are lexically determined as H*L for accented words and LH for unaccented words in Tokyo Japanese. Lexical accents are kept even when the words are highlighted (cf. Figure 3). We consider that the domain of a highlighted content word is the word itself, and we propose that it induces not an alternative set but a singleton set. (5) illustrates how Ken in (3A) is assigned prominence by F0-boost.
Our second new finding is a cross-dialectal difference in the perception of boundaries and prominences among the three dialects we have considered. The Japanese literature has found that prosody varies among dialects in Japanese. Igarashi (2014) shows that TJ and NK allow dephrasing, i.e., the process of deleting a φ-phrase (intermediate phrase in Igarashi's terms) when conjoining syntactic words to form a prosodic phrase, while OJ does not. We did not expect native Japanese listeners to perceive the same Tokyo Japanese speech differently depending on the dialect they speak, but Table 1 shows that OJ listeners mark boundaries more frequently than TJ and NK listeners. Figure 5 is an example where only OJ listeners mark a boundary (b-score 0.296) between the object takuan-o (pickled radish-ACC) and the verb tsuke-tei-ta (pickle-PROG-PAST) in (6a).
Dephrasing is a process of deleting a φ boundary when we spell out the phonological output from a morphosyntactic input. Kratzer & Selkirk (2020) propose a constraint DephraseGiven in English, and recent studies on tones in Lekeitio Basque and Xitsonga (cf. Elordieta & Selkirk 2022; Lee & Selkirk 2022) argue that dephrasing applies in these tone languages, affecting prosody. Applying dephrasing to Japanese, we propose that (6a) is a morphosyntactic input, and (6b) and (6c) are phonological outputs without and with dephrasing, respectively. We know that (6b) and (6c) are production models, and we still do not know if they account for our perception tendencies. If we are on the right track in assuming that Japanese is a language that allows dephrasing and that production strategy induces perception bias, we can account for why TJ and OJ differ in perceiving boundaries. OJ does not dephrase at either the production or the perception level. Consider Figure 6 and Table 3.
Besides the difference in boundary marking, Table 1 also shows that prominence marking differs more than boundary marking among the listeners of the three dialects. TJ listeners perceive more prominences than OJ and NK listeners. It is hard to say why, but Table 3 may give us a clue, as it shows that TJ listeners perceive the whole syntactic category as highlighted, while OJ and NK listeners perceive only the head as prominent given the same highlighting maneuver. If our observation of the perception bias of TJ listeners is correct, this will lead us to reconsider the framework of Alternative Semantics discussed above in (4), where only the head of FoC is aligned with a prosodic prominence. This point needs further consideration.

Conclusion and implication.
Japanese, a pitch language, employs a different perception strategy from stress languages. We have observed that there are two ways to mark prominence in Japanese: local expansion of pitch range and boundary tone. The former is realized locally on content words, and the latter is assigned to function morphemes in the mid and final positions of a boundary phrase. Cross-linguistic perception variations have been observed in the literature (cf. Cole et al. 2019, among others). Our study suggests a cross-dialectal perception bias. The Japanese literature has long reported a cross-dialectal difference in production. If the difference in production affects perception, it will open a new perspective on human cognition. Further study is needed.

Figure 2. Mean b-score per category at the left and right edge (Cole et al. 2010:1161) (N.B. XP covers NP, VP, ADJP, ADVP, and PP, and S2 codes subordinate clause)

Figure 3. LEFT: Pitch movement without prominence markings in an IP (extracted from CSJ file #A01f0055); RIGHT: Pitch movement with three prominences marked in an IP (extracted from CSJ file #S00f0082)

(4) a. syntactic labeling: [KP [N Ken]-[K wa]] TOP
b. FoC assignment: [KP [N Ken]-[K wa]] FoC
c. alternative set: [[Ken-wa]] FoC = {λw. person y in w | y ∈ D<e,t>} = {Ken, Mary, John …}
d. prosodic prominence: Ken-wa H%/HL%
(5) a. syntactic labeling: [KP [N Ken]-[K wa]] TOP
b. FoC assignment: [KP [N Ken] FoC -[K wa]]
c. singleton set: [[Ken FoC -wa]] = {λw. person in w | y ∈ D<e,t> & |D|=1} = {Ken}
d. prosodic prominence: Ken-wa H*L with F0-boost

Figure 5. Pitch movement of (6a), extracted from CSJ file #S00f0095

Table 3. Coding of boundaries and prominences of Figure 6