Homophone auditory processing in cross-linguistic perspective

Previous studies have reported conflicting results on the effects of homophony in visual word processing across languages. Having found significant differences in homophone density among Japanese, Mandarin Chinese and English, we conducted two experiments to compare native speakers' homophone auditory processing across these three languages. A lexical decision task showed that the effect of homophony on word processing was significantly less detrimental in Japanese than in Mandarin and English. A word-learning task showed that native Japanese speakers were the fastest at learning novel homophones. These results suggest that language-intrinsic properties influence the corresponding language processing abilities of native speakers.

2.1. METHOD. We used the lexicons generated from the CALLHOME corpora of these six languages (Arabic: Canavan et al., 1997; English: Kingsbury et al., 1994; German: Karins et al., 1997; Japanese: Kobayashi et al., 1996; Mandarin: Huang et al., 1996; Spanish: Garrett et al., 1996). Before calculation, we excluded acronyms, interjections, proper names and affixes. Dialectal words (specifically, words from the Kansai dialect in Japanese) were also excluded. For Japanese, we kept only one entry for each word that is represented by more than one orthographic form in the lexicon.
To measure the degree of homophony in each language, we counted (1) the number of phonological sequences that correspond to more than one lexical entry and (2) the total number of phonological sequences within each lexicon. The homophony ratio is Count (1) divided by Count (2). As the results in Table 1 show, languages differ considerably in the degree of homophony, with the highest and the lowest differing by an order of magnitude. We further examined whether the homophony ratio correlates qualitatively with phonological resources, based on data from the World Atlas of Language Structures Online (WALS). Table 2 summarizes the features of the phonological systems of these six languages, including consonant inventories (Maddieson, 2013a), vowel quality inventories (Maddieson, 2013b) and syllable structure (Maddieson, 2013c). The results in Table 2 confirm several trends in the expected direction: languages ranked higher in the table, that is, languages with higher homophony ratios, are more likely to have smaller phoneme inventories and simpler syllable structures, and thus more limited phonological resources. These trends, though providing no direct support for the causal claim in our hypothesis, do suggest a negative correlation between phonological resources and the degree of homophony across languages.
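The ratio computation just described can be sketched in a few lines. Below is a minimal Python sketch, assuming a lexicon represented as (word, phonological form) pairs; the function name and toy lexicon are ours, not from the study.

```python
from collections import Counter

def homophony_ratio(lexicon):
    """lexicon: iterable of (word, phonological_form) pairs.
    Count (1): phonological sequences shared by more than one lexical entry.
    Count (2): all distinct phonological sequences in the lexicon."""
    entries_per_form = Counter(phon for _word, phon in lexicon)
    homophonous = sum(1 for n in entries_per_form.values() if n > 1)
    return homophonous / len(entries_per_form)

# Toy lexicon: /hashi/ corresponds to two entries, so 1 of 3 distinct
# phonological forms is homophonous, giving a ratio of 1/3.
toy = [("bridge", "hashi"), ("chopsticks", "hashi"),
       ("cat", "neko"), ("dog", "inu")]
print(homophony_ratio(toy))  # → 0.3333...
```

In the actual analysis, the pairs would come from each cleaned CALLHOME lexicon after the exclusions described above.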
3. Homophone processing in an auditory lexical decision task. Given that there exist large differences in homophony ratios across languages, we proceed to investigate what the processing consequences of these differences might be. Theoretically, homophony is not a cost-free way of encoding meaning, as it violates the conventional one-to-one mapping between phonological forms and meanings (Slobin, 1973), and could thus easily give rise to ambiguity and confusion. Empirically, it has been shown that children find it difficult to learn a second meaning assigned to a phonological form that has already been assigned a meaning (e.g. Beveridge & Marsh, 1991; Doherty, 2004; Mazzocco, 1997). When English words are presented in their orthographic forms without any context, adults respond to homophones more slowly than to mono-meaning words in both lexical decision tasks (e.g. Burt & Jared, 2016; Pexman & Lupker, 1999; Pexman et al., 2001; Pexman et al., 2002; Rubenstein et al., 1971) and word naming tasks (e.g. Biedermann et al., 2009; Edwards et al., 2004). Responses to homophones were also more error-prone than those to mono-meaning words in semantic categorization tasks (e.g. Jared & Seidenberg, 1991; van Orden, 1987).
Inconsistencies between findings on homophone processing have mostly come from studies that examined homophone effects in Japanese and Mandarin. For example, while Tamaoka (2007) reported a homophone disadvantage in Japanese similar to that found in English in both visual lexical decision and word naming tasks, Hino and colleagues (2013) reported that such a disadvantage was limited to single-mate homophones; for homophones with more than one mate, participants on the contrary responded faster than they did for mono-meaning words. However, a recent study by Mizuno and Matsui (2016) challenged the findings of Hino and colleagues, showing that homophones in general had longer response latencies than mono-meaning words, and that multi-mate homophones had even longer latencies than single-mate homophones. Such inconsistencies also hold within Mandarin. Ziegler et al. (2000) first reported shorter latencies for single-character homophones in both lexical decision and word naming tasks. This advantage was strengthened when the homophone had more than one homophonous mate. These findings were replicated by Chen et al. (2009), but the opposite result, that homophones led to longer latencies in visual lexical decision, has also been reported in several recent studies (e.g. Chen et al., 2016; Wang et al., 2012). Agreement on the exact effect of homophones on word processing is thus far from being reached.
Notably, Japanese and Mandarin have the first and second highest homophony ratios in our corpus analysis. This suggests a possible account for the reported homophone advantage in processing: since there are high proportions of homophones in these two languages, native speakers will, on the one hand, have much experience with processing homophones in daily communication, and, on the other hand, have to resolve the ambiguity caused by homophones efficiently so that communication can proceed smoothly. However, before we can commit ourselves to this explanation, there are alternative explanations that are equally compelling and need to be addressed.
First, as suggested by Hino et al. (2013), while in English most homophones have only one homophonous mate, in Japanese (as in Mandarin) homophones are much more likely to have multiple homophonous mates. The phonological representation activated by the target homophone would thus send feedback activation to the orthographic forms of multiple homophonous mates. The summed activation at the orthographic level is then stronger than what can be achieved in English, and might be strong enough to exceed the threshold for a positive decision, since the task does not require settling on a single lexical item. Second, the orthographic system of Mandarin and the kanji orthographic system of Japanese, which were used in the previously discussed studies, are both logographic, which means that the orthographic forms of words in Mandarin and Japanese can directly indicate their meanings. The correspondence between orthographic forms and semantic meanings is thus closer in Japanese and Mandarin than in alphabetic languages like English. On this account, visually presented orthographic forms in these two languages would lead to little activation of their phonological representations, and hence of their homophonous mates, and would thus arouse little competition among homophonous items.
In the following section, we examine whether the effect of homophones on word processing is indeed different in Japanese and Mandarin from that in English. If cross-linguistic differences are found, we would also like to explore the possibility that such differences are a consequence of language-intrinsic differences in the proportion of homophones. We use an auditory lexical decision task to exclude potential confounds arising from differences in orthographic systems, and we include a direct measurement of the number of homophonous mates in our analyses to address the potential idiosyncrasies introduced by mate counts.
3.1. METHOD. Participants in this experiment were 18- to 30-year-old native speakers of American English, Mandarin and Japanese. Participants had no immersive experience with non-native languages (i.e. living and studying in an environment where a non-native language is consistently used) before adulthood.
The stimuli we used in the auditory task included 16 multi-mate homophones, 16 single-mate homophones, 32 mono-meaning words and 64 phonologically legal nonwords. Auditory stimuli were recordings of three female native speakers, one per language, reading the stimuli as bare words without sentential context. English and Mandarin stimuli were selected directly from CALLHOME, and Japanese stimuli were selected from those used by Hino et al. (2013). Frequency and neighborhood density were taken into consideration during stimulus selection. In this process, we noticed that homophones in English and Japanese have much higher frequencies than mono-meaning words, so we made the compromise of holding the pattern of frequency (multi-mate homophones ≈ single-mate homophones > mono-meaning words) consistent across languages, instead of holding frequency equal across word groups. A similar compromise was made for neighborhood density (multi-mate homophones > single-mate homophones > mono-meaning words ≈ nonce words) for the same reason. We also counted the exact number of homophonous words associated with the phonological representation of each stimulus (hereafter "NumMeaning").
The experiment was run using SuperLab 5. Participants were seated in front of a monitor, with their left index finger on the "z" key and their right index finger on the "/" key of the keyboard. When the experiment began, participants read a page of instructions and were then instructed to press any key to continue. All instructions were in the native language of the participant. Within each trial, an audio recording of a stimulus began playing through speakers as soon as the trial started. Participants were instructed to decide whether the sequence was an existing word in their native language, as fast and as accurately as possible. A positive decision was always made by pressing the key on the side of the dominant hand (right-handed: "/", left-handed: "z"), and a negative decision by pressing the key on the other side. The trial ended as soon as the participant pressed a key or the time limit (5 seconds) was reached. After each trial, participants were prompted on the monitor to press any key to proceed to the next trial. The order of stimuli was randomized across word groups for each participant. There were four training trials during which feedback was provided to help participants understand the task and become familiar with the procedure. No feedback was provided during the testing trials.
Before data analysis, we calculated the accuracy rate for each participant, and kept only the data from participants who reached an accuracy rate of 80% or higher. This criterion left us with 20 native English speakers, 30 native Mandarin speakers and 33 native Japanese speakers. We also excluded nonword trials and error trials from our analyses. We then excluded any response that took longer than 3000 ms, calculated the mean response latency for each participant, and excluded any remaining response that fell more than 2.5 standard deviations from that mean. The trimmed data were analyzed in R (R Core Team, 2015). We fitted a linear mixed-effects model to the data using the lme4 package (Bates et al., 2015), with logarithmically transformed response latency as the dependent variable; language, logarithmically transformed relative frequency, neighborhood density and homophony as fixed effects; the interaction between language and homophony; and random effects of participants and stimuli.
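The two-step latency trim described above (a 3000 ms cutoff followed by a per-participant 2.5 SD trim and log transform) can be sketched as follows; the function name and toy data are ours, not from the study.

```python
import math
from statistics import mean, stdev

def trim_latencies(latencies, cutoff_ms=3000, n_sd=2.5):
    """Trim one participant's correct-response latencies (ms):
    drop responses over cutoff_ms, then drop responses more than
    n_sd standard deviations from that participant's mean, and
    finally log-transform the survivors, as in the reported model."""
    under_cutoff = [rt for rt in latencies if rt <= cutoff_ms]
    m, sd = mean(under_cutoff), stdev(under_cutoff)
    kept = [rt for rt in under_cutoff if abs(rt - m) <= n_sd * sd]
    return [math.log(rt) for rt in kept]

# Toy latencies: 3500 ms exceeds the cutoff; 2900 ms then falls outside
# 2.5 SDs of the remaining distribution; the other ten responses survive.
rts = [780, 790, 795, 800, 805, 810, 815, 820, 790, 800, 2900, 3500]
trimmed = trim_latencies(rts)
print(len(trimmed))  # → 10
```

In the study this trim is applied per participant before the data enter the lme4 regression.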
3.2. RESULTS. We plotted the main effects of Homophony within each language and the interaction between Homophony and Language in Figure 1, using the ggplot2 package (Wickham, 2016) and the coefplot package (Lander, 2018). As shown in Figure 1, though none of the estimates of homophony reached significance, the estimated main effects of homophony in Mandarin and English were positive, suggesting that homophones tended to be associated with longer latencies. The estimated main effect in Japanese, on the other hand, was negative, suggesting that responses to homophones tended to be faster. The interaction was significant when comparing Japanese with English or with Mandarin, but not when comparing English with Mandarin (p = 0.737).
Next, we reran the regression with the binary variable Homophony replaced by NumMeaning, and plotted the main effects of NumMeaning within each language, as well as the interaction between NumMeaning and Language, in Figure 2. The most noticeable change after replacing Homophony with NumMeaning was that the latter showed significant effects on response latency: for words in English and in Mandarin, each additional meaning significantly delayed the lexical decision. The estimate was nevertheless almost indistinguishable from zero for words in Japanese. The results for the interaction term replicated those of the previous regression: the differences in the estimates of NumMeaning were significant between Japanese and English and between Japanese and Mandarin, but not between Mandarin and English (p = 0.428).
3.3. INTERIM DISCUSSION. This section reported an auditory lexical decision task that was used to examine the consequences of homophony. We replicated the homophone disadvantage in English that is well documented in the existing literature. Our results also provided evidence for a homophone disadvantage in Mandarin. For Japanese, however, our results showed no effect, or at best a weak advantage. The results for the interaction between language and homophony/NumMeaning suggested that the homophone effect in Japanese differs markedly from those in the other two languages, while the difference between English and Mandarin is very small.
Our findings within Mandarin and Japanese conflict with those existing studies that reported homophone advantages in either language (Chen et al., 2009; Hino et al., 2013; Ziegler et al., 2000). As those studies examined visual processing while ours examined auditory processing, we might attribute the inconsistency to the direct mapping between orthographic representations and semantic meanings in these two languages, which is mostly relevant in visual processing. Also, using direct measures of the number of homophonous mates, we found that additional meanings did not lead to an advantage in Japanese, and actually delayed lexical decision in Mandarin and English. This leads us to argue against the theory that having more items activated could help overcome the disadvantage (Hino et al., 2013; Ziegler et al., 2000). Our results support cross-linguistic differences between Japanese and the other two languages. This difference in homophone processing is strongly correlated with the intrinsic differences in homophone distribution across languages, and we have mostly ruled out the two major alternative explanations available.

4. Homophone processing in a word learning task. In a second experiment, we used a word-learning task in a visual world paradigm to further explore the cross-linguistic differences in homophone processing, with the hope of corroborating the findings of the auditory lexical decision task.

4.1. METHOD. Sixteen native English speakers, 16 native Mandarin Chinese speakers and 9 native Japanese speakers participated in the experiment.
We trained the participants to learn 8 novel homophones and 16 nonce words in their native language. Each novel homophone was built by assigning a novel object as a new meaning to an existing mono-meaning word, and each nonce word was built by assigning a novel object as a meaning to a non-existing phonological form, which is phonologically licit in all three languages. We came up with meaningful descriptions of those novel objects to help participants learn the novel words during training sessions. We also included 8 existing mono-meaning words as the baseline for analyses. All stimuli were non-reduplicated bisyllabic sequences. We used the same set of nonce words and controlled for the phonotactic structure of novel homophones and familiar words across the three languages.
Participants were seated in a dimly lit, sound-attenuated room facing a monitor. An SMI remote binocular eye tracker attached to the monitor was used together with iView X and SMI Experiment Center. After calibration and validation, visual instructions in the native language of each participant were presented before the experiment started. Throughout the experiment, participants were instructed to fixate on the center of the screen for 500 ms to proceed to the next trial. There were three training sessions in all. In each trial of the training sessions, participants heard a sentence describing the object shown at the center of the screen. The word for the object, which was either a novel homophone or a nonce word, was embedded in the sentence. The participant was instructed to memorize the word with the object as its meaning. A test session was given after each training session. In each trial of the test sessions, participants heard a novel homophone, a nonce word or a familiar word and simultaneously saw four images, one target and three competitors, each at a corner of the screen. Participants were instructed to look at the image that matched the word they heard. The order of trials was randomized within sessions. Each test trial lasted for 5000 ms.

4.2. RESULTS. We used the proportion of looking time (PLT) towards the target in the test sessions as the dependent variable for data analyses. This measure is computed by dividing the looking time towards the target by the total looking time.
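The PLT measure just defined can be sketched as follows, assuming each trial's gaze record is a list of (region, duration) samples; the function name, region labels and toy trial are ours, not from the study.

```python
def proportion_looking_time(fixations, target="target"):
    """fixations: (region, duration_ms) samples for one test trial.
    PLT = looking time on the target region / total looking time."""
    total = sum(d for _region, d in fixations)
    on_target = sum(d for region, d in fixations if region == target)
    return on_target / total if total else 0.0

# One hypothetical trial: 1200 ms on the target out of 2000 ms total.
trial = [("target", 1200), ("competitor1", 400),
         ("competitor2", 300), ("competitor3", 100)]
print(proportion_looking_time(trial))  # → 0.6
```

With four equally likely images per trial, chance-level PLT is 0.25, which is the baseline used in the analyses below.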
We first averaged the PLT over entire trials across stimuli within each target category (familiar words, novel homophones, nonce words) and performed t-tests to preliminarily examine the results of word learning. The results are illustrated in Figure 3.

Figure 3. PLT in each test session divided by target type and language
We plotted the portion of PLT above chance level (25%). In the first test session, the PLT towards novel homophone targets tended to be higher in Japanese (M = .587, SD = .095) than in English (M = .505, SD = .118; t = 1.897, df = 19.929, p = .072) and was significantly higher than in Mandarin (M = .463, SD = .146; t = 2.561, df = 22.346, p = .018), but there was no robust evidence for any difference between English and Mandarin (t = .886, df = 28.746, p = .383). In the second test session, while the PLT towards novel homophone targets greatly increased in all three language groups (Japanese: M = .687, SD = .046; Mandarin: M = .626, SD = .110; English: M = .659, SD = .078), the cross-linguistic differences decreased such that there was only a marginally significant difference between Japanese and Mandarin (t = 1.951, df = 21.762, p = .064), and the differences between Japanese and English (t = 1.127, df = 22.901, p = .271) and between English and Mandarin (t = .993, df = 27.013, p = .329) were not significant. After the participants finished all training sessions, their performance was mostly at ceiling, and no cross-linguistic difference was significant in the third test session (Japanese vs. English: t = .486, df = 14.356, p = .634; Japanese vs. Mandarin: t = 1.459, df = 19.819, p = .160; English vs. Mandarin: t = 1.264, df = 26.389, p = .217).
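The non-integer degrees of freedom in these comparisons (e.g. df = 19.929) indicate unequal-variance (Welch's) t-tests. A minimal sketch of that statistic, with a function name of our own:

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's unequal-variance two-sample t statistic, together with
    its Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

Applied to two groups' per-participant PLT scores, this yields the t and (fractional) df values of the form reported above; the p-values then come from the t distribution with that df.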
As the cross-linguistic differences in homophone learning were most pronounced in the first test session and faded quickly in subsequent sessions as participants reached ceiling, our growth-curve analyses (Mirman, 2014) focused on the results of the first test session. We also excluded the data of Mandarin-speaking participants, because the PLT towards familiar words in the Mandarin group was significantly lower than in the Japanese group (t = -2.154, df = 22.911, p = .042), while we expected participants from different language groups to be equally good at recognizing familiar words.
We ran the analysis using nonce words as the baseline, taking the difference in PLT between novel homophones and nonce words within each time frame, to diminish potential differences in learning ability across language groups. We fitted a linear mixed-effects regression model to the data using the lme4 package (Bates et al., 2015) in R (R Core Team, 2015), with the difference in PLT as the dependent variable; orthogonal linear and quadratic polynomials of the time variable (hereafter "Time Frame"; the eye tracker samples gaze position at 60 frames per second) and language as fixed effects; the interactions between language and the polynomials; and participants as the random effect. We discarded the data within the first second (before frame 62) and after 2.5 seconds (after frame 151), as participants' gaze hardly showed any movement within the first second and consistently fixated on the target after 2.5 seconds. The time-course curves of the difference in PLT between nonce words and novel homophones are plotted in Figure 4, and the results of the regression are shown in Table 3.

Table 3. Growth curve analysis of PLT towards novel homophones (baseline: nonce words)

The results showed a marginally significant fixed effect of language, suggesting that the PLT towards homophones over nonce words tended to be higher in the Japanese group than in the English group. Model comparisons also confirmed that adding the fixed effect of language significantly improved the goodness of fit (χ2(1) = 4.508, p = .034), though further adding the interactions did not (with the linear polynomial: χ2(1) = 0.237, p = .627; with both polynomials: χ2(1) = .011, p = .916).

5. General discussion. In this study, we first examined one of the potential consequences of the drastic differences in the phonological resources that different languages make available, namely, the varying distribution of homophony across languages.
Our corpus analyses showed that languages have homophony ratios that can differ by up to an order of magnitude, and that this cross-linguistic difference in homophony ratios is qualitatively correlated with differences in phonological resources. We then explored whether varying homophony ratios could lead to cross-linguistic differences in homophone auditory processing. In an auditory lexical decision task, we found that while homophony made processing difficult for native Mandarin speakers and native English speakers, the performance of native Japanese speakers seemed immune to the influence of homophonous mates. This cross-linguistic difference was further corroborated by our finding in a word-learning task that native Japanese speakers tended to learn artificially created homophones faster than native English speakers.

The auditory lexical decision task produced results that allow meaningful comparisons with previous results from visual lexical decision tasks. First, we would like to point out that the non-significant results in our first regression, where the homophone condition was coded as a binary variable, were not so surprising (see Clark (1973) for discussion of treating stimulus items as a random effect). Despite this lack of robustness, we still found evidence for significant homophone disadvantages among native Mandarin speakers and native English speakers, as well as significant interactions confirming convincing cross-linguistic differences in homophone effects between Japanese and the other two languages, when homophony was coded as the exact number of meanings associated with the phonological representation.
This contrast suggests that the homophone disadvantage in Mandarin and English is incremental, with each additionally activated lexical entry or semantic representation further intensifying the competition among lexical entries or semantic representations that are compatible with the phonological representation, thus causing additional difficulty to the whole process of lexical decision.
While our results replicated the English homophone disadvantage reported for visual processing in existing studies (e.g. Burt & Jared, 2016; Pexman & Lupker, 1999; Pexman et al., 2001; Pexman et al., 2002; Rubenstein et al., 1971), they also showed a disadvantage in Mandarin that conflicts with the advantage in visual processing found in several previous studies (e.g. Chen et al., 2009; Ziegler et al., 2000). A plausible explanation for this discrepancy is that the inconsistency is due to the close mappings between the orthographic forms and semantic meanings of Mandarin words. Closer consideration, however, reveals a gap: the close mappings could only explain why the processing of homophones should not be delayed, not why it should be fast, as close mappings also hold for mono-meaning words. As the findings on homophone effects have been largely inconsistent even within visual processing (e.g. Chen et al., 2016; Wang et al., 2012), further studies are needed to explore additional mechanisms, or systematic differences in the designs of existing studies, that could account for such disparities.
Our findings of a null effect, or at best a trend towards an advantage, in Japanese homophone processing weakly echoed the reported advantage in processing multi-mate homophones (Hino et al., 2013), but challenged the findings of a disadvantage in visual processing (Tamaoka, 2007; Mizuno & Matsui, 2016). As with Mandarin, the existing inconsistencies regarding homophone effects in visual word processing prevent us from drawing any conclusive argument for Japanese.
Nevertheless, this study, as perhaps the first attempt to directly compare homophone effects across three languages that differ significantly in the proportion of homophonous words, has provided evidence confirming cross-linguistic differences in homophone processing that correlate with differences in homophony ratio. Native Japanese speakers might not process homophones any faster than mono-meaning words, but they do not suffer from any delay in processing comparable to those found among native Mandarin speakers and native English speakers. The difference between native Japanese speakers and native English speakers was further corroborated by the finding from the word-learning task that Japanese speakers also learn artificially created homophones faster than English speakers do. We hypothesize that, as a consequence of the high homophony ratio, native Japanese speakers, on the one hand, are under pressure to process homophones fast to keep communication going, and on the other hand, have much more experience with resolving the processing difficulty brought about by homophones.
While the results provided convincing evidence for cross-linguistic differences in homophone processing, we note that the findings in this study are limited to the processing of isolated words with no context provided. Moreover, if we assume that the better performance of native Japanese speakers reported here results from their rich experience, we must also acknowledge that such experience is almost exclusively with processing homophones embedded in rich contexts; how that experience can be put to use in isolated word processing remains unknown. Therefore, our next step is to explore how the addition of linguistic context might alter findings on homophone processing within and across languages, and what relations may exist between isolated word processing and in-context word processing, that is, ambiguity resolution.
In summary, in this study we found that languages differ significantly in the proportion of words that are homophonous, and that such cross-linguistic differences are closely correlated with differences in the phonological resources that each language makes available. Furthermore, differences in homophony ratios strongly correlate with differences in homophone processing and learning across native speakers of different languages.