De re interpretation in belief reports—An experimental investigation

Determiner phrases (DPs) under intensional operators give rise to multiple interpretations, known as the de re/de dicto ambiguity. Formal theoretical approaches to modeling this ambiguity must rely on nuanced semantic judgments, but inconsistent judgments in the literature suggest that informal judgment collection may be insufficient. In addition, little is known about how these ambiguities are resolved in context and how preferences between the readings vary across contexts and individuals. We report three controlled experiments that systematize the collection of truth-value judgments for de re/de dicto readings. While the de dicto readings were robustly accepted by nearly all English speakers, de re readings exhibited strongly bimodal judgments, suggesting an inherent disagreement among speakers. In addition, the acceptability of de re readings was affected by the DP's internal structure as well as by idiosyncratic properties of the scenarios. More broadly, our experimental results lend support to the practice of including quantitative data collection within semantics.

cific person who is a prince in the actual world, even if she does not realize that he is the prince. What we will call here its de dicto reading is true when Aurora's desires include "marrying the prince", even if she is mistaken about who the prince may be.
In this paper, we examine current semantic judgment collection methods for formal theories of the de re/de dicto ambiguity. We report three novel experiments which highlight some advantages of obtaining quantitative judgments for readings with this ambiguity, focusing on definite DPs, given that they have already generated some judgment inconsistency in the literature. In the first section, we lay out the existing theoretical landscape and motivate the quantitative approach. In the second section, we report three controlled truth-value judgment experiments that tested whether the inconsistency also existed among multiple native English speakers. In the final section, we endeavor to account for the observed judgment variation.
1.1. A BRIEF OVERVIEW OF CURRENT THEORETICAL FRAMEWORKS. A traditional approach to the de re/de dicto distinction models it as a scope ambiguity: in the logical form, de re (but not de dicto) DPs outscope the intensional operator to obtain the reading assigned in the actual world, as in (1) (see Russell 1905, Fodor 1970, Cresswell & von Stechow 1982).
(1) Aurora wants to marry a prince.
de re LF:
While this theory can capture the distinction for simple (indefinite) DPs as in (1), when they appear under quantification the outscoped de re reading diverges nontrivially from the intended reading. In response, Percus (2000) proposes a solution using situation pronouns. Every verb phrase (VP) and noun phrase (NP) takes a situation pronoun as an unpronounced intensional variable that is bound by a higher lambda abstractor to get its world assignment. In this way, DPs can attain the de re interpretation via the logical binding of intensional variables while remaining in situ. However, situation pronouns overgenerate, predicting readings not attested in natural language. This led Keshet (2008, 2011) to propose split intensionality, a return to the traditional scope-based theory that employs a type-shifting operator associated with an intensional operator. Raising a DP above this operator but below the intensional operator not only makes the DP an intensional argument, but also assigns it a different world index from the matrix clause. Consequently, DPs can be interpreted de re once they are raised above the type-shifting operator while remaining in the scope of the intensional operator. However, neither the scope-based theories nor the situation pronoun theory can adequately account for the "multiple-guise" scenario raised by Quine (1956), and consequently the theory of de re readings has been enriched by the addition of concept generators (see (4) below), permitting de re readings without movement (Percus & Sauerland 2003, Anand 2006, Charlow & Sharvit 2014).
The de re/de dicto distinction illustrates how formal theories evolve in response to new linguistic observations. Importantly, these observations include not only the well-formedness of a sentence, but also the truth-value judgments that linguists offer for it given a corresponding scenario.
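As a point of reference for the scope-based analysis of (1) described at the start of this subsection, the two readings can be sketched in a textbook-style notation. This is an illustrative rendering under assumed notation, not necessarily the logical form used in the original presentation: w_0 stands for the actual world, Des_{a,w_0} for the set of worlds compatible with Aurora's desires in w_0, and a for Aurora.

```latex
% Assumed notation: w_0 = actual world; Des_{a,w_0} = worlds compatible with
% Aurora's desires in w_0; a = Aurora.
\begin{align*}
\text{de re:}\quad
  & \exists x\,\bigl[\mathrm{prince}_{w_0}(x) \wedge
    \forall w' \in \mathrm{Des}_{a,w_0}\colon \mathrm{marry}_{w'}(a, x)\bigr] \\
\text{de dicto:}\quad
  & \forall w' \in \mathrm{Des}_{a,w_0}\colon
    \exists x\,\bigl[\mathrm{prince}_{w'}(x) \wedge \mathrm{marry}_{w'}(a, x)\bigr]
\end{align*}
```

In the de re formula the existential outscopes the quantification over desire worlds and the predicate prince is evaluated at w_0; in the de dicto formula both the existential and prince are evaluated inside each desire world.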
Given the complexity of the data (especially the context/scenario) involved in de re/de dicto judgments, it is unsurprising that there is some judgment inconsistency across existing publications. We review some inconsistencies in the next section, and conclude that they call for a more consistent approach to judgment collection, one that generates, in Tonhauser and Matthewson's (2015) terms, "stable, replicable, and transparent" judgments for observations of the same kind in the collection of such layered semantic data.
1.2. THE NEED FOR QUANTITATIVE RESEARCH. Inconsistencies found in linguistics and relevant fields constitute our motivation for the current study. In the linguistic theory literature, there exist some inconsistent de re judgments about the same DP structure in nearly identical linguistic environments. For instance, von Fintel and Heim (2011) employ sentence (2) to argue that the DP your abstract exhibits genuine ambiguity of de re and de dicto interpretations, while Nelson (2019) argues that the target DP her brother in (3), which has the same internal structure as your abstract, cannot be interpreted de re.
(2) John believes that [your abstract]de re will be accepted.
de re truth condition: John reviewed an amazing abstract and thought that it would be accepted. The speaker of this belief report has the additional knowledge that the abstract was written by the addressee "you" and thus utters (2).
Nelson's reason for denying the de re reading is that the belief holder Sally does not conceive of the person (Sally's brother in the real world) as her brother. Nelson takes Sally's perspective and argues that the de re reading should not be true given the scenario, while von Fintel and Heim claim that de re readings are a regular natural language phenomenon. One may wonder whether Nelson's reasoning is representative of a dispreference for de re: to test this, it helps to gather judgment data across a wider variety of scenarios. Or, perhaps particular linguistic features, e.g., your vs. her, or inanimate vs. animate possessees, contribute to the unexpected judgment inconsistency. Or, perhaps individuals vary in their preferred readings when faced with ambiguity. Controlled scenarios, examples, and a broader population allow us to test these hypotheses.
Further inconsistency in the linguistics literature arises in more complex cases. Charlow and Sharvit (2014) note one such disagreement: while they claim that the possessee mother in (4) should be de dicto, they report that another linguist who works on the same phenomenon finds it more natural under a de re reading.
(4) John believes that every female student_i likes her_i mother.
Truth condition of a "multiple-guise" scenario: John comes into contact with every actual female student more than once, and each actual female student appears each time in a different guise. The same woman appears in two different guises and John fails to recognize this. He thinks he came into contact with two different women. Furthermore, in John's mind, the mapping between the different guises is one-to-one. The sentence is about a specific scenario in which John believes that a likes b's mother, c likes d's mother, and e likes f's mother when in reality, a = b, c = d, and e = f.
Outside of theoretical linguistics, real-life instances provide data that are occasionally unexpected given theoretical predictions. While it is claimed that cardinal DPs cannot be interpreted de re (Musan 1995, Keshet 2008, Romoli & Sudo 2009), sentence (5), an utterance collected at an economic conference reported on Language Log by Liberman (2005), suggests otherwise.
de re truth condition: There were 12 journalists killed by the U.S. forces in an attack but the forces did not know the people they killed were journalists. (Liberman on October 23, 2005)
Moreover, outside linguistics, researchers in related fields (i.e., philosophy, psychology, law) have claimed that the de re reading is easier to obtain in scenarios where both de re and de dicto are admitted, which is not (as far as we know) a claim that has been made within linguistics. In philosophy, Jaszczolt (1997) maintains that the directly referential property of definite noun phrases is more salient in communication and thus argues for a "default de re reading". Her perspective finds allies in cognitive science and developmental psychology. For example, when a participant and a character/protagonist in an experiment both know the identity of an object but the protagonist remains ignorant of certain properties of the object, both child and adult participants fail to restrict their description of the object to the properties already known by the protagonist when put into the protagonist's shoes (Mitchell et al. 1996, Apperly & Robinson 2003, Apperly & Butterfill 2009, Low & Watts 2013). These observations suggest an egocentrism or reality-bias explanation: the ease of accessing information about actual reality, as opposed to others' mental states, may lead one to expect something like a "default de re" hypothesis. This expectation is further bolstered in legal settings, where the focus during jury procedure on a literal interpretation of the defendant's action, rather than on the intention behind it, echoes the "default de re" claim (Anderson 2013).
Despite the observed judgment inconsistency in the linguistics literature and the "default de re" claim outside linguistics, there has been no experimental work that we are aware of that directly examines judgment preferences for de re/de dicto readings in a given scenario. While Hackl et al. (2009) studied transparent versus opaque readings in intensional transitive predicates using online reading times and gathered evidence supporting the scope-based theory over the situation pronoun approach, their finding (that QRed transparent DPs facilitate the processing of the following ACD site) does not directly address the judgment inconsistencies introduced above.
Fortunately, crowdsourcing techniques offer a systematic solution. By gathering offline judgments from multiple native speakers, we can detect whether the observed inconsistency results from idiosyncratic noise or inherent disagreement; by creating multiple scenarios to test a single phenomenon, we can confirm whether the judgment is robust to more variation. Crucially, by comparing the de re and de dicto readings of the same sentence under minimally different contexts, we can attempt a fair comparison of the judgment patterns and hopefully understand the factors involved in different preferences for these readings. In sum, there is good reason to believe that the de re/de dicto literature can benefit from more quantitative methods.
1.3. RESEARCH GOAL. We aim to set up a simple and efficient experimental template for systematically obtaining stable, replicable, and transparent judgments of de re/de dicto readings across carefully controlled scenarios. We focus on definite DPs since several of the examples above with questionable judgments are definite (e.g., your abstract, her mother), although we note that this definite non-de re reading differs from the traditional de dicto reading exhibited by indefinite DPs.
2. Experiment One.
Experiment 1 used highly controlled scenarios that permitted both de re and de dicto readings of definite DPs to probe the judgment patterns of native English speakers.
2.1. PARTICIPANTS. 120 adult native English speakers were recruited through Amazon's Mechanical Turk. They received $2 compensation for finishing the experiment.
2.2. MATERIALS AND DESIGN. In this experiment, participants read four written scenarios and gave their judgment of a target sentence based on each scenario (see the example in Table 1) 2.

CONTEXT
Julie is one of the judges of an ongoing poetry competition. The best poem that she has read so far is an extremely intriguing poem about the ocean. She believes that this poem will win the competition. Julie remembers being told that Nicole, one of the best-known poets, submitted a poem about the ocean to the competition. Therefore, Julie concludes that this poem must be written by Nicole and the first prize will be going to her. However, this poem was actually written by Elizabeth, a younger and lesser-known poet. It is just a coincidence that the two poets wrote about the same topic.
JUDGMENT QUESTION According to this story, please use the slider bar to indicate to what extent you agree or disagree with the following statement.

Target Sentence I: Julie believes that Elizabeth's poem will win the competition. (de re)
Target Sentence II: Julie believes that Nicole's poem will win the competition. (de dicto)

In each scenario, there were two terms that described the target object (e.g., poem). The protagonist (e.g., Julie) associated one term X (e.g., Nicole's poem) with the target object, but in reality X was not correct, and the correct descriptive term Y (e.g., Elizabeth's poem) was not known to the protagonist. If the wrong term X was used in reporting the protagonist's belief, a de dicto reading would emerge; if the correct term Y was used, a de re reading would emerge. Given this scenario, both readings were predicted to be true (e.g., Romoli & Sudo 2009).
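The prediction that both target sentences are true in the poetry scenario can be illustrated with a toy possible-worlds model. The sketch below is only an illustration under simplifying assumptions: the world labels, the two belief worlds, and the dictionary names are hypothetical, and a de re reading is modeled simply as fixing the referent in the actual world, which ignores the guise-related complications discussed in Section 1.1.

```python
# Toy possible-worlds model of the poetry scenario (all names are hypothetical).
ACTUAL = "w0"
BELIEF_WORLDS = ["w1", "w2"]  # worlds compatible with what Julie believes

# What each description picks out in each world (None = nothing fits).
DENOTATION = {
    "w0": {"Elizabeth's poem": "best_poem", "Nicole's poem": "nicole_poem"},
    "w1": {"Elizabeth's poem": None, "Nicole's poem": "best_poem"},
    "w2": {"Elizabeth's poem": "some_other_poem", "Nicole's poem": "best_poem"},
}

# The poem that wins in each of Julie's belief worlds.
WINNER = {"w1": "best_poem", "w2": "best_poem"}


def referent(description, world):
    """Return the object the description picks out in the given world."""
    return DENOTATION[world].get(description)


def de_re(description):
    """Fix the referent in the actual world, then check every belief world."""
    target = referent(description, ACTUAL)
    return target is not None and all(WINNER[w] == target for w in BELIEF_WORLDS)


def de_dicto(description):
    """Evaluate the description separately inside each belief world."""
    return all(
        referent(description, w) is not None and WINNER[w] == referent(description, w)
        for w in BELIEF_WORLDS
    )


print(de_re("Elizabeth's poem"))     # Target Sentence I, read de re
print(de_dicto("Nicole's poem"))     # Target Sentence II, read de dicto
print(de_dicto("Elizabeth's poem"))  # Target Sentence I, read de dicto
```

Under these assumptions, Target Sentence I comes out true only on its de re reading and Target Sentence II only on its de dicto reading, which is the contrast the two conditions are meant to isolate.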
In each scenario, the participants read one of the two target sentences (Target Sentence I or Target Sentence II, varied between participants) and dragged a slider bar to show the extent to which they agreed or disagreed. After the participants' decision, a numeric judgment score was recorded (from "highly agree" = 100 to "highly disagree" = −100). Three sanity check sentences were additionally provided in each scenario: one was definitely correct, one was definitely wrong, and the last was uncertain. Successful judgments on these sentences indicated that the participants were attentive and thus eligible for inclusion in the data analysis. The advantage of a slider bar is its greater sensitivity: it can reveal gradient judgments that would otherwise stay concealed due to the strong categorical implication of designs like a binary choice or a Likert scale (Marty et al. 2020).
Each participant read four scenarios and each scenario was coupled with four sentences for judgment elicitation. Two of the four scenarios were randomly chosen for the de re condition and the other two for the de dicto condition. The participants were randomly assigned to one of six lists created for a Latin square design. The order of the four stories was randomized, as was the order of the four sentences within each scenario. The entire survey was created on Qualtrics and distributed on Amazon's Mechanical Turk.
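One way to arrive at six lists is to enumerate every way of choosing two of the four scenarios for the de re condition. The sketch below illustrates that counting argument. It is an assumption about how the lists could be built, not a description of the actual list construction, and the scenario labels and function name are hypothetical.

```python
from itertools import combinations

# Hypothetical scenario labels.
SCENARIOS = ["a", "b", "c", "d"]


def build_lists(scenarios, n_de_re=2):
    """Return one condition assignment per way of choosing the de re scenarios."""
    lists = []
    for de_re_set in combinations(scenarios, n_de_re):
        assignment = {
            s: ("de re" if s in de_re_set else "de dicto") for s in scenarios
        }
        lists.append(assignment)
    return lists


LISTS = build_lists(SCENARIOS)
for i, assignment in enumerate(LISTS, start=1):
    print(i, assignment)
```

With four scenarios and two de re slots this yields six assignments, and each scenario appears in the de re condition in exactly half of them, so the two conditions are balanced across scenarios.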
2.3. RESULTS. We analyzed only the responses from participants who correctly judged the correct and incorrect sanity checks at least 75% of the time. 115 participants' data (95.8%) were retained for the analysis.
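The inclusion criterion can be sketched as a simple accuracy filter over the scored sanity checks. The code below is a minimal illustration under two assumptions not stated in the text: a check counts as correctly judged when the slider score has the expected sign, and each participant contributes eight scored checks (one "definitely correct" and one "definitely wrong" sentence in each of four scenarios). The function names and example data are hypothetical.

```python
def sanity_accuracy(checks):
    """checks: list of (expected_sign, slider_score); expected_sign is +1 or -1."""
    hits = sum(1 for expected, score in checks if expected * score > 0)
    return hits / len(checks)


def is_retained(checks, threshold=0.75):
    """Keep a participant whose sanity-check accuracy reaches the threshold."""
    return sanity_accuracy(checks) >= threshold


# Hypothetical participant: 6 of 8 checks judged in the expected direction.
example_checks = [
    (+1, 90), (-1, -80), (+1, 75), (-1, 40),
    (+1, 100), (-1, -95), (+1, -20), (-1, -60),
]

print(sanity_accuracy(example_checks), is_retained(example_checks))

# Retention rate reported above: 115 of 120 participants.
retention_rate = round(100 * 115 / 120, 1)
print(retention_rate)
```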
The histogram in Figure 1 shows the judgment distribution of de re/de dicto sentences across all scenarios. While judgments for de dicto readings overwhelmingly cluster toward the "highly agree" end, judgments for de re readings are bimodal: although more than half of the judgments fall on the agreement side, another sizable proportion lies at the "highly disagree" end.
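A quick way to see this kind of pattern in slider data is to compare the share of responses near the two ends of the scale with the share in the middle. The sketch below does this with invented scores. The cutoff of 50 and the data are hypothetical and are not taken from the experiment.

```python
from collections import Counter

BANDS = ("disagree end", "middle", "agree end")


def band(score, cutoff=50):
    """Assign a slider score in [-100, 100] to one of three bands."""
    if score <= -cutoff:
        return "disagree end"
    if score >= cutoff:
        return "agree end"
    return "middle"


def band_shares(scores, cutoff=50):
    """Return the proportion of scores falling in each band."""
    counts = Counter(band(s, cutoff) for s in scores)
    return {b: counts[b] / len(scores) for b in BANDS}


# Invented scores, for illustration only.
de_re_scores = [95, 100, 88, 70, 92, -90, -100, -85, 10, 97]
de_dicto_scores = [100, 96, 90, 85, 99, 100, 78, 93, 60, 100]

print(band_shares(de_re_scores))
print(band_shares(de_dicto_scores))
```

A bimodal pattern shows up as sizable shares at both ends with few responses in the middle band, whereas noise or uncertainty would inflate the middle band.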
We further analyzed the agreement proportion in each scenario, assuming it was appropriate to treat the continuous judgment as a binary variable given its categorical distribution. In Figure 2, the de dicto agreement rates are at ceiling for all scenarios, while de re judgments exhibit larger variability across scenarios and are uniformly lower (χ² = 79.13, df = 1, p < .001). The visual difference between the de re and de dicto conditions was confirmed by a mixed-effects logistic regression analysis. Treating the de re/de dicto condition and the scenario as sum-coded fixed effects with a random intercept for participants, we found that the de re trials were significantly less likely to be agreed with compared with a random trial (β = −1.61, SE = 0.23, p < .001); additionally, Scenario b had a significantly higher agreement rate (β = 0.96, SE = 0.30, p = .001) while Scenario c had a significantly lower one (β = −0.81, SE = 0.24, p < .001) 4.
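The binarization step and a test with one degree of freedom can be sketched as follows. The text does not say which χ² test was computed, so Pearson's test on a 2 × 2 table is used here as one possibility. The threshold at zero, the function names, and the counts are assumptions made for illustration, and the output does not reproduce the reported statistic.

```python
def binarize(score):
    """Treat a positive slider score as agreement (assumed threshold at 0)."""
    return score > 0


def contingency(trials):
    """trials: iterable of (condition, slider_score).
    Returns {condition: [n_agree, n_disagree]}."""
    table = {}
    for condition, score in trials:
        row = table.setdefault(condition, [0, 0])
        row[0 if binarize(score) else 1] += 1
    return table


def pearson_chi2_2x2(table, row1, row2):
    """Pearson chi-square (no continuity correction) for a 2 x 2 table."""
    a, b = table[row1]
    c, d = table[row2]
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Invented trials: 45/5 agree/disagree for de dicto, 30/20 for de re.
trials = (
    [("de dicto", 80)] * 45 + [("de dicto", -80)] * 5
    + [("de re", 80)] * 30 + [("de re", -80)] * 20
)

table = contingency(trials)
chi2 = pearson_chi2_2x2(table, "de dicto", "de re")
print(table, chi2)
```

This simple test ignores the repeated measures per participant, which is one reason to follow it up with a mixed-effects model.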
Additionally, we explored whether groups of participants differed in their judgment behavior, as shown in Table 2. A preponderance of participants agreed with both de dicto trials, while the judgment behavior for de re fell into three representative groups, suggesting that an inherent disagreement among speakers, an unplanned scenario effect, or both may contribute to this distinctive participant behavior in interpreting de re. 5
2.4. DISCUSSION. By setting up scenarios that theoretically allow both de re and de dicto readings of definite DPs and eliciting native speakers' judgments, we found that while de dicto readings were unanimously available to participants, de re readings exhibited bimodal judgments with larger variation across scenarios and speakers. The sizable disagreement proportion and bimodal pattern suggest systematicity within the previously observed inconsistency in the literature.

3. Experiments Two and Three.
While Experiment 1 probed the judgment distribution for de re and de dicto readings of definite DPs (in particular, possessive constructions) in relatively simple sentences, Experiments 2 and 3 asked whether the judgment disparity extends to other DP structures or more complex sentences. With this motivation, we studied the nuanced case of bound de re observed in Charlow and Sharvit (2014) for sentences like John believes that every female student_i likes her_i mother (above in (4)). The crucial bound de re reading assigns the QNP every female student and the possessive pronoun her a de re reading. The reading of mother is less critical for theoretical choices, but given that these sentences were reported to have inconsistent judgments, we decided it was also worth investigating.
3.1. PARTICIPANTS. 160 participants in Experiment 2 and 128 in Experiment 3 completed the tasks for $2 on Amazon's Mechanical Turk. After applying the same filter as in Experiment 1, 127 participants (79.38%) in Experiment 2 and 120 (93.75%) in Experiment 3 contributed to the analysis 6.
3.2. METHODS. We treated the de re/de dicto reading of the QNP as a between-experiments manipulation (de re in Experiment 2 and de dicto in Experiment 3) so that the possessive pronoun and the possessee, serving as two within-subjects manipulations, took either de re or de dicto readings within one experiment. Focusing on a 2 × 2 within-subjects manipulation within an experiment prevented participants from reading more than four complex scenarios and getting fatigued. The general design was nearly the same as in Experiment 1. The most significant difference is that while Experiment 1 had each scenario allow both de re and de dicto readings, Experiments 2 and 3 were designed such that each scenario supported (i.e., made true) one reading, with the target sentence held constant and the scenario manipulated across conditions. There were also illustrative pictures to facilitate processing (see an example in Table 3). Furthermore, Experiments 2 and 3 had the same randomization, counterbalancing, and filler design as Experiment 1.
5 Thanks to Alexander Göbel for pointing to this inter-speaker investigation.
6 The lower inclusion rate in Experiment 2 was because those participants were recruited on weekends, when it is more challenging to gather good data via online implementation.

CONTEXT
As a photographer, John likes to rearrange his collections of photographs. One day, he encounters two sets of photos. In the first set, there are three ladies and each is holding a baby. John naturally believes that each of the babies is being held by their mother.
In the second set, three young adults are each wearing a T-shirt with a "2018" logo. John naturally believes that they were graduating students in the year 2018. John also notices that, interestingly, each of the young adults shares a similar smile to one of the ladies in the first photo set. He tries to recall if there is a connection between the young adults and the ladies but memory fails him.
As a matter of fact, what John fails to recall is three pieces of information. (1) The young adults in the second set of photos were the babies in the first set. They've grown up! (2) The ladies in the first set are actually the babies' grandmother. The three young adults inherit their smile from their grandma who is mistakenly believed by John to be their mother. (3) The second set of photos were taken not in the graduation ceremony but when the three adults were volunteering for an academic conference in 2018.
Despite the fact that John doesn't remember the correct relationship between the ladies in the first set of photos and the young adults in the second set and that he has incorrect information, John spends some time appreciating these photos.
JUDGMENT QUESTION According to this story, please use the slider bar to indicate to what extent you agree or disagree with the following statement.
Target sentence: Looking at his photos, John believes that [every conference volunteer]de re in the second set has the same smile as [their]de re [mother]de dicto.

3.3. RESULTS. Figure 3 shows that, consistent with Experiment 1, judgments tend to cluster at both ends of the scale. Crucially, there is still a salient proportion of disagreement for the bound de re case compared with the "control" condition where all three nominal constructions were de dicto. Figure 4 shows the agreement rate of the eight conditions in eight columns. The second column represents the canonical bound de re condition, whose agreement rate is slightly above chance level. Overall, the de re condition of the possessee leads to a lower agreement rate than the de dicto condition (χ² = 23.63, df = 1, p < .001). Figure 5 displays the agreement rates by condition and scenario. A visual examination shows a clear scenario effect: the agreement rate is conspicuously lower in Scenario 4, whose peculiarity was neither expected nor designed.
The effects of the de re/de dicto manipulation were further analyzed via a logistic mixed-effects model. The maximal model had one random intercept for participants and three fixed-effect variables indicating the de re/de dicto assignment of the three nominal terms. The fourth fixed-effect variable was the story, and the model also included an interaction term between the story and each of the three nominal terms 7. All the fixed effects were sum-coded.
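The maximal model described above can be written out as follows. This is a sketch in assumed notation rather than the specification actually fitted: Q, P, and N are sum-coded indicators (de re vs. de dicto) for the QNP, the possessive pronoun, and the possessee on trial i, the S_{is} are sum-coded story contrasts, and u_j is the random intercept for participant j.

```latex
% Assumed notation: Q_i, P_i, N_i = sum-coded de re/de dicto indicators for the
% QNP, possessive pronoun, and possessee; S_{is} = sum-coded story contrasts;
% u_j = by-participant random intercept.
\begin{align*}
\operatorname{logit} P(\mathrm{agree}_{ij} = 1)
  &= \beta_0 + \beta_Q Q_i + \beta_P P_i + \beta_N N_i
     + \sum_{s} \gamma_s S_{is} \\
  &\quad + \sum_{s} \bigl(\delta^{Q}_{s} Q_i + \delta^{P}_{s} P_i
     + \delta^{N}_{s} N_i\bigr) S_{is} + u_j, \\
u_j &\sim \mathcal{N}(0, \sigma_u^2)
\end{align*}
```

With sum coding, each main-effect coefficient is read as a deviation from the grand mean on the log-odds scale rather than as a contrast with a reference level.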

3.4. DISCUSSION. Experiments 2 and 3 tested judgments of de re and de dicto readings in bound de re type sentences and found that the canonical bound de re structure ([QNP]de re, [Possessive pronoun]de re, [Possessee]de dicto) was agreed with more than 50% of the time, but at the same time (like de re readings generally) drew salient disagreement. In general, Experiments 2 and 3 replicated the finding of Experiment 1 in showing that de re readings lead to bimodal judgments. An additional result is that the agreement rate of de re appears to depend on the DP's internal structure and/or position: the effect of the de re/de dicto variation on the agreement rate was significant for the possessive pronouns and possessees but nearly negligible for QNPs. Furthermore, the judgments were also affected by specific scenarios (e.g., the peculiar Scenario 4).
Furthermore, the salient difference between de re and de dicto readings of the possessee does not support the challenge that a de re possessee is more natural, but rather is in line with Charlow and Sharvit's (2014) main argument. The lack of effect for QNPs echoes numerous observations in theoretical work that both de re and de dicto readings for QNPs are felicitous (e.g., Mary 1978, Keshet 2008, Romoli & Sudo 2009). Going back to the claim of bound de re in Charlow and Sharvit (2014), these two experiments suggest that (a) bound de re does exist for many speakers, but also that (b) not everyone agrees.

4. Conclusion and Discussion.
In a series of three experiments, we provided some quantitative evidence that there is a truth-value judgment disparity between de re and de dicto readings of definite DPs, both in simple attitude reports and more complex ones. In particular, we showed that there is a systematic inconsistency for de re judgments: the bimodal distribution is far from the uniform or normal distribution, and we note that this pattern of bimodal agreement would have been impossible to detect without quantitative methods that used response options more sensitive than a binary true/false (here, we used a slider bar). This inconsistency occurred not only in Experiment 1 where the scenarios admitted both readings, but also in Experiments 2 and 3 where the complex scenarios were controlled to admit only a single interpretation: even when it was the only one supported/true in the given scenario, de re readings had bimodal acceptance while de dicto readings were overwhelmingly accepted. Other relevant factors that also appear to affect judgments include the internal structure of the DPs (e.g., possessive, quantificational, etc.) and features of the idiosyncratic scenarios.
While de re readings of combinations of other types of DP structures (including, crucially, indefinite DPs) and other kinds of attitude reports and intensional operators await testing, it is worth considering what may have led to disagreements in Experiments 1 to 3. One cause might be that participants possess different grammars or dialects, and one variety disallows the de re reading. This concerns grammatical variation, and without information about the participants' linguistic profiles, the claim remains speculative. Another cause might be related to the scenario setup, where the juxtaposition of de re and de dicto terms in the written scenario enhances participants' sensitivity to which term is used for reference in which possible world. The contrastive information evaluated in two parallel worlds could be well tracked by the participants, and thus when their incremental comprehension starts from, for example, Julie believes that…, there is a chance that they attend only to what Julie believes and subsequently to descriptive terms held true in Julie's belief world. A de dicto DP naturally matches what the belief holder expects and thus is highly agreed upon, while encountering a de re DP whose referential relation to the entity is not held in Julie's mind could raise disagreement (as is the case in example (3) raised by Nelson (2019)). This sensitivity to the contrast also relates to Theory of Mind ability and reasoning with perspective shifting (Wimmer & Perner 1983, Apperly & Butterfill 2009, Low & Watts 2013). For further investigation along these lines, it may be informative to test experimentally whether the contrast between de re and de dicto terms in the scenario influences de re judgments. That is, if there is no contrastive pair such as Nicole's poem vs. Elizabeth's poem in a scenario but just one de re term unknown to the belief holder, what might the de re agreement rate be?
If agreement increases, then we may conclude that the contrastive information contributes to the high proportion of disagreement with de re.
To conclude, our findings highlight the value of including quantitative methods as the basis for theoretical work, especially when the linguistic observation in question raises inconsistent judgments, as in the case of de re readings for definite DPs. The essential advantage of quantitative research is that with multiple speakers, multiple scenarios, and controlled manipulation, it is possible to detect whether an inconsistency observed in a limited set of cases (e.g., sentences (2) and (3)) is noise or true disagreement, and whether a preference for one reading is due to a minor contrast/preference or to grammatical unavailability. Here, we have uncovered evidence that inconsistencies about the de re readings were not due to noise or uncertainty among participants (which would lead to more intermediate agreement responses) but rather reflected systematic bimodal judgments that were affected by scenario- and DP-specific factors. We speculate that our findings may be especially enlightening in the case of known semantic/scope ambiguities. What, then, to do with such results is an important question for theoreticians. We end by noting that these results did not even require researchers to create a massive number of scenarios to uncover these patterns: in Experiments 1 to 3, a mere four scenarios (just "a little bit experimental" in Davidson's (2020) term) were enough to observe this systematic disagreement, which held across each of the experiments. Lastly, we hope that this work will lead not just to more work along the de re/de dicto line, but also contribute to the growing field of Experiments in Linguistic Meaning (ELM).