C-ORAL-ROM: Integrated reference corpora for spoken Romance languages. Ed. by Emanuela Cresti and Massimo Moneglia. (Studies in corpus linguistics 15.)
C-ORAL-ROM presents corpora of spontaneous speech of French, Italian, Portuguese, and Spanish collected in
Ch. 1 deals with the C-ORAL-ROM resource in general. The corpora consist of approximately 300,000 words for each of the four languages and include recordings and texts from a wide variety of contexts, genres, and dialogue structures. Available on the accompanying DVD and through the ELDA Catalogue (http://www.elda.fr), the corpora are presented in a multimedia format that includes both textual and acoustic information. The textual information, which follows the CHAT format (MacWhinney 1994), is prosodically tagged and annotated for part of speech. A key feature of C-ORAL-ROM is text-to-speech alignment, which depends on the identification of each utterance in the resource through prosodic cues. In total, the resource provides text-to-speech synchronization for roughly 130 hours of spontaneous speech.
Chs. 2–5 focus on the subcorpora for each language, and Ch. 6 provides some discussion of the important role of the utterance—defined as an ‘expression marked by a prosodic terminal break’ (210)—in speech-corpora analysis. Finally, the appendix briefly presents the results from the external evaluation of the prosodic annotation utilized in the project.
The DVD offers several tools. The corpus metadata provides metalinguistic information for each language sample. Glossaries are included for Italian regional forms and Spanish nonstandard forms. Text-to-speech alignment is provided through a demo version of the WinPitch Corpus (© Philippe Martin), in which recordings can be listened to and analyzed acoustically with the help of waveforms, spectrograms, and pitch tracking. This is especially helpful for prosodic analysis. A text search engine is also provided, through a demo version of Contextes (1.1.0) (© Jean Véronis). Every match returned for a word or lemma search includes a partial context, and the text in which the match appears can be loaded with a single click. Frequency lists for words and lemmas for each of the subcorpora are also included on the DVD, together with tables and comparative diagrams of relevant linguistic measures and strategies in each language.
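For readers unfamiliar with this kind of search, the following minimal Python sketch illustrates the general idea of a word search that returns each match with a partial context, along with a simple word frequency list. It is not the Contextes implementation and does not read the C-ORAL-ROM files; the function names (`tokenize`, `kwic`, `frequency_list`) and the sample sentence are invented here for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercased word tokens (accented letters are kept)."""
    return re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text.lower())


def kwic(tokens, target, window=3):
    """Return (left context, match, right context) for each hit on `target`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


# Invented sample sentence, not taken from the corpus.
sample = "allora la casa è grande ma la casa di Maria è piccola"
tokens = tokenize(sample)

for left, match, right in kwic(tokens, "casa", window=2):
    print(f"{left:>12} [{match}] {right}")
```

A lemma search would work the same way once each token is paired with its lemma, which the sketch above does not attempt.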
Overall, this is a great resource for researchers in Romance linguistics, corpus linguistics, syntax, second language acquisition, and speech and prosody research. The DVD and the tools included on it are quite straightforward to operate. The exception is the WinPitch Corpus, for which a troubleshooting section and additional information on its operation would be a welcome addition. An online tutorial for this program is announced at http://lablita.dit.unifi.it/coralrom/. Finally, it is unfortunate that one of the key options in Contextes, playing the context for each match returned through the search function, is not supported in the demo version distributed on the DVD.