Corpus linguistics | Book Notices

Corpus linguistics: Readings in a widening discipline. Ed. by Geoffrey Sampson and Diana McCarthy. London: Continuum, 2005. Pp. xv, 524. ISBN 9780826488039. $60.

Reviewed by Carmela Chateau, Université de Bourgogne

Corpus linguists generally start their careers as linguists or computer scientists. Researchers from vastly different backgrounds will find this book of great assistance in learning more about the sources of the discipline, and it will prove invaluable for students just starting out as corpus linguists. This reader contains forty-two key texts in chronological order spanning fifty years. In addition to the general introduction to the volume, each paper has its own brief introduction setting it in historical context.

The first article predates electronic corpora: Charles Carpenter Fries (1952) used recordings of telephone conversations (about 250,000 words) to investigate the structure of English in use. The subcorpus of 72,000 words used by F. G. A. M. Aarts (1971) contained some spoken texts. Bengt Altenberg (1986) worked on spoken English, to chunk language naturally as part of a Text-to-Speech (TTS) program; Louis C.W. Pols et al. (1998) explored the use of authentic corpora to improve such programs. Peter C. Collins (1987) examined differences between spoken and written English using the Lancaster-Oslo-Bergen (LOB) and London-Lund one-million-word corpora, constructed along the lines of the Brown corpus presented by W. Nelson Francis (1965). Geoffrey Leech and Roger Fallon (1992) examined the cultural aspects revealed by the Brown (American) and LOB (British) comparable corpora.

Corpus construction was discussed by John Sinclair (1987), a key figure in the creation of the Collins Birmingham University International Language Database (COBUILD) Bank of English. Douglas Biber (1992) showed how statistics can be used to confirm the representativeness of a corpus. Statistics were brought into play by William Gale and Kenneth Church (1989) and by Peter F. Brown et al. (1990), investigating parallel corpora for machine translation. Jean Carletta (1996) suggested using the kappa statistic to assess interannotator reliability. Donald Hindle and Mats Rooth (1993) investigated parsing, finding that in some cases there could be no single correct answer.

The treebank approach to parsing corpora was presented by Mitchell P. Marcus et al. (1993). E.J. Briscoe and J.A. Carroll (1995) evaluated a probabilistic parser. The topic of treebanks was discussed in greater depth by Eugene Charniak (1996) and by Geoffrey Sampson (1999). Another approach to treebanks, for Czech, was presented by Alena Böhmová and Eva Hajičová (1999). A Swedish corpus was discussed by Staffan Hellberg (1991), and Anthony McEnery (2001) made the case for corpus research into nonindigenous minority languages (NIMLs). Estelle Campione and Jean Véronis (2001) presented spoken French corpora, semiautomatically tagged for intonation. Esther Grabe and Brechtje Post (2002) looked at Intonational Variation in English (IViE). Ossi Ihalainen (1991) also investigated a British dialect, while Jan Tent and France Mugler (1996) discussed the creation of the Fiji component of the International Corpus of English (ICE).

Douglas Biber and Edward Finegan (1987) examined English from a historical viewpoint, as did Matti Rissanen (1991). Various idiosyncratic aspects of spoken English were also investigated: Ingrid Kristine Hasund and Anna-Brita Stenström (1996) looked into female disputes; Anthony McEnery et al. (1998) focused on swearing; Christopher C. Werry (1996) examined Internet Relay Chat (IRC); David McKelvie (1998) studied disfluency; and Mark G. Core (1998) investigated the use of Dialog Act Markup in Several Layers (DAMSL) utterance tags to explore speech acts.

Gavin Burnage and Dominic Dunlop (1992) were involved in encoding the British National Corpus (BNC). Jean Carletta et al. (2000) used XML for linguistic annotation. L.W.M. Bod and R.J.H. Scha (1996) provided an overview of data-oriented language processing. Gill Francis (1993) and William Louw (1993) used the COBUILD corpus to produce a new, corpus-driven grammar of English and to investigate semantic prosody, respectively.

Corpora have also been used to produce dictionaries for language learners. Philip Resnik and David Yarowsky (1997) discussed word sense disambiguation. Patrick Hanks (1986) investigated meaning potentials. Kenji Kita et al. (1994) used corpora for the automatic extraction of collocations for language learning. Dieter Mindt (1996) investigated corpus linguistics and foreign-language learning. Kenneth Hyland and John Milton (1997) explored differences in native speakers’ and second-language learners’ writing. Finally, Adam Kilgarriff (2001) explored the twenty-first-century trend of using the web as a corpus.