Speech tested for Zipfian fit using rigorous statistical techniques

Paul De Palma, Leon Antonio Garcia-Camargo, Jeb Kilfoyle, Mark Vandam, Joseph Stover


Zipf’s law describes the relationship between the frequencies of words in a corpus and their ranks. Its most basic form is a simple series, indicating that the frequency of a word is inversely proportional to its rank:

1, 1/2, 1/3, 1/4, ...
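
As a minimal sketch of what this series means in practice, the following Python snippet ranks the words of a token list by frequency and prints each observed count next to the count an idealized 1/rank law would predict from the top word's count. The function names and the toy token list are hypothetical, invented for illustration; they are not taken from the paper or from the Zipf Tool Kit.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending count; break ties alphabetically so output is stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]

def ideal_zipf(top_count, rank):
    """Count predicted at a given rank if frequency is proportional to 1/rank."""
    return top_count / rank

# Toy token list, invented for illustration only (not corpus data).
tokens = ("the " * 12 + "of " * 6 + "and " * 4 + "to " * 3).split()

table = rank_frequency(tokens)
top_count = table[0][2]
for rank, word, count in table:
    print(rank, word, count, ideal_zipf(top_count, rank))
```

In this contrived example the observed counts match the idealized prediction exactly; real corpora deviate from it, which is why a formal goodness-of-fit test is needed.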

The past two decades have seen the emergence of usage-based and cognitive approaches to language study. A key observation of these approaches, along with the importance of frequency, is that speech differs in substantial and structural ways from writing. Yet, except for a few older analyses performed on very small corpora, most studies of Zipf’s law have been done on written corpora. Further, the judgement of Zipfianness in much of this work is based on loose and informal criteria. In recent years, however, sophisticated statistical techniques for curve fitting have been developed in the mathematics and physics literature. These include the use of the Kolmogorov-Smirnov statistic, along with maximum likelihood estimation, to generate p-values, and the use of the complementary error function for normal distributions. The latter helps determine whether a corpus that fails a Zipfian fit might be better described by another distribution. In this paper, we will:

  • Show, using rigorous statistical techniques, that three corpora of recorded speech (Buckeye, Santa Barbara, and MiCase) follow a power law distribution.

  • Describe preliminary results showing that the techniques outlined in this paper may be useful in the diagnosis of conditions that can involve disordered speech.

  • Explain how to do the analyses described in this paper.

  • Explain how to download and use the R/Python code we have written and packaged as the Zipf Tool Kit.
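
The general shape of such an analysis (maximum likelihood estimation of the exponent, a Kolmogorov-Smirnov statistic, and a p-value) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the Zipf Tool Kit and not necessarily the authors' procedure: it assumes a truncated Zipf model over ranks 1..N, estimates the exponent by golden-section search on the log-likelihood, and obtains the p-value by a parametric bootstrap in which each simulated sample is re-ranked and refitted. All function names and the example counts are hypothetical.

```python
import math
import random
from collections import Counter

def zipf_pmf(s, n):
    """Probabilities of ranks 1..n under P(r) proportional to r**(-s)."""
    weights = [r ** (-s) for r in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def log_likelihood(s, counts):
    """Log-likelihood of rank-ordered counts under the truncated Zipf model."""
    log_norm = math.log(sum(r ** (-s) for r in range(1, len(counts) + 1)))
    return sum(c * (-s * math.log(r) - log_norm)
               for r, c in enumerate(counts, start=1))

def fit_exponent(counts, lo=0.01, hi=5.0, tol=1e-6):
    """Maximum likelihood estimate of s via golden-section search on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        # The log-likelihood is concave in s, so discard the worse side.
        if log_likelihood(c, counts) > log_likelihood(d, counts):
            b = d
        else:
            a = c
    return (a + b) / 2

def ks_statistic(counts, s):
    """Largest gap between the empirical and model CDFs over ranks."""
    total = sum(counts)
    emp_cdf = mod_cdf = gap = 0.0
    for c, p in zip(counts, zipf_pmf(s, len(counts))):
        emp_cdf += c / total
        mod_cdf += p
        gap = max(gap, abs(emp_cdf - mod_cdf))
    return gap

def bootstrap_p_value(counts, n_sims=200, seed=0):
    """Return (s_hat, KS statistic, p-value) from a parametric bootstrap."""
    rng = random.Random(seed)
    s_hat = fit_exponent(counts)
    observed = ks_statistic(counts, s_hat)
    pmf = zipf_pmf(s_hat, len(counts))
    exceed = 0
    for _ in range(n_sims):
        # Draw a synthetic sample from the fitted model, then re-rank and refit.
        sample = rng.choices(range(len(counts)), weights=pmf, k=sum(counts))
        sim = sorted(Counter(sample).values(), reverse=True)
        if ks_statistic(sim, fit_exponent(sim)) >= observed:
            exceed += 1
    return s_hat, observed, exceed / n_sims

# Hypothetical rank-ordered word counts, invented for illustration only.
counts = [1200, 600, 400, 300, 240, 200, 171, 150, 133, 120]
s_hat, ks, p = bootstrap_p_value(counts, n_sims=100)
print(s_hat, ks, p)
```

Under this convention, a small p-value suggests the Zipf model is a poor description of the data, while a large one means the model cannot be rejected. Re-ranking each simulated sample is a simplifying assumption of this sketch; a production analysis would need to consider how ranks are assigned more carefully.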


Keywords: computational linguistics; Zipf; ASD



DOI: https://doi.org/10.3765/plsa.v6i1.4975

Copyright (c) 2021 Paul Anthony De Palma, Leon Antonio Garcia-Camargo, Jeb Kilfoyle, Mark Vandam, Joseph Stover

This work is licensed under a Creative Commons Attribution 4.0 International License.


Linguistic Society of America

Advancing the Scientific Study of Language since 1924

ISSN (online): 2473-8689
