Corpora of Non-Linguistic Symbol Systems

Katherine Wu, Jennifer Solman, Ruth Linehan, Richard Sproat


The popular press has promoted recent claims that statistical methods can
distinguish writing from non-linguistic symbols. One problem with such claims,
though, is the dearth of non-linguistic symbol "texts" that could be compared
with written language. This project fills that void by developing electronic
corpora of known non-linguistic systems. To date we have developed corpora of
several systems: heraldry; totem poles; Mesopotamian deity symbols; Vinča
symbols; Pictish symbols; mathematical equations; weather icons. Corpus sizes
range from several hundred to several tens of thousands of symbols. All corpora
are encoded in XML and will be released under an open-source license.

