Language ID for a Thousand Languages

Fei Xia; Carrie Lewis; William Lewis

doi:10.3765/exabs.v0i0.504

Authors

Fei Xia, University of Washington
Carrie Lewis
William Lewis, Microsoft Research

DOI:

https://doi.org/10.3765/exabs.v0i0.504

Abstract

ODIN, the Online Database of INterlinear text, is a resource built over language data harvested from linguistic documents (Lewis, 2006). It currently holds approximately 190,000 instances of Interlinear Glossed Text (IGT) from over 1100 languages, automatically extracted from nearly 3000 documents crawled from the Web. A crucial step in building ODIN is identifying the languages of extracted IGT, a challenging task due to the large number of languages and the lack of training data. We demonstrate that a coreference approach to the language ID task significantly outperforms existing algorithms as it provides an elegant solution to the unseen language problem. We also discuss several issues that make automated Language ID and the maintenance of ODIN very difficult.

Author Biography

Fei Xia, University of Washington

Assistant Professor, Department of Linguistics, University of Washington, PO Box 354340, Seattle, WA 98195
