Linguistic Annotations for a Diachronic Corpus of German

Erhard Hinrichs, Thomas Zastrow

Abstract


This paper describes the TüBa-D/DC, a diachronic corpus of German that draws on selected materials from the German Gutenberg Project and enriches them with several layers of linguistic annotation, including part-of-speech tags, lemmata, and constituent structure. The annotation is performed automatically with statistical tools trained on data from the Tübinger Baumbank des Deutschen (TüBa-D/Z). To assess annotation quality, the POS tagging is evaluated on a sample of texts ranging from the 13th to the 20th century. The paper concludes with a description of the three query mechanisms provided for the user.

Keywords


treebank; German; linguistic annotation
