Bootstrapping the Development of an HPSG-based Treebank for Persian

Masood Ghayoomi


In this paper, we describe ongoing research on developing an HPSG-based treebank for Persian. To this end, we use a bootstrapping approach to data annotation. In the first step, a set of seed rules is defined as regular expressions in the CLaRK system. The data is then shallow-processed with this set of rules. In the next step, a human annotator completes the annotation of the sentences manually. To increase the share of automatic annotation, we extract the manually applied rules and iteratively augment the seed rules with those applied most frequently in the manual annotation. Our experiment in building the Persian treebank, which currently contains 1000 sentences, shows that the proposed method reduces human intervention from 74.05% in the first iterations to 39.01% in the last iterations.
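
The bootstrapping cycle described above can be illustrated with a minimal Python sketch. It is not the CLaRK implementation: the tag names, the rule format (a regular expression over a space-separated tag string paired with a phrase label), the function names shallow_parse and promote_frequent, and the min_count threshold are all assumptions made for illustration only.

```python
import re
from collections import Counter

# A rule is a (regex over a space-separated tag string, replacement label) pair.
# This format is a hypothetical stand-in for CLaRK's regular-expression grammars.

def shallow_parse(tags, rules):
    """Apply each rule once, in order, to the tag sequence of one sentence."""
    text = " ".join(tags)
    for pattern, label in rules:
        text = re.sub(pattern, label, text)
    return text.split()

def promote_frequent(manual_rules, seed_rules, min_count=2):
    """Append rules the annotator applied at least min_count times to the seed set."""
    counts = Counter(manual_rules)
    known = set(seed_rules)
    promoted = [rule for rule, n in counts.most_common()
                if n >= min_count and rule not in known]
    return seed_rules + promoted

# Illustrative data (hypothetical tags and rules, not taken from the treebank).
seed = [(r"\bDET N\b", "NP")]
manual_log = [(r"\bNP V\b", "S"), (r"\bNP V\b", "S"), (r"\bP NP\b", "PP")]
augmented = promote_frequent(manual_log, seed)
```

In this sketch, shallow_parse(["DET", "N", "V"], seed) leaves the verb unattached for the annotator, whereas the same call with augmented also applies the promoted rule, so less of the structure has to be built by hand in the next iteration.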


HPSG; Persian; CLaRK; annotation

