The Linguistic Status of Predictions and Feature Ranks from SVM Text Classifiers
Text classification systems can predict certain characteristics of a text's author (e.g., gender and age) using only linguistic properties of the text. This paper asks why such predictions are possible and how they can be interpreted. Three factors bear on these questions: (1) the nature of the features used by the system; (2) the robustness of the predictions across time and genre; (3) the amount of data required for training and testing. Some classification predictions (e.g., gender) are based on non-content linguistic material that generalizes across time and genre. These classifications are characterized by stable performance and stable feature ranks, and they permit linguistic interpretation.
Published by the LSA with permission of the author(s) under a CC BY 4.0 license.