The Linguistic Status of Predictions and Feature Ranks from SVM Text Classifiers

Jonathan Dunn


Text classification systems can predict certain characteristics of a text’s author (e.g., gender and age) using only linguistic properties. This paper asks why such predictions are possible and how they can be interpreted. Three factors bear on the answer: (1) the nature of the features used by the system; (2) the robustness of the predictions across time and genres; (3) the amount of data required for training and testing. Some classification predictions (e.g., gender) are based on non-content linguistic material that generalizes across time and genre. These classifications are characterized by stable performance and feature ranks, and permit linguistic interpretation.
