Original Research

Automatic genre classification for Afrikaans

Dirk Snyman, Gerhard van Huyssteen, Walter Daelemans
Suid-Afrikaanse Tydskrif vir Natuurwetenskap en Tegnologie | Vol 33, No 1 | a759 | DOI: https://doi.org/10.4102/satnt.v33i1.759 | © 2014 Dirk Snyman, Gerhard van Huyssteen, Walter Daelemans | This work is licensed under CC Attribution 4.0
Submitted: 08 August 2013 | Published: 24 November 2014

About the author(s)

Dirk Snyman, Centre for Text Technology, North-West University, South Africa
Gerhard van Huyssteen, Centre for Text Technology, North-West University, South Africa
Walter Daelemans, Computational Linguistics and Psycholinguistics Research Group, University of Antwerpen, Belgium


Abstract

In text processing, metadata about a particular text plays an important role. Metadata is often generated by automatic text classification systems, which assign a text to one or more predefined classes or categories based on its contents. One of the dimensions along which a text can be classified is its genre. This study proposes the development of an automatic genre classification system in a resource-scarce environment. It aimed to investigate the techniques and approaches that are generally used for automatic genre classification, and to identify the best approach for Afrikaans (a resource-scarce language). When developing an automatic genre classification system, a set of variables must be considered because they influence the performance of machine learning approaches: the algorithm used, the amount of training data, and the representation of the data as features. If these variables are handled correctly, an optimal combination of them can be identified to develop a genre classification system successfully. In this article a genre classification system is developed using the following approach: a multinomial naive Bayes (MNB) algorithm with a bag-of-words feature set. This system achieves an F-score (performance measure) of 0.929.
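The approach named in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes whitespace tokenisation, add-one (Laplace) smoothing, and a tiny invented English toy corpus with two hypothetical genres; none of these details come from the article, and the names `tokenize`, `MultinomialNB` and `f_score` are illustrative only.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (assumed, deliberately simple)."""
    return text.lower().split()


class MultinomialNB:
    """Multinomial naive Bayes over bag-of-words counts, add-one smoothing."""

    def fit(self, texts, labels):
        self.class_doc_counts = Counter(labels)   # documents per genre
        self.word_counts = defaultdict(Counter)   # word counts per genre
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_doc_counts):
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.class_doc_counts[label] / self.total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for t in tokens:
                score += math.log((self.word_counts[label][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def f_score(gold, predicted, target):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == target and p == target for g, p in pairs)
    fp = sum(g != target and p == target for g, p in pairs)
    fn = sum(g == target and p != target for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented toy data (hypothetical genres "news" and "recipe").
train_texts = [
    "the minister announced a new budget today",
    "police reported the accident on the highway",
    "mix the flour and sugar then bake",
    "stir the sauce and add salt to taste",
]
train_labels = ["news", "news", "recipe", "recipe"]

model = MultinomialNB().fit(train_texts, train_labels)

test_texts = ["bake the flour and add sugar",
              "the minister reported a new accident"]
gold = ["recipe", "news"]
predicted = [model.predict(t) for t in test_texts]
print(predicted, f_score(gold, predicted, "recipe"))
```

On a real corpus the F-score would normally be computed on held-out data and averaged over all genres; the toy evaluation here only shows how the pieces fit together.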

Keywords

Genre classification, Resource-scarce languages, Machine learning, Human language technology, Natural language processing
