Text Classification of News Articles Using Machine Learning on Low-resourced Language: Tigrigna
Published in 3rd International Conference on Artificial Intelligence and Big Data (ICAIBD), 2020
Text categorization, the task of assigning a textual document to its most relevant label, is becoming increasingly significant. However, not all languages have seen comparable growth in textual resources; in the absence of a freely available dataset, text categorization is an interesting problem for Tigrigna, a low-resourced language. Our aim is to assign a given document to its category based on its linguistic features. To achieve this goal, we constructed a new dataset from different Tigrigna news sources. The dataset has six main categories: Agriculture, Sports, Health, Education, Religion, and Politics. Each collected article is preprocessed to remove Latin characters, punctuation, and stop words. We deployed a collection of classical machine learning classifiers to investigate their effectiveness on our dataset. Seven popular classifiers were used, including Logistic Regression, Nearest Centroid, and Decision Tree (DT).
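The preprocessing step described above (removing Latin characters, punctuation, and stop words) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `preprocess`, the three-word stop list, and the choice to match only ASCII Latin letters are all hypothetical, and the paper's actual stop-word list is not given here.

```python
import re
import unicodedata

# Hypothetical stop-word list, for illustration only.
# The stop-word list actually used in the paper is not given in the abstract.
STOP_WORDS = {"ናይ", "ኣብ", "እዩ"}

# Assumption: "Latin characters" means ASCII letters A-Z and a-z.
LATIN_RE = re.compile(r"[A-Za-z]+")


def preprocess(text, stop_words=STOP_WORDS):
    """Return the tokens of `text` with Latin letters, punctuation,
    and stop words removed."""
    # Replace runs of Latin letters with a space.
    text = LATIN_RE.sub(" ", text)
    # Replace every Unicode punctuation character (general category P*)
    # with a space; this covers ASCII and Ethiopic punctuation marks.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    # Split on whitespace and drop stop words.
    return [tok for tok in text.split() if tok not in stop_words]


if __name__ == "__main__":
    print(preprocess("ኣብ ኤርትራ football ስፖርት።"))
```

The resulting token lists could then be turned into feature vectors (for example, term counts) and passed to any of the classifiers named above; that step is left out here because the abstract does not specify the feature representation.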