Language Independent n-Gram-Based Text Categorization with Weighting Factors: A Case Study

Jelena Graovac; Jovana Kovačević; Gordana Pavlović-Lažetić

Authors

Jelena Graovac, University of Belgrade
Jovana Kovačević, University of Belgrade; Indiana University
Gordana Pavlović-Lažetić, University of Belgrade

Keywords:

Arabic, byte-level n-gram, English, kNN, natural language text categorization

Abstract

We introduce a new language-independent text categorization technique based on three components: n-gram profiles of restricted size that represent both documents and categories, a scheme of n-gram weighting factors, and a simple algorithm for comparing profiles. The technique requires no morphological analysis of texts, no preprocessing steps, and no prior information about document content or language. We apply it to text categorization in two widely spoken yet paradigmatically quite different languages, English and Arabic, thus demonstrating its language independence. We use a publicly available document collection for each: 20-Newsgroups for English and Mesleh-10 for Arabic. Experimental results, reported as macro- and micro-averaged F1 measures, indicate that the new technique outperforms other n-gram-based and bag-of-words machine learning techniques on English and Arabic text categorization.
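
The general idea of comparing restricted-size byte-level n-gram profiles can be illustrated with a minimal sketch. The abstract does not specify the weighting factors or the comparison algorithm, so everything below is an assumption made for illustration only: the function names, the default values of `n` and `size`, the use of plain normalized frequencies in place of the weighting scheme, and the squared relative-difference dissimilarity are hypothetical stand-ins, not the authors' method.

```python
from collections import Counter


def byte_ngram_profile(text, n=3, size=50):
    """Return the `size` most frequent byte-level n-grams with normalized frequencies.

    Working on UTF-8 bytes means no tokenization or language knowledge is needed.
    Plain normalized frequency is an assumed stand-in for the weighting factors.
    """
    data = text.encode("utf-8")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.most_common(size)}


def dissimilarity(p, q):
    """Assumed measure: sum of squared relative frequency differences."""
    score = 0.0
    for gram in set(p) | set(q):
        f1, f2 = p.get(gram, 0.0), q.get(gram, 0.0)
        # f1 + f2 > 0 because each n-gram occurs in at least one profile
        score += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return score


def categorize(doc, category_texts, n=3, size=50):
    """Assign `doc` to the category whose profile is least dissimilar."""
    doc_profile = byte_ngram_profile(doc, n, size)
    profiles = {
        label: byte_ngram_profile(text, n, size)
        for label, text in category_texts.items()
    }
    return min(profiles, key=lambda label: dissimilarity(doc_profile, profiles[label]))
```

Here `category_texts` is a hypothetical mapping from a category label to the concatenated training text of that category. Because the profiles are built from raw bytes, the same code runs unchanged on English and Arabic input.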

Author Biographies

Jelena Graovac, University of Belgrade

Jelena Graovac received her MS degree in 2008 and her PhD in Computer Science in 2014, both from the University of Belgrade, Faculty of Mathematics. She is a Teaching Assistant at the Department of Computer Science, Faculty of Mathematics, University of Belgrade. Her research interests include issues related to Text Processing (Textual Databases, Text Classification, Information Retrieval), Bioinformatics and XML Databases. She is the author of research studies published in national and international journals, conference proceedings and book chapters.

Gordana Pavlović-Lažetić, University of Belgrade

Gordana Pavlović-Lažetić received her BSc, MSc and PhD from the University of Belgrade, Faculty of Mathematics. She is a Full Professor at the Department of Computer Science, Faculty of Mathematics, University of Belgrade. Her research interests include databases, bioinformatics, and data and text mining. She spent two years at the University of California, Berkeley, as a visiting scholar, and was a consultant at Relational Technology, Inc. She has authored a number of papers in journals and books published by renowned publishers. She is a member of the Association for Computing Machinery (ACM).
