A new term‐weighting scheme for naïve Bayes text categorization
International Journal of Web Information Systems
ISSN: 1744-0084
Article publication date: 30 March 2012
Abstract
Purpose
Automatic text categorization has applications in several domains, including e‐mail spam detection, sexual content filtering, directory maintenance, and focused crawling. Most information retrieval systems contain several components that use text categorization methods. One of the first text categorization methods was built on a naïve Bayes representation of the text, and a number of variations of naïve Bayes have since been proposed. The purpose of this paper is to evaluate naïve Bayes approaches to text categorization and to introduce new, competitive extensions to previous approaches.
Design/methodology/approach
The paper introduces a new Bayesian text categorization method based on an extension of the naïve Bayes approach. Modifications to the document representation are introduced, based on the well‐known BM25 text information retrieval method. The performance of the method is compared with that of several extensions of naïve Bayes using benchmark datasets designed for this purpose. The method is also compared with training‐based methods such as support vector machines and logistic regression.
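The abstract does not state the exact weighting used in the paper, so the following is only a minimal sketch of the general idea, under stated assumptions: raw term counts in a multinomial naïve Bayes classifier are replaced by the standard BM25 term-frequency component. The function names (`bm25_tf`, `train`, `predict`), the defaults `k1=1.2` and `b=0.75`, the add-one smoothing, and the toy data are all illustrative choices, not taken from the paper.

```python
import math
from collections import Counter, defaultdict

def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Standard BM25 term-frequency component (saturating, length-normalized).
    k1 and b are commonly used defaults, assumed here for illustration."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

def train(docs, labels, k1=1.2, b=0.75):
    """docs: list of token lists; labels: one class label per document."""
    avg_len = sum(len(d) for d in docs) / len(docs)
    weights = defaultdict(Counter)  # class -> term -> summed BM25 weight
    vocab = set()
    for tokens, label in zip(docs, labels):
        for term, tf in Counter(tokens).items():
            weights[label][term] += bm25_tf(tf, len(tokens), avg_len, k1, b)
            vocab.add(term)
    return {
        "avg_len": avg_len, "weights": weights, "vocab": vocab,
        "class_docs": Counter(labels), "n_docs": len(docs),
        "k1": k1, "b": b,
    }

def predict(model, tokens):
    """Return the class with the highest log-posterior score."""
    vocab_size = len(model["vocab"])
    scores = {}
    for label, n_label in model["class_docs"].items():
        total = sum(model["weights"][label].values())
        score = math.log(n_label / model["n_docs"])  # log prior
        for term, tf in Counter(tokens).items():
            if term not in model["vocab"]:
                continue  # ignore terms never seen in training
            w = bm25_tf(tf, len(tokens), model["avg_len"],
                        model["k1"], model["b"])
            # Add-one smoothing over the weighted counts (an assumption).
            p = (model["weights"][label][term] + 1.0) / (total + vocab_size)
            score += w * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy data, invented purely for illustration.
docs = [["cheap", "pills", "cheap"], ["meeting", "agenda"],
        ["cheap", "offer"], ["project", "meeting", "notes"]]
labels = ["spam", "ham", "spam", "ham"]
model = train(docs, labels)
```

Because the BM25 component saturates as the raw count grows (it never exceeds `k1 + 1`), a term repeated many times in one document contributes only slightly more than a term that appears once, which is one plausible reason such a weighting could help a multinomial model.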
Findings
The proposed text categorizer outperforms state‐of‐the‐art methods without introducing new computational costs. It also achieves results very similar to those of more complex methods based on criterion function optimization, such as support vector machines and logistic regression.
Practical implications
The proposed method scales well with the size of the collection involved. The presented results demonstrate the efficiency and effectiveness of the approach.
Originality/value
The paper introduces a novel naïve Bayes text categorization approach based on the well‐known BM25 information retrieval model, which offers a set of good properties for this problem.
Citation
Mendoza, M. (2012), "A new term‐weighting scheme for naïve Bayes text categorization", International Journal of Web Information Systems, Vol. 8 No. 1, pp. 55-72. https://doi.org/10.1108/17440081211222591
Publisher
Emerald Group Publishing Limited
Copyright © 2012, Emerald Group Publishing Limited