Machine learning for Asian language text classification
Abstract
Purpose
The purpose of this research is to compare several machine learning techniques on text classification for Asian languages such as Chinese and Japanese, where written text carries no word boundary information. The paper advocates a simple language modeling based approach for this task.
Design/methodology/approach
Naïve Bayes, maximum entropy, support vector machine, and language modeling approaches were implemented and applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, several word segmentation approaches were applied to Chinese text, and a segmentation‐based approach was compared with the non‐segmentation‐based approach.
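As an illustration of the non‐segmentation‐based language modeling idea, the following is a minimal sketch of a character n‐gram language model classifier. It trains one model per class and assigns a document to the class whose model gives it the highest log probability. The abstract does not state the n‐gram order or the smoothing method used in the paper, so the bigram default and the add‐one smoothing here are assumptions, as are the class name `CharNgramLMClassifier`, the helper `char_ngrams`, and the toy training sentences.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return one character n-gram per character, left-padded with a start marker."""
    padded = "\x02" * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


class CharNgramLMClassifier:
    """One character n-gram language model per class (hypothetical helper class)."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # class -> n-gram counts
        self.context_counts = defaultdict(Counter)  # class -> (n-1)-gram context counts
        self.doc_counts = Counter()                 # class -> number of training documents
        self.vocab = set()                          # characters seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            self.vocab.update(text)
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
        return self

    def log_score(self, text, label):
        # Add-one smoothing is an assumption; the extra 1 reserves mass for unseen characters.
        v = len(self.vocab) + 1
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for gram in char_ngrams(text, self.n):
            numerator = self.ngram_counts[label][gram] + 1
            denominator = self.context_counts[label][gram[:-1]] + v
            score += math.log(numerator / denominator)
        return score

    def predict(self, text):
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Toy data for illustration only; no word segmentation is applied.
train_texts = ["足球比赛很精彩", "篮球比赛结果", "股票市场上涨", "银行利率下调"]
train_labels = ["sports", "sports", "finance", "finance"]
model = CharNgramLMClassifier(n=2).fit(train_texts, train_labels)
prediction = model.predict("足球比赛")
```

Because the model works directly on characters, it needs no segmenter. A word‐level variant could be obtained by replacing `char_ngrams` with n‐grams over the output of a word segmenter.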
Findings
There were two findings. First, the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features. Second, classification with word‐level features normally yields better classification performance, but performance is not monotonically related to segmentation accuracy: it may initially improve as segmentation accuracy increases, but beyond a certain level of segmentation accuracy it stops improving and can even decrease.
Practical implications
Applying the findings to real web text classification is ongoing work.
Originality/value
The paper is very relevant to Chinese and Japanese information processing, e.g. webpage classification and web search.
Citation
Peng, F. and Huang, X. (2007), "Machine learning for Asian language text classification", Journal of Documentation, Vol. 63 No. 3, pp. 378-397. https://doi.org/10.1108/00220410710743306
Publisher
Emerald Group Publishing Limited
Copyright © 2007, Emerald Group Publishing Limited