Search results

1 – 2 of 2
Article
Publication date: 1 March 1997

Xiangji Huang and S.E. Robertson

Abstract

The use of text retrieval methods based on the probabilistic model with Chinese language material is discussed. Since Chinese text has no natural word boundaries, we must either apply a dictionary‐based word segmentation method to the text, or index and search in terms of single Chinese characters. In either case, it becomes important to have a good way of dealing with phrases or contiguous strings of characters; the probabilistic model does not at present have such a facility. Some ad hoc modifications of the probabilistic weighting function and matching method are proposed for this purpose.
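The two indexing options named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the greedy forward maximum matching segmenter, the toy `lexicon`, and the function names are assumptions chosen for illustration. Character bigrams are included as one simple way to represent contiguous strings of characters.

```python
def char_ngrams(text, n):
    """Overlapping character n-grams; needs no dictionary."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def forward_max_match(text, lexicon, max_len=4):
    """Greedy dictionary-based segmentation (an assumed, common baseline).

    At each position take the longest lexicon entry of at most max_len
    characters; fall back to a single character if nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon and text ("information retrieval system").
lexicon = {"信息", "检索", "信息检索", "系统"}
text = "信息检索系统"

single_chars = char_ngrams(text, 1)          # index by single characters
bigram_terms = char_ngrams(text, 2)          # contiguous two-character strings
words = forward_max_match(text, lexicon)     # dictionary-based word tokens
```

With single-character indexing, a query for 检索 matches only if the two characters are required to be adjacent, which is why the treatment of phrases matters in both settings.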

Details

Journal of Documentation, vol. 53 no. 1
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 1 May 2007

Fuchun Peng and Xiangji Huang

Abstract

Purpose

The purpose of this research is to compare several machine learning techniques on the task of text classification for Asian languages such as Chinese and Japanese, where no word boundary information is available in written text. The paper advocates a simple language modeling based approach for this task.

Design/methodology/approach

Naïve Bayes, maximum entropy, support vector machine, and language modeling approaches were implemented and applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were applied to Chinese text, and a segmentation‐based approach was compared with a non‐segmentation‐based approach.
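A non‐segmentation‐based language modeling classifier can be sketched as follows. This is a minimal illustration, not the paper's system: it assumes one character bigram model per class, add-one smoothing, and a tiny hypothetical training set; the names `CharBigramLM`, `train`, and `classify` are invented for the example.

```python
import math
from collections import Counter


def bigrams(text):
    """Character bigrams with a start marker, so the first character is scored too."""
    padded = "^" + text
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class CharBigramLM:
    """Per-class character bigram model with add-one smoothing (an assumption)."""

    def __init__(self, docs, vocab_size):
        self.pair_counts = Counter()
        self.context_counts = Counter()
        self.vocab_size = vocab_size
        for doc in docs:
            for bg in bigrams(doc):
                self.pair_counts[bg] += 1
                self.context_counts[bg[0]] += 1

    def log_prob(self, text):
        total = 0.0
        for bg in bigrams(text):
            numerator = self.pair_counts[bg] + 1
            denominator = self.context_counts[bg[0]] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def train(labelled_docs):
    """Build one (log prior, language model) pair per class label."""
    vocab = {ch for docs in labelled_docs.values() for doc in docs for ch in doc}
    n_docs = sum(len(docs) for docs in labelled_docs.values())
    models = {}
    for label, docs in labelled_docs.items():
        log_prior = math.log(len(docs) / n_docs)
        # One extra vocabulary slot reserves probability mass for unseen characters.
        models[label] = (log_prior, CharBigramLM(docs, len(vocab) + 1))
    return models


def classify(models, text):
    """Pick the class maximising log prior + log likelihood of the character sequence."""
    return max(models, key=lambda label: models[label][0] + models[label][1].log_prob(text))


# Hypothetical toy training data, unsegmented.
training = {
    "sports": ["足球比赛", "篮球比赛"],
    "finance": ["股票市场", "银行利率"],
}
models = train(training)
predicted = classify(models, "足球")
```

Because the model scores raw character sequences, no word segmenter is needed. A segmentation‐based variant would replace `bigrams` with n-grams over word tokens.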

Findings

There were two findings: the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features; and it was found that classification with word level features normally yields improved classification performance, but that classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually classification performance stops improving, and can in fact even decrease, after a certain level of segmentation accuracy.

Practical implications

Applying the findings to real web text classification is ongoing work.

Originality/value

The paper is very relevant to Chinese and Japanese information processing, e.g. webpage classification, web search.

Details

Journal of Documentation, vol. 63 no. 3
Type: Research Article
ISSN: 0022-0418
