Search results

Article

Publication date: 1 March 1997

Application of probabilistic methods to Chinese

Xiangji Huang and S.E. Robertson

Abstract

The use of text retrieval methods based on the probabilistic model with Chinese language material is discussed. Since Chinese text has no natural word boundaries, we must either apply a dictionary‐based word segmentation method to the text, or index and search in terms of single Chinese characters. In either case, it becomes important to have a good way of dealing with phrases or contiguous strings of characters; the probabilistic model does not at present have such a facility. Some ad hoc modifications of the probabilistic weighting function and matching method are proposed for this purpose.
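As a rough illustration of the two indexing options the abstract contrasts, the sketch below tokenizes the same string once as single characters and once with a dictionary-based segmenter. The abstract does not say which segmentation algorithm was used, so greedy forward maximum matching is an assumption here, and `toy_dictionary`, `max_len`, and both function names are hypothetical.

```python
def char_tokens(text):
    """Index terms are single characters (whitespace skipped)."""
    return [ch for ch in text if not ch.isspace()]


def max_match_tokens(text, dictionary, max_len=4):
    """Greedy forward maximum matching against a word list.

    At each position, take the longest dictionary word that starts there;
    fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"信息", "检索", "信息检索", "系统"}
sample = "信息检索系统"

print(char_tokens(sample))
print(max_match_tokens(sample, toy_dictionary))
```

With character tokens, a phrase such as 信息检索 survives only as four adjacent index terms, which is why the abstract stresses the need to handle contiguous strings of characters.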

Details

Journal of Documentation, vol. 53 no. 1

Type: Research Article

ISSN: 0022-0418

Article

Publication date: 1 May 2007

Machine learning for Asian language text classification

Fuchun Peng and Xiangji Huang

Abstract

Purpose

The purpose of this research is to compare several machine learning techniques on the task of text classification for Asian languages such as Chinese and Japanese, where no word boundary information is available in written text. The paper advocates a simple language‐modeling‐based approach for this task.

Design/methodology/approach

Naïve Bayes, maximum entropy model, support vector machines, and language modeling approaches were implemented and were applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were investigated and applied to Chinese text. A segmentation‐based approach was compared with the non‐segmentation‐based approach.
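A minimal sketch of what a non-segmentation-based language modeling classifier can look like: one character bigram model per class, with the document assigned to the class whose model gives it the highest log-likelihood. The abstract does not state the n-gram order, the smoothing method, or how class priors were handled, so the bigram order, add-one smoothing, and uniform priors below are all assumptions, and every name and the toy training data are hypothetical.

```python
import math
from collections import Counter


class CharBigramLM:
    """Character bigram model with add-one smoothing (an assumed choice)."""

    def __init__(self, docs):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for doc in docs:
            padded = "^" + doc  # "^" marks the document start
            self.vocab.update(padded)
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, doc, vocab_size):
        """Smoothed log-likelihood of doc under this class model."""
        padded = "^" + doc
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.contexts[prev] + vocab_size
            total += math.log(num / den)
        return total


def train(labelled_docs):
    """Build one model per label; return the models and the shared vocabulary size."""
    by_label = {}
    for label, doc in labelled_docs:
        by_label.setdefault(label, []).append(doc)
    models = {label: CharBigramLM(docs) for label, docs in by_label.items()}
    vocab = set().union(*(m.vocab for m in models.values()))
    return models, len(vocab)


def classify(doc, models, vocab_size):
    """Pick the label whose model scores doc highest (uniform class priors)."""
    return max(models, key=lambda label: models[label].log_prob(doc, vocab_size))


# Hypothetical toy training data, for illustration only.
toy_training = [
    ("sports", "足球比赛结果"),
    ("sports", "篮球比赛"),
    ("finance", "股票市场行情"),
    ("finance", "银行利率"),
]
models, vocab_size = train(toy_training)
print(classify("足球比赛", models, vocab_size))
```

Because the model works on raw characters, no segmenter is needed; a segmentation-based variant would feed word tokens into the same structure instead of characters.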

Findings

There were two findings. First, the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features. Second, classification with word‐level features normally yields improved classification performance, but classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but it eventually stops improving, and can even decrease, beyond a certain level of segmentation accuracy.

Practical implications

Applying the findings to real web text classification is ongoing work.

Originality/value

The paper is very relevant to Chinese and Japanese information processing, e.g. webpage classification, web search.

Details

Journal of Documentation, vol. 63 no. 3

Type: Research Article

ISSN: 0022-0418
