Text Classification based on the Bias of Word Frequency over Categories

M. Suzuki (Japan)

Keywords

text categorization, automatic classification, vector space model, tfidf

Abstract

In automatic text classification, such as sorting newspaper articles into predefined categories like politics and sports, the crucial step is selecting appropriate keywords. Traditional classification methods based on the vector space model emphasize frequent words, so low-frequency words tend to be disregarded. However, low-frequency words are often effective for classification. For instance, technical terms appear in specific categories, so their overall frequencies are generally low even though they are effective keywords. In this paper, we propose two text classification methods, the NDF method and the accumulation method, both based on the bias of word frequency distribution over categories. Our experiments show that the accumulation method outperforms a traditional method based on the vector space model.
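
The observation that a rare word can still be concentrated in a single category can be illustrated with a toy sketch. This is not the NDF method or the accumulation method, neither of which is defined in the abstract; the function names, the sample documents, and the "largest category share" measure are all assumptions chosen only for illustration.

```python
from collections import Counter, defaultdict


def category_shares(docs):
    """Return {word: {category: share of that word's occurrences}}.

    `docs` is an iterable of (category, tokens) pairs (hypothetical input format).
    """
    counts = defaultdict(Counter)
    for category, tokens in docs:
        for token in tokens:
            counts[token][category] += 1

    shares = {}
    for word, per_category in counts.items():
        total = sum(per_category.values())
        shares[word] = {c: n / total for c, n in per_category.items()}
    return shares


def bias(word_shares):
    """Illustrative bias measure (an assumption): the largest category share."""
    return max(word_shares.values())


# Toy corpus, invented for this example.
docs = [
    ("politics", ["the", "minister", "said", "the", "budget"]),
    ("sports", ["the", "striker", "scored", "the", "goal"]),
    ("sports", ["the", "offside", "rule"]),
]

shares = category_shares(docs)
for word in ("the", "offside"):
    print(word, shares[word], bias(shares[word]))
```

In this toy corpus, "the" is frequent but spread across both categories, while "offside" occurs once yet falls entirely in one category. A weighting driven by raw frequency would favor the former, whereas a measure of distribution over categories favors the latter.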

From Proceeding (502) Artificial Intelligence and Applications - 2006
