Jinbeom Kang, E. Lee, K. Hong, Jeahyun Park, T. Kim, Juyoung Park, J. Choi, and J. Yang (Korea)
feature selection, impurity of words, unbalanced distribution, machine learning, text classification
Feature selection in machine learning is the task of identifying a set of representative terms or features from a document collection, used mainly in text classification. Existing feature selection methods, including information gain and the χ²-test, focus on features that are useful across all topics, and consequently lack the power to select features that are truly representative of a particular topic (or class). These methods also assume that the distribution of documents over classes is balanced. This assumption hurts classification accuracy, because real-world document collections rarely have a balanced distribution, and it is difficult to prepare a training set with an equal number of documents for each class. To resolve this problem, we propose a new feature selection method for text classification based on the purity of a word, which emphasizes its representativeness for a particular class. Our method also assumes an unbalanced distribution of documents over multiple classes, and combines feature values with weight factors that reflect the number of training documents in each class. In summary, we obtain feature candidates using word purity and then select features while accounting for the unbalanced distribution of documents. Experiments demonstrate that our method outperforms existing methods.
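The abstract does not give the purity formula or the exact weight factors; the sketch below is one plausible reading of the idea, not the authors' method. It scores each word by the share of its (class-size-weighted) document frequency that falls in its dominant class, so a word concentrated in one class scores near 1.0 and a word spread evenly across classes scores lower; the inverse-class-size weighting is an assumed stand-in for the paper's unbalanced-distribution correction.

```python
from collections import Counter

def purity_scores(docs, labels):
    """Score each word by its 'purity' toward a single class,
    with weights that compensate for unbalanced class sizes.
    (Illustrative formula, assumed; not the paper's definition.)"""
    class_sizes = Counter(labels)
    total_docs = len(docs)
    # word -> Counter mapping class -> number of documents containing the word
    word_class_df = {}
    for doc, label in zip(docs, labels):
        for w in set(doc.split()):
            word_class_df.setdefault(w, Counter())[label] += 1
    scores = {}
    for w, per_class in word_class_df.items():
        # weight each class's frequency by the inverse of its share of the
        # training documents, so small classes are not drowned out
        weighted = {c: df * (total_docs / class_sizes[c])
                    for c, df in per_class.items()}
        total = sum(weighted.values())
        # purity = weighted share of occurrences in the dominant class;
        # 1.0 means the word appears in documents of a single class only
        scores[w] = max(weighted.values()) / total
    return scores
```

With a toy corpus such as `["win goal ball", "ball match", "stock market ball"]` labelled `["sport", "sport", "finance"]`, class-exclusive words like `goal` and `stock` score 1.0, while `ball`, which appears in both classes, scores 0.5 after the class-size weighting.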