Date Published: April 5, 2017
Publisher: Public Library of Science
Author(s): Guozhong Feng, Baiguo An, Fengqin Yang, Han Wang, Libiao Zhang, Quan Zou.
Feature selection is a practical approach for improving the performance of text classification methods by optimizing the feature subsets input to classifiers. In traditional feature selection methods such as information gain and chi-square, the number of documents that contain a particular term (i.e. the document frequency) is often used. However, the frequency of a given term appearing in each document has not been fully investigated, even though it is a promising feature to produce accurate classifications. In this paper, we propose a new feature selection scheme based on a term event Multinomial naive Bayes probabilistic model. According to the model assumptions, the matching score function, which is based on the prediction probability ratio, can be factorized. Finally, we derive a feature selection measurement for each term after replacing inner parameters by their estimators. On a benchmark English text datasets (20 Newsgroups) and a Chinese text dataset (MPH-20), our numerical experiment results obtained from using two widely used text classifiers (naive Bayes and support vector machine) demonstrate that our method outperformed the representative feature selection methods.
Text classification has been applied in many contexts, ranging from document indexing based on a controlled vocabulary, to document filtering, automated metadata generation, word sense disambiguation, hierarchical cataloguing of web resources, and in general any application requiring document organization or selective and adaptive document dispatching . Many classification algorithms have been proposed for text classification, such as the naive Bayes (NB) classifier, k-nearest neighbors, and support vector machine (SVM) .
In this section, we will briefly describe some related works including the state-of-the-art feature selection methods used for text classification. To this end, we will introduce the bag-of-words model first. A toy example is given in Example 1.
Due to the good performance of WCP, we will revisit the probabilistic popularity of the terms and try to look for a model based scheme to measure the term information in this section.
In this study, we conducted two series of experiments under various experimental circumstances to evaluate the performance of the feature selection methods. To accomplish this, we compared three TF based feature selection methods (including our RP) and two DF based methods on a Chinese corpora and a popular benchmark data English corpora. We look for performance differences between the TF based feature selection methods and the DF based ones from the view of selecting features using the available Chinese dictionary in the first series of experiments. The second series experiments were performed to explore the superiority of the feature selection methods by the classification effectiveness using two state-of-the-art text classifiers: the Multinomial NB classifier and the SVM classifier.
We proposed a novel feature selection scheme via a widely used probabilistic text classification model. We captured term frequency information within the documents via a term event Multinomial model. To remove complex factors, we employed the logarithmic ratio of the positive class posterior probability to the negative one (e.g. the matching score idea). Then, we obtained a sub-score named relevance popularity of each feature under the well known NB assumption. Finally, we obtained a global feature selection score by using the Gini coefficient estimator [31, 37].