Date Published: May 15, 2019
Publisher: Public Library of Science
Author(s): Akram Osman, Naomie Salim, Faisal Saeed, Wajid Mumtaz.
Text Forum Threads (TFThs) contain a large number of Initial-Posts Replies pairs (IPR pairs), which relate to information exchange and discussion among forum users with similar interests. Some user replies in a discussion thread are off-topic and irrelevant, so the content varies in quality. It is important to identify the quality of the IPR pairs in a discussion thread in order to extract relevant information and helpful replies, because a high frequency of irrelevant replies could take the discussion in a different direction and cause genuine users to lose interest in the thread. In this study, the authors present an approach for identifying high-quality user replies to the Initial-Post, using quality dimension features for their extraction. Crowdsourcing platforms were used to judge the quality of the replies and to classify them as high-quality, low-quality or non-quality replies to the Initial-Posts. The high-quality IPR pairs were then extracted and identified based on their quality, and they were ranked using three classifiers (Support Vector Machine, Naïve Bayes, and Decision Trees) according to the quality dimensions of relevancy, author activeness, timeliness, ease-of-understanding, politeness, and amount-of-data. The experimental results for the TFThs showed that the proposed approach could improve the extraction of quality replies and identify the quality features that can be used for Text Forum Thread Summarization.
The growth of web services has made it easier for people to access and share knowledge on specific subjects in the form of User-Generated Content. Text Forum Threads (TFThs) are a web service in which users initiate discussions by posting Initial-Posts, asking for help or starting conversations on specific topics. Other users then read these Initial-Posts and reply accordingly, so a single Initial-Post can generate many replies. The Initial-Post and its replies are compiled together in one thread. In this study, the authors refer to the threads as Initial-Posts Replies pairs (IPR pairs). Fig 1 describes how every reply in the thread responds to the Initial-Post on a particular topic. The discussion threads in a forum contain valuable information that is hidden in the forum texts. Making effective use of this information in User-Generated Content is an important research topic in the field of thread retrieval. Identifying the important information in text forums can be very difficult because of information overload.
Identifying the quality features needed to extract quality replies to an Initial-Post in the TFThs can be difficult. Many studies have addressed the issues seen in the TFThs. In this study, the authors present a literature review of the studies that are directly related to this work.
In this study, a Classified Quality Initial-Post Replies Model (CQIPRM) is developed, which consists of five main components, as described in Fig 2. The details of every component are discussed as follows.
To understand and evaluate the quality of the IPR pairs in the TFThs, the replies were classified into three categories. The authors used 28 different quality features, which were divided into six quality dimensions (QDs): relevancy (D1) [33, 36, 39, 40, 59, 60], author activeness (D2) [59–63], timeliness (D3) [5, 17, 33, 62, 64–66], ease-of-understanding (D4) [13, 21, 65, 67], politeness (D5) [2, 21, 68], and amount-of-data (D6) [5, 7, 33, 39, 59, 62, 69, 70]. Table 1 summarizes the features of these QDs, while Table 2 lists their formulas.
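As a rough illustration of what per-reply quality features can look like, the sketch below computes one hypothetical feature for each of three dimensions: relevancy (D1) as cosine similarity between the term frequencies of the Initial-Post and a reply, timeliness (D3) as the delay in hours between the two, and amount-of-data (D6) as a word count. These definitions, and the function names, are assumptions made for illustration only; they are not the formulas listed in Table 2.

```python
import math
import re
from collections import Counter
from datetime import datetime


def tokenize(text):
    """Lowercase a text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_relevancy(initial_post, reply):
    """Assumed relevancy feature: cosine similarity of term-frequency vectors."""
    a = Counter(tokenize(initial_post))
    b = Counter(tokenize(reply))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with the other one
    return dot / (norm_a * norm_b)


def hours_since_initial_post(post_time, reply_time):
    """Assumed timeliness feature: delay of the reply in hours."""
    return (reply_time - post_time).total_seconds() / 3600.0


def word_count(reply):
    """Assumed amount-of-data feature: number of word tokens in the reply."""
    return len(tokenize(reply))


if __name__ == "__main__":
    post = "How do I install Ubuntu next to Windows?"
    reply = "You can install Ubuntu in a separate partition."
    print(cosine_relevancy(post, reply))
    print(hours_since_initial_post(datetime(2019, 1, 1, 10), datetime(2019, 1, 1, 13)))
    print(word_count(reply))
```

Each function returns a single number, so the values for one reply can be collected into a feature vector and passed to a classifier.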
Classifying user replies according to how they respond to the Initial-Posts could be helpful in the TFThs. In this study, the researchers describe how the reply classification information is used in the TFThs system. They incorporated the class labels of the replies into the dataset to determine whether this improved the TFThs system. Based on human judgments, each reply in the thread is assigned to one of three classes: non-quality, low-quality, or high-quality. In Table 3, the authors present an example of a discussion thread containing an Initial-Post and replies with their class labels, which are represented by nominal values. The class labels display the information below:
In this study, two datasets were used: the online TripAdvisor forum (https://www.tripadvisor.com.my/ShowForum-g28953-i4-New_York.html) for New York City (NYC) and the online Ubuntu Linux distribution forum (http://ubuntuforums.org). Both datasets consisted of discussion threads, where each thread is formed by IPR pairs. The statistics for both datasets are provided in Table 4.
The subsequent subsections describe the classification results and the reduction results, and discuss the results further using confusion matrices. The authors also compare this work with a related work (baseline).
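For readers unfamiliar with confusion matrices in a three-class setting, the following minimal sketch shows how one can be built for the non-quality, low-quality and high-quality labels, and how per-class precision and recall follow from it. The label strings, function names and the toy data are illustrative assumptions, not results from the study.

```python
from collections import Counter

# Assumed label names, in the order used for matrix rows and columns.
LABELS = ["non-quality", "low-quality", "high-quality"]


def confusion_matrix(actual, predicted, labels=LABELS):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def precision_recall(matrix, index):
    """Precision and recall for the class at the given row/column index."""
    true_positives = matrix[index][index]
    predicted_total = sum(row[index] for row in matrix)  # column sum
    actual_total = sum(matrix[index])                    # row sum
    precision = true_positives / predicted_total if predicted_total else 0.0
    recall = true_positives / actual_total if actual_total else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy human labels and classifier outputs (made up for illustration).
    actual = ["high-quality", "high-quality", "low-quality", "non-quality", "low-quality"]
    predicted = ["high-quality", "low-quality", "low-quality", "non-quality", "high-quality"]
    matrix = confusion_matrix(actual, predicted)
    for label, row in zip(LABELS, matrix):
        print(label, row)
    print(precision_recall(matrix, LABELS.index("high-quality")))
```

Off-diagonal cells show which classes are confused with each other, for example high-quality replies predicted as low-quality.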
In this study, human judgment and quality dimension features were used to identify the best quality features for detecting relevant user replies to the Initial-Posts in a discussion thread (IPR pairs), and thus to help assess the quality of user replies in the TFThs. Features from six QDs were studied using the discussion thread structure to assess user reply quality: relevancy, author activeness, timeliness, ease-of-understanding, amount-of-data, and politeness. The values of the quality features were then estimated for every reply. Human judgment was also used to classify the replies as high-quality, low-quality or non-quality. The SVM, NB and J48 classifiers were applied to assign each reply to one of these three groups. Additionally, the feature selection techniques of Information Gain, Chi-square and Gain Ratio were used, as these were better indicators for identifying the quality of the replies along with the best quality dimension features. According to these experiments, the model was able to identify the appropriate quality features from the six QDs for the TFThs, thereby improving the extraction of high-quality replies from the thread. Furthermore, the model showed good classification ability, which helped in identifying the high-quality users. It is believed that the proposed model will be able to support content filtering and specific forum searches. In future, this work can be extended to include text forum thread summarization.
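To make the feature selection step more concrete, here is a minimal sketch of how Information Gain and Gain Ratio can be computed for a single discretized feature against the reply class labels. It uses the standard entropy-based definitions; the function names and the toy data are assumptions for illustration, and the sketch says nothing about the tools or settings the authors actually used.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def information_gain(feature_values, labels):
    """Class entropy minus the expected class entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(feature_values, labels):
    """Information gain normalized by the entropy of the feature itself."""
    split_info = entropy(feature_values)
    if split_info == 0:
        return 0.0  # a constant feature cannot split the data
    return information_gain(feature_values, labels) / split_info


if __name__ == "__main__":
    # Toy example: a discretized reply-length feature (hypothetical values).
    labels = ["high-quality", "high-quality", "low-quality", "low-quality"]
    length_bucket = ["long", "long", "short", "short"]
    print(information_gain(length_bucket, labels))
    print(gain_ratio(length_bucket, labels))
```

Features can then be ranked by either score, with higher values indicating that the feature separates the quality classes more cleanly.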