Sentiment detection in micro-blogs using unsupervised chunk extraction
© Magistry et al. 2016
Received: 30 July 2015
Accepted: 26 December 2015
Published: 26 January 2016
In this paper, we present a proposed system designed for sentiment detection for micro-blog data in Chinese. Our system surprisingly benefits from the lack of word boundary in Chinese writing system and shifts the focus directly to larger and more relevant chunks. We use an unsupervised Chinese word segmentation system and binomial test to extract specific and endogenous lexicon chunks from the training corpus. We combine the lexicon chunks with other external resources to train a maximum entropy model for document classification. With this method, we obtained an averaged F1 score of 87.2 which outperforms the state-of-the-art approach based on the released data in the second SocialNLP shared task.
Recently, due to its great potential applications such as opinion mining and topic detection, sentiment analysis on micro-blog data has gained much attention than ever before. The state-of-the art approaches to sentiment analysis/detection involves attributing a polarity to a textual message. The polarity may accept different sets of values depending on the tasks, such as ratings and binary or ternary values (positive, negative, neutral).
The original form of this work had been prepared for the participation in the shared task at the second SocialNLP workshop. The task targets on sentiment detection in Chinese micro-blogs, which posts were extracted from Plurk online service and were mainly written in Modern Standard Chinese (MSC) with some code switching or code mixing in English, Japanese, and Taiwanese Hokkien. Messages are provided with meta-data, including timestamps, user IDs of the original posters and repliers. Besides, the posts were grouped topic-wise into 95 files by the task organizersa.
In this task, the provided Plurk micro-blogging messages are typically short and are manually annotated with positive or negative polarities by the organizer. In addition to the provided data, external resources of other kinds are also combined for this task, which will be detailed in Section 3. Although it is worth noting that applying result comparisons on different languages or corpora are hazardous for this task, we achieved a score which resembles state-of-the-art on similar tasks in other languages.
2 Related works
Recent years have witnessed an increasing interest in the domain of micro-blogs such as Twitter, Facebook, and Sina Weibo. Since micro-blogs possess unique characteristics of limited length, ambiguous/error-prone, rich of acronyms, slangs, and flexible and novel usages, these would hinder the exploitation of sentiment analysis techniques (Liu 2012) in general. Additionally, different genres that appear in the social media (such as blogs, message boards, news, and micro-blogs) have brought different levels of difficulties for the analysts. To date, there has been little work on sentiment detection using Chinese micro-blogs. In the study by Yang (2009), F1 score of 64.8 % was reported by tracing a single Plurker’s post sentiments for a few months; while 77.4– 81.6 % was presented in Huang et al. (2012) by applying hybrid methods using posts from Sina Weibo.
Methodologically, sentiment detection is typically conducted using a supervised learning approach. As sentiment analysis is quite domain dependent, it is very hard to come up with a generic, one-box-fits-all approach for the task. Previous works rely on a large variety of machine learning algorithms, such as support vector machine (SVM) (Hu et al. 2007), maximum entropy (MaxEnt) (Lee and Renganathan 2011), and recurrent neural network (RNN) (Socher et al. 2013). Feature selections for these algorithms are mostly bag-of-word approaches; whereas grammar-specific rules may be involved as well to refine the features (Lee and Renganathan 2011; Thelwall et al. 2010). More recently, in addition to using lexicon-based features, some practitioners advocate applying Multi-Word Expression (MWE) recognition and/or syntactic parsing (Socher et al. 2013; Hou and Chang 2013).
2.1 Our strategy
Prior works as previously reviewed, either by exploiting various methods or resources, such as adding supplementary features, enhancing text preprocessing steps, and performing different weighting schemes onto the words, to some extent, all of these stick to the extension and improvement of bag-of-word models. Therefore, in this paper, we would take bag-of-word approaches to construct feature sets for sentiment detection.
However, some concerns exist when using bag-of-word models. In Chinese, since there are no delimiters used in word boundaries, a long controversy over wordhood issues has been raised among linguists. While applying bag-of-word models, the existence of wordhood needs to be assumed in advance, and a word segmentation system should be involved in preprocessing steps, for languages whose orthographic word boundary is not explicitly marked. Computational works in Chinese NLP-related studies used to take the task of wordhood assessment or word segmentation as a discrete (binary) decision, instead of continuous data segmentation, and have to presume a priori agreement (in whatever sense) that guides the production of segmented data, which could be further divided into training and testing data for follow-up procedures.
In our strategy, to begin with a bag-of-word approach applied to Chinese, as similar to those proposed for languages (e.g., English) written in Latin-based scripts, a Chinese Word Segmentation (CWS) procedure would be needed at the very beginning of the processing. Although this procedure is very likely to add some noisy labels to the data, but a naive tokenization on characters without any other segmentation procedure (i.e., a bag-of-character approach) is very unlikely to succeed in Chinese. Therefore, we have introduced an unsupervised CWS system into this task, regardless of the lack of training corpora specific to micro-blogs for training a CWS system. However, it is also found that segmented data resulting from CWS does not necessarily carry emotional concepts if not being placed into context. Thus, we argue that it would be more sensible to aim directly at larger chunks, which might correspond to MWEs in other languages.
Systems which do not follow the bag-of-word approach typically rely on syntactic parsing to obtain larger constituents and base their analysis on more complex structures. Unfortunately, such approaches are problematic to address Chinese micro-blog classification as they rely on current segmentation, tagging, and syntactic parsing systems which are trained on a very different genre of texts, which is typically cleaner and more formal. Given the lively and spontaneous nature of micro-blogs, it is hazardous to rely on processing tools which were trained on mostly journalistic data. On top of this, state-of-the art methods for English rely on a sentiment Treebank in which each node of the syntactic tree is annotated. Such resources are only available for English and would be time consuming to develop one for Chinese or other languages. Moreover, given how the topic can affect the polarity lexicon, the adaptability of the method to other domain is questionable. Hence, our choice is to rely on an unsupervised analysis which only needs raw data to build the features for our supervised classifier.
We believe that joining this task using micro-blogs would also be a very good test case for our unsupervised segmentation system (Magistry and Sagot 2012). In our segmentation system, MWE-like chunks are extracted instead of words, and the polarity of each chunk is directly learned from the training data. This method turns out to be particularly efficient as relevant keywords or key expressions for discriminating polarities within context may well be corpus specific. Therefore, in this case, the chunks with automatically learned polarities should be used in this task only but not getting collected into generic sentiment lexica lists. As data sparsity and dynamics remain a challenging issue under this approach, despite the extracted chunks, we also tried to enrich the lexica using other sources. In this section, materials and detailed procedures of the study are presented below.
3.1 Datasets for training and testing
In previous mood classification experiments by Chen et al. (2010), a set of Plurk corpus was collected and manually annotated with four emotion tags (positive, interrogative, negative, and unknown). From the corpus, a total of 3619 positive and 8373 negative posts are taken as gold standard data for training the system.
Additionally, we crawled some data from PTT to build another lexicon. PTT is the largest bulletin board system (BBS) in Taiwan, containing sheer numbers of boards. Among these boards, the Hate and Happy boards are filled with emotion-related expressions. The Hate board provides a platform for people to vent the hateful feelings of the unpleasant things they encounter daily; whereas, Happy board collects all the delightful things that people experience everyday. We randomly retrieved 6000 messages from the two boards relatively. Those who posted on the Hate board are considered to be negative, and those who posted on the Happy board are considered to be positive.
In the SSNRE project, emotion words could be grouped into emotion-inducing and emotion-describing words. Additionally, a total of 395 emotion-inducing words and 218 emotion-describing words have underwent three psychological experiments with a 9-point Likert scale. Based on the values evaluated via experiments, the emotion-inducing words (140 positive and 255 negative words) and emotion-describing words (58 positive and 160 negative words) could be further categorized into positive and negative word lists for training the system.
Moreover, MOE Dictionary, an online open-sourced Chinese dictionary, is also collected as training data.
Unsupervised chunk extraction using unsupervised CWS system (with NVBE formulation)
Supervised polarity identification of extracted chunks (with a binomial test)
Supervised sentiment detection of micro-blog messages (with a Maximum Entropy model)
3.2 Unsupervised extraction of the chunk list
To begin with, the provided Plurk training data, Chen et al. (2010) Plurk data, and PTT corpus are taken to train the unsupervised CWS system (Magistry and Sagot 2012). Within the unsupervised CWS system, Magistry and Sagot (2012) have proposed a state-of-the-art unsupervised segmentation algorithm, based on an autonomy measure computed from a normalized variation of branching entropy (NVBE). After the above unsupervised CWS system is trained, NVBE algorithm is then applied to extract a list of chunks. Here is a brief recall of NVBE formulation.
where μ k is the mean of all the values VBE (y) such as len (y)=k and σ k is the standard deviation of all the values VBE (y) such as len (y)=k.
In previous works, NVBE has shown to be a good empirical estimate of syntactic autonomy of a n gram. The segmentation was performed by maximizing the autonomy estimate. However, for our task, we do not need the actual segmentation. We simply extract a list of autonomous expressions (possibly larger than words) in which their polarity is tested afterwards. For our task, we do not need to actually perform the whole segmentation procedure (i.e., without outputting/using the final segmented results) but to train and use the algorithms from the proposed unsupervised CWS system, in order to get a list of chunks for the follow-up polarity testing. We select an autonomous chunk based on NVBE values, with positive values presented on the left and right branchings.
One thing to be noted is that although punctuation marks and non-Chinese characters are usually handled or removed during pre-processing steps, this kind of information remains in this task. This is done for the reason that in micro-blog messages, information such as punctuation marks and other non-Chinese characters can be used in surprising ways. For example, smileys, frequently used in micro-blogs for conveying emotions, are mostly composed of punctuation marks, symbols, and characters. Therefore, with this characteristic of micro-blogs, we expect to recognize smileys in an unsupervised fashion.
3.3 Constructing endogenous polarity chunks
Once a list of autonomous chunks is extracted, we need to determine whether these chunks are specific to a polarity or not. Here, the resource of MOE dictionary is also included to be identified and enlarge the chunk list. In order to do that, the number of times they appear in the training data containing positive and negative messages are counted, and then are compared with the overall proportion of positive over negative messages using a binomial test.
Here, the obtained p value from a binomial test is not identified in a traditional way which a significance level at 0.05 or 0.01 are usually defined. As we may want to favor large coverage of chunks over statistical significance, the scope is broader. However, to keep some information about the level of significance or certainty regarding our decisions, we use binning to define 5 classes of positive and 5 classes of negative items. We call these 5 classes of confidence levels (in our experiment, 0.3>A>0.2>B>0.1>C>0.05>D>0.01>E) and use them to further construct a list of autonomous chunks with endogenous polarities (in short, named as endogenous polarity chunks).
3.4 Maximum entropy classification
As the approach proposed in this paper is based on lexicalized features, we combine lexica from multiple sources and expect the features to be interdependent with a large amount. Additionally, it is hoped that once trained, the model can provide insights of the lexica and the task. For these reasons, we prefer to choose MaxEnt modeling as our machine learning algorithm for sentiment detection over other models, which are biased by having overlapping features (e.g., Naive Bayes models) or are used as uninterpretable black boxes (e.g., SVM or RNN).
Summary of the final lexica and chunks prepared for training a MaxEnt classifier
PTT Hate and Happy
Use the endogenous polarity chunks to perform a maximum forward matching segmentation on the given Plurk training dataset, looking for chunks with polarities of higher confidence levels and of the longest length first.
Determine and annotate whether the polarity of a chunk is mostly positive or negative.
Damn. This sentence is so funny that I am laughing my ass off. How could you guys be so into bananas?
4 Results and discussion
The use of the auxiliary data as training for the MaxEnt model
One more step to rebuild the unsupervised lexica/chunks with insights from the model’s feature weights
Taking advantage of the division into topics and of the provided meta-data
Future refinement of the system could also include rules for feature modification as has been done in Lee and Renganathan (2011). Since some topics only contain a small amount of messages, we decided to leave the topic division aside in our first experiment. This study shows that we have not reached the plateau of the learning curve yet. However, the final evaluation results present that our model can still benefit from a larger training data. In addition, the ablation test showing how much lexical resources contribute to the score will be conducted as well.
So far, most of the works on Chinese sentiment analysis have been heavily relied on bag-of-word-alike paradigm which requires identifying wordhood as the pre-processing task. On the contrary, we propose an unsupervised CWS model into sentiment analysis. We believe that the model proposed in this paper is very promising not only due to its treatment of necessary indeterminacy of wordhood issue with agnostic methodology but also its great achievement of performance over others in sentiment-related processing tasks.
a Unfortunately, participants in this shared task were too few to provide a comprehensive comparison of the results.
c It is found that there was a popular and funny banana dance video widely spread among the Plurkers at that time, which explains why banana is tagged as positive with the highest confidence level in this message.
This research was supported in part by the Erasmus Mundus Action 2 program MULTI of the European Union, grant agreement number 2010-5094-7. Additionally, we thank the anonymous reviewers’ comments on this study.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Chen, Mei-Yu, Hsin-Ni Lin, Chang-An Shih, Yen-Ching Hsu, Pei-Yu Hsu, and Shu-Kai Hsieh. 2010. Classifying mood in plurks. In Proceedings of the 22nd Conference on Computational Linguistics and Speech Processing (ROCLING 2010),172–183. Puli, Nantou, Taiwan: The Association for Computational Linguistics and Chinese Language Processing.Google Scholar
- Cheng, Chao-Ming, Hsueh-Chih Chen, and Shu-Ling Cho. 2012. Affective words. In A study on standard stimuli and normative responses of emotion in Taiwan. Taipei, Taiwan: National Taiwan University.Google Scholar
- Hou, Wen-Juan, and Chuang-Ping Chang. 2013. Sentiment classification for movie reviews in Chinese using parsing-based methods. In International Joint Conference on Natural Language Processing,561–569. Nagoya, Japan: Asian Federation of Natural Language Processing.Google Scholar
- Hu, Yi, Ruzhan Lu, Yuquan Chen, and Jianyong Duan. 2007. Using a generative model for sentiment analysis. Computational Linguistics and Chinese Language Processing 12: 107–126.Google Scholar
- Huang, Sui, Jianping You, Hongxian Zhang, and Wei Zhou. 2012. Sentiment analysis of Chinese micro-blog using semantic sentiment space model. In 2012 2nd International Conference on Computer Science and Network Technology (ICCSNT)1443–1447. Changchun, China: IEEE.Google Scholar
- Lee, Huey Yee, and Hemnaath Renganathan. 2011. Chinese sentiment analysis using maximum entropy. In Proceedings of the Workshop on Sentiment Analysis Where AI Meets Psychology (SAAIP), IJCNLP 2011,89–93. Chiang Mai, Thailand: Asian Federation of Natural Language Processing.Google Scholar
- Liu, Bing. 2012. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies 5: 1–167.View ArticleGoogle Scholar
- Magistry, Pierre, and Benoît Sagot. 2012. Unsupervized word segmentation: The case for Mandarin Chinese. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2,383–387. Jeju, Republic of Korea: Association for Computational Linguistics.Google Scholar
- Socher, Richard, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Conference on Empirical Methods in Natural Language Processing (EMNLP)1631–1642. Seattle, USA: Association for Computational Linguistics.Google Scholar
- Thelwall, Mike, Kevan Buckley, Georgios Paltoglou, Cai Di, and Arvid Kappas. 2010. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology 61: 2544–2558.View ArticleGoogle Scholar
- Yang, Ting-Hao.2009. Mining blogger’s glossary impressions from a micro-blog corpus. Master thesis. Hsin-Chu Taiwan: National Tsing-Hua University.Google Scholar