Open Access

Tackling the Problem of Class Imbalance in Multi-class Sentiment Classification: An Experimental Study

Jun 06, 2019

eISSN: 2300-3405
Language: English
Publication timeframe: 4 times per year
Journal Subjects: Computer Sciences, Artificial Intelligence, Software Development