Difficulty Factors and Preprocessing in Imbalanced Data Sets: An Experimental Study on Artificial Data

eISSN: 2300-3405
Language: English

Publication timeframe: 4 times per year
Journal Subjects: Computer Sciences, Artificial Intelligence, Software Development

Published Online: Jun 16, 2017

Page range: 149 - 176

Received: Oct 19, 2016

Accepted: Apr 24, 2017

DOI: https://doi.org/10.1515/fcds-2017-0007

Keywords: imbalanced data, difficulty factors, preprocessing methods, learning and classification

© by Szymon Wilk

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.