The Influence of Unbalanced Economic Data on Feature Selection and Quality of Classifiers

Abstract Research background: The successful learning of classifiers depends on the quality of data. Modeling is especially difficult when the data are unbalanced or contain many irrelevant variables. This is the case in many applications. The classification of rare events is the overarching goal, e.g. in bankruptcy prediction, churn analysis or fraud detection. The problem of irrelevant variables accompanies situations where the specification of the model is not known a priori, thus the typical conditions for data mining analysts. Purpose: The purpose of this paper is to compare combinations of the most popular strategies for handling unbalanced data with feature selection methods representing filters, wrappers and embedded methods. Research methodology: In the empirical study, we use real datasets with additionally introduced irrelevant variables. In this way, we are able to recognize which method correctly eliminates irrelevant variables. Results: Having carried out the experiment, we conclude that over-sampling does not work in connection with feature selection. Some recommendations of the most promising methods are also given. Novelty: Many solutions concerning unbalanced data as well as feature selection have been proposed in the literature. The novelty of our study lies in examining their interactions.


Introduction
The problem of unbalanced datasets in supervised classification occurs when examples of one class are rare compared to other classes. This happens quite often, even in the case of binary classification, which is considered in this paper. It suffices to mention applications such as bankruptcy prediction, churn analysis, the prediction of positive responses in direct marketing campaigns, or fraud detection. When examples of one class are rare, classifiers tend to favour accurate prediction of the opposite class. This means that most of the rare examples are misclassified while the majority class is recognized quite well. In extreme cases, the models can classify all objects into the majority class. The explanation of this phenomenon depends on the learning method. In logistic regression, the posterior probabilities of the minority class are underestimated (King, Zeng, 2001). In turn, the recursive partitioning learning algorithm minimizes the overall error regardless of class. Note that the correct classification of rare examples is of particular importance in the mentioned applications. A model that accurately predicts the bankruptcy of enterprises contributes to investment risk reduction. The detection of frauds, which are a marginal part of overall transactions, protects against losses. Moreover, it affects customer trust and the bank's image. In churn analysis, resigning customers are usually a small fraction of all customers. A lot of attention is paid to them, because customer retention is much cheaper than acquiring new customers. That is why special offers are prepared for unsure clients, often individually tailored to them. The cost of a direct marketing action decreases when more potentially resigning customers are identified.
The unbalanced learning problem has gained a great deal of interest recently. Providing a complete review of the literature falls beyond the scope of this paper. An overview of existing solutions with wide references can be found in the works of (Chawla, Japkowicz, Kołcz, 2004; Weiss, 2004; Galar, Fernandez, Barrenechea, Bustince, Herrera, 2011; Longadge, Dongre, Malik, 2013; Haixiang et al., 2017). These solutions can be divided into two groups, where changes are introduced at the level of the data or of the learning algorithm. Note that this division is not disjoint. Consider the balancing methods based on resampling. They are usually used in the pre-processing step to prepare a new training set, which would be balanced. However, the same techniques can be applied inside a learning algorithm, namely for each base model in ensemble learning. Balanced random forests can be cited as an example (Chen, Liaw, Breiman, 2004), where the training sample is balanced in each iteration.
It should be noted that the assessment of models obtained from unbalanced data requires quality metrics other than the overall classification error. The most popular are: sensitivity (a.k.a. recall), specificity, precision, the AUC measure (which is a mean of the accuracies calculated for both classes separately), and the F-measure, which combines sensitivity and precision (Fawcett, 2006). All of them distinguish between misclassifications of the minority and the majority class. An extensive study of model assessment in the context of unbalanced data is given in (Japkowicz, Shah, 2011).
The next obstacle to building models with a high classification accuracy are irrelevant variables, i.e. those that have no impact on the response variable (Guyon, Gunn, Nikravesh, Zadeh, 2006). The relevancy of a variable is usually defined individually or in the context of other variables. Irrelevant variables can lead to overfitting, decreasing the generalization ability of the models. In addition, the accuracy of model parameter estimation decreases with an increase of dimension (the curse of dimensionality). Frequently, the goal of the analysis is the discovery of unknown relations in big datasets, when a researcher does not have sufficient prior knowledge about which features really influence the dependent variable. Therefore, feature selection is an important stage that may be formulated as an optimisation problem, namely the task of finding the subset of predictors that gives the best classifier (Tsamardinos, Aliferis, 2003).
The modeling of classifiers is particularly difficult when the described problems occur simultaneously. We suppose that unbalanced data influence the feature selection process. The goal of this paper is to examine the relation between balancing techniques and feature selection in order to finally build an optimal classifier. We consider seven techniques of handling unbalanced data and seven feature selection methods, which represent the three main approaches to this problem: filters, wrappers and embedded methods. In the presented empirical study we use real datasets with artificially added irrelevant variables. This makes it possible to observe the relationship between the balancing of training sets and variable selection. Our study allowed us to draw a few conclusions and recommendations on the use of the discussed methods.
The rest of this paper is organized as follows. Section 1 briefly presents the most popular strategies of handling unbalanced data. Section 2 is devoted to the feature selection problem. The setup and the results of our experiment are presented in Section 3. Section 4 contains the concluding remarks.

How to handle unbalanced data?
Before presenting the popular ways of dealing with unbalanced data, we consider some simple model quality measures that distinguish between types of misclassifications. We introduce the terminology and notation commonly used in the literature (Fawcett, 2006). Let TP and TN denote the numbers of correctly classified examples of the positive (minority) and negative (majority) class, and let FN and FP denote the corresponding misclassifications. The ratio TPR = TP / (TP + FN) is called the true positive rate, or sensitivity, while TNR = TN / (TN + FP) is called the true negative rate, or specificity. The balance between these two kinds of accuracy is reflected in the AUC measure, which is simply their arithmetic mean.
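These measures can be computed directly from the counts of the confusion matrix. Below is a minimal sketch in Python; the function name and the example counts are our own illustration, not taken from the paper.

```python
def unbalanced_metrics(tp, fn, tn, fp):
    """Quality measures that distinguish misclassifications per class (Fawcett, 2006)."""
    tpr = tp / (tp + fn)                  # sensitivity (recall): accuracy on the minority class
    tnr = tn / (tn + fp)                  # specificity: accuracy on the majority class
    precision = tp / (tp + fp)            # fraction of predicted positives that are correct
    auc = (tpr + tnr) / 2                 # arithmetic mean of the two class accuracies
    f1 = 2 * precision * tpr / (precision + tpr)   # F-measure: combines precision and sensitivity
    return {"TPR": tpr, "TNR": tnr, "precision": precision, "AUC": auc, "F1": f1}

# A classifier on a 5:95 unbalanced test set: overall accuracy is 93%,
# yet only 3 of the 5 rare examples are recognized (TPR = 0.6).
print(unbalanced_metrics(tp=3, fn=2, tn=90, fp=5))
```

The example shows why the overall error is misleading: the same confusion matrix yields 93% accuracy but a much lower AUC, because the two class accuracies are averaged with equal weight.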
Two general approaches to solving the problem of unbalance are changes in the data structure and modifications of learning algorithms. The first is to prepare a new, balanced training set. The simplest techniques are random under-sampling of the majority class and random over-sampling of the minority class. A more sophisticated variant draws new examples from the class distribution, which is approximated by the kernel estimate:

f_j(x) = (1/n_j) * sum_{i=1}^{n_j} K_{H_j}(x - x_i),   (1)

where j indicates a class, n_j is its size and H_j is a matrix of scale parameters in the chosen class. Under- as well as over-sampling can be performed to reach a pre-defined number of observations in both classes. The algorithm is partially random because the objects (x_i, y_i) of the original dataset are drawn randomly with the same class probabilities. The most popular algorithm that uses distances between objects is SMOTE (Chawla et al., 2002). The second approach to dealing with unbalanced data is a modification of the learning algorithm. In practice, changing the estimation criterion is not an easy solution: it requires thorough knowledge of the classification method and programming skills. Assigning different costs to incorrect classifications is also usually problematic due to insufficient knowledge of the examined phenomenon. The simplest way is to run a learning algorithm on the original data and to shift the classification cut-off point for the posterior probabilities.
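The data-level approach can be illustrated with the two simplest resampling strategies. The helper below is our own sketch, not an algorithm from the paper: under-sampling draws a random subset of the majority class, over-sampling draws minority examples with replacement, until both classes are equally sized.

```python
import random

def balance_by_sampling(X, y, minority=1, method="under", seed=0):
    """Balance a binary training set by random under- or over-sampling."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    if method == "under":
        # keep only as many majority examples as there are minority ones
        majority_idx = rng.sample(majority_idx, len(minority_idx))
    else:
        # replicate minority examples (with replacement) up to the majority size
        minority_idx = minority_idx + [rng.choice(minority_idx)
                                       for _ in range(len(majority_idx) - len(minority_idx))]
    idx = minority_idx + majority_idx
    return [X[i] for i in idx], [y[i] for i in idx]

X = [[i] for i in range(10)]
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]            # 2 minority vs 8 majority examples
Xb, yb = balance_by_sampling(X, y, method="under")
print(sorted(yb))                              # -> [0, 0, 1, 1]
```

Note the trade-off visible even in this toy code: under-sampling shrinks the training set and may discard information, while over-sampling duplicates examples and enlarges the data.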
The techniques described above may be used in combination with certain methods of cleaning the data, which focus on noisy examples or overlapping regions. As an example we can present the Tomek links method (Tomek, 1976). This method searches for pairs of points that are the nearest neighbours of each other but represent different classes. Then either both examples or only the one from the majority class is removed. The result is that all pairs of nearest neighbours in the data belong to the same class.
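A naive sketch of the Tomek links search may clarify the idea. The function and the toy data below are our own; a production implementation would use an optimised nearest-neighbour search rather than this O(n^2) scan.

```python
def tomek_links(X, y):
    """Find Tomek links: pairs of mutual nearest neighbours from opposite classes."""
    def dist(a, b):
        # squared Euclidean distance is enough for nearest-neighbour comparisons
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist(X[i], X[j]))

    links = []
    for i in range(len(X)):
        j = nearest(i)
        # a link requires opposite classes and mutual nearest-neighbourhood
        if y[i] != y[j] and nearest(j) == i and (j, i) not in links:
            links.append((i, j))
    return links

# Two well-separated classes plus one majority point deep in minority territory:
X = [[-0.5], [0.2], [5.0], [5.2], [0.4]]
y = [1, 1, 0, 0, 0]
print(tomek_links(X, y))   # -> [(1, 4)]: the intruding point at 0.4 forms a link
```

Removing the majority-class member of each detected pair (index 4 here) cleans the overlapping region, as described above.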

Feature selection
The task of feature selection can be formulated as a problem of combinatorial optimisation. Given some quality criterion Q = Q(S), we search for the feature subset S ⊆ X so that the function Q(S) reaches its optimum. Due to the relationship between the search process and the estimation of model parameters, feature selection methods are currently grouped into three approaches: filters, wrappers and embedded methods (Guyon et al., 2006).
The filters work as a pre-processing step and the criterion Q is not directly connected with model quality. The most popular approach is an evaluation of the individual impact of a predictor on the dependent variable. It is fast, easy to implement, and the search process is reduced to ranking the variables. This may be performed by statistical significance tests. In the case of a qualitative response variable, the Wilcoxon rank sum test or the chi-squared test can be applied, depending on the type of predictor. The significance level decides which variables are treated as relevant. The next popular group of filter criteria uses the entropy measure:

H(Y) = - sum_i P(y_i) * log2 P(y_i),   (2)

which expresses the information content in the distribution. The measure (2) assumes qualitative or discrete variables; continuous variables should be discretised. One of the most popular entropy-based measures is a normalized version of mutual information, called symmetrical uncertainty:

SU(X, Y) = 2 * IG(X, Y) / (H(X) + H(Y)),   (3)

where IG(X, Y) = H(Y) - H(Y|X) is the information gain. Note that applying this criterion results only in a ranking of the variables, and it is necessary to set a certain relevance threshold. A relatively fast algorithm (FCBF) that eliminates irrelevant as well as redundant variables was introduced by L. Yu and H. Liu (2004). To decide whether there is a correlation between features or not, a threshold value is introduced. A separate group of filters represents a multidimensional approach. Some of them use distances between objects. The classical example is the ReliefF algorithm (Kononenko, 1994). It assigns weights to the variables in an iterative procedure, in which an example x is sampled from the training set and its k nearest neighbours from the same class (nearest hits NH) are searched for, as well as k nearest neighbours from the opposite class (nearest misses NM). Then the weight of the variable X is updated according to the coordinate differences between x and its nearest neighbours:

W(X) := W(X) - sum_{j=1}^{k} diff(X, x, NH_j) / (m*k) + sum_{j=1}^{k} diff(X, x, NM_j) / (m*k),   (4)

where m is the number of iterations and diff measures the difference between two examples on variable X. The algorithm has two input parameters: k and the number of iterations. Recommendations for their settings are given, e.g., in (Kubus, 2016). The obtained weights give a feature ranking, thus a cut-off point for relevance still has to be set.
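For discrete variables, the symmetrical uncertainty filter is easy to sketch. The helper names below are our own. A feature that copies the class obtains SU = 1, while a feature independent of the class obtains SU = 0:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of a discrete variable."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """Normalized mutual information: SU = 2 * IG(X, Y) / (H(X) + H(Y))."""
    joint = entropy(list(zip(x, y)))          # entropy of the joint distribution
    gain = entropy(x) + entropy(y) - joint    # information gain (mutual information)
    hx, hy = entropy(x), entropy(y)
    return 2 * gain / (hx + hy) if hx + hy else 0.0

y = [0, 0, 1, 1]
relevant = [0, 0, 1, 1]     # copies the class
irrelevant = [0, 1, 0, 1]   # independent of the class
print(symmetrical_uncertainty(relevant, y), symmetrical_uncertainty(irrelevant, y))  # -> 1.0 0.0
```

In a filter, each feature's SU against the class is compared with a relevance threshold, exactly as described above; continuous features must be discretised first.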
The second group of feature selection methods is constituted by so-called wrappers. Model assessment is employed for the evaluation of feature subsets. Searching the space of subsets is performed as an outer loop around the learning algorithm. This approach can carry a high computational burden, even for heuristic searches. The most popular representative of wrappers is the stepwise procedure, which may be applied in two directions. Forward selection adds to the current subset, in each iteration, the variable that optimises the model quality criterion. On the contrary, backward elimination discards one variable from the current subset in each iteration. This second variant starts with the full set of predictors and is much more computationally expensive, especially when the complexity of the learning algorithm is high.
To accelerate this search process, recursive feature elimination was proposed in the context of support vector machines (Guyon, Weston, Barnhill, Vapnik, 2002). M. Kubus (2015) emphasizes that recursive feature elimination is a general procedure which can work with many discrimination or regression methods. In the context of logistic regression, a popular variant of this algorithm, willingly used by practitioners, iteratively eliminates the variables corresponding to insignificant coefficients.
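The wrapper logic of forward selection can be sketched independently of the learning method. Here `quality` abstracts the model assessment (in practice a cross-validated classifier score); the function name and the toy criterion are our own illustration:

```python
def forward_selection(features, quality):
    """Greedy forward search: add the feature that most improves Q, stop when none does."""
    selected = frozenset()
    best_q = quality(selected)
    while True:
        candidates = [(quality(selected | {f}), f)
                      for f in features if f not in selected]
        if not candidates:
            break
        q, f = max(candidates)
        if q <= best_q:            # no candidate improves the criterion -> stop
            break
        selected, best_q = selected | {f}, q
    return selected

# Toy criterion: two relevant features raise the score, an irrelevant one lowers it.
gains = {"x1": 0.30, "x2": 0.20, "noise": -0.05}
quality = lambda subset: 0.5 + sum(gains[f] for f in subset)
print(sorted(forward_selection(["x1", "x2", "noise"], quality)))   # -> ['x1', 'x2']
```

The outer loop evaluates one model per candidate feature per iteration, which is exactly where the computational burden of wrappers comes from.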
The last approach to the task of feature selection is to place this mechanism inside the learning algorithm (embedded methods). Searching for the best subset and learning a model are then performed simultaneously. A well-known example is regularized logistic regression, which maximizes the penalized likelihood criterion:

ln L(b) - λ * P(b),   (5)

where L(b) is a likelihood function and P(b) is a penalty imposed on the coefficients b. The higher the coefficient values, the greater the penalty. This results in the shrinking of the coefficients towards zero; in extreme cases they can be exactly zero, which amounts to feature selection. The penalty term called the elastic net:

P(b) = α * sum_j |b_j| + (1 - α) * sum_j b_j^2,   (6)

was proposed by H. Zou and T. Hastie (2005) and combines two historically earlier propositions: ridge regression and the lasso. The penalty parameter λ influences the amount of shrinkage.
In practice, several models are estimated for different values of λ, and the one that minimizes the classification error or an information criterion is finally chosen.
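The feature selection effect of the elastic net can be demonstrated with a small proximal-gradient sketch; this is our own illustration, not the estimation procedure used in the paper. The L1 part of the penalty is applied by soft-thresholding, which can set the coefficient of an irrelevant variable exactly to zero:

```python
from math import exp

def fit_logistic_elastic_net(X, y, lam, alpha=0.9, lr=0.5, steps=2000):
    """Proximal-gradient sketch of logistic regression with an elastic net penalty:
    minimize mean negative log-likelihood + lam*(alpha*sum|w_j| + (1-alpha)*sum w_j^2)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xij for wj, xij in zip(w, xi)) + b
            pr = 1 / (1 + exp(-z))                # posterior probability of class 1
            for j in range(p):
                grad_w[j] += (pr - yi) * xi[j] / n
            grad_b += (pr - yi) / n
        for j in range(p):
            # gradient step on the smooth part (loss + ridge term) ...
            wj = w[j] - lr * (grad_w[j] + 2 * lam * (1 - alpha) * w[j])
            # ... then soft-thresholding for the L1 term (may yield exact zeros)
            t = lr * lam * alpha
            w[j] = wj - t if wj > t else wj + t if wj < -t else 0.0
        b -= lr * grad_b
    return w, b

# One informative feature and one pure-noise feature:
X = [[-2, 1], [-1, -1], [-1.5, 1], [-0.5, -1], [0.5, 1], [1, -1], [1.5, 1], [2, -1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic_elastic_net(X, y, lam=0.1)
print(w)   # the coefficient of the noise feature is shrunk exactly to zero
```

With alpha close to 1 the absolute-value term dominates and coefficients of uninformative features are zeroed out, which is the behaviour exploited by the RLR method in the experiment below.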

Empirical study
The unbalanced data we have used in the empirical study come from the UCI Machine Learning Repository (Dua, Graff, 2019). Our intention was to choose economic data representing popular applications, such as the prediction of potential customers in a direct marketing campaign, churn analysis, and bankruptcy prediction. A summary is shown in Table 1. Each dataset represents a binary classification problem. The nominal variables in the Bank marketing dataset were transformed into binary variables for the purpose of the logistic regression models. Note that there were a lot of missing data in the Polish bankruptcy dataset.
We removed such rows so that this additional problem would not influence the results. Due to the importance of economic interpretation, we considered the logistic regression model, which is also characterized by a rapid estimation process and is sufficient when the class structure is not very complicated. On the other hand, we compared the results with random forests. These classifiers have very high prediction accuracy and perform automatic feature selection, but they act like a black box, which is a disadvantage when interpretation is of particular interest. An additional advantage of random forests is the low number of hyperparameters. When building trees without pruning, it is enough to determine their number and the number of variables randomly drawn at the tree nodes. Settings for these parameters are suggested in the source article (Breiman, 2001). In our investigation, the number of variables randomly selected in the nodes is approximately the square root of the number of predictors. The number of trees in the forest is set to 200. The model quality measures are estimated via 10-fold cross-validation.
It is noteworthy that the split into 10 folds was made only once so that all methods could be compared on the same training and test sets. The balancing and feature selection techniques were configured as follows. SMOTE: the number of nearest neighbours during the sampling process was set to 5.
The positive class was over-sampled to obtain equal class sizes.
Cut: direct learning on the original unbalanced training sets and the use of a shifted cut-off point for classification. The cut-off point was set to the fraction of the minority class in the training set.
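The Cut strategy amounts to a one-line change at prediction time. A sketch with illustrative posterior probabilities (the function name and the numbers are ours):

```python
def classify_with_cutoff(posteriors, cutoff):
    """Assign the minority class whenever its posterior probability exceeds the cutoff."""
    return [1 if p > cutoff else 0 for p in posteriors]

# Underestimated minority posteriors, as produced by a model trained on unbalanced data:
posteriors = [0.05, 0.12, 0.31, 0.45, 0.08]
print(classify_with_cutoff(posteriors, 0.5))    # -> [0, 0, 0, 0, 0]: default cutoff misses all
print(classify_with_cutoff(posteriors, 0.10))   # -> [0, 1, 1, 1, 0]: cutoff = minority fraction
```

Lowering the cutoff to the minority fraction recovers rare examples without changing the fitted model at all, which is what makes this technique so cheap.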
TL-Cut: as above, but the data were pre-processed with the Tomek links technique.
TEST: filtering of the features using statistical tests (the Wilcoxon rank sum test for quantitative features and the chi-squared test for qualitative ones). The significance level was set to 0.1.

SU: filtering of the features with the symmetrical uncertainty measure. We set the threshold of feature relevance to 0.01. The quantitative variables were discretised following the MDL method proposed by U. Fayyad and K. Irani (1993).
FCBF: the fast correlation-based filter of L. Yu and H. Liu (2004). The threshold of feature relevance, as well as of inter-correlation between predictors, was set to 0.01. Discretisation as in SU.

RFE: recursive feature elimination, where all variables corresponding to insignificant coefficients at a level of 0.05 are eliminated in each iteration.
STEP: forward stepwise regression. As the optimisation criterion we used the Bayesian information criterion (BIC).
RLR: regularized logistic regression with the elastic net penalty. The alpha parameter in eq. (6) was set to 0.9 in order to assign more weight to the absolute-value term, which is responsible for the feature selection effect. The penalty parameter λ was determined according to the minimal value of the Bayesian information criterion (BIC).
The first stage of our research was to examine the impact of balancing techniques on the feature selection process. Feature selection on the unbalanced training sets (Cut and TL-Cut) deserves a separate discussion. In these cases, the numbers of irrelevant variables selected are comparable to under-sampling, except for the Polish bankruptcy dataset. Slightly worse results were obtained after cleaning the data with Tomek links, but this mainly concerns the Bank marketing dataset. We suppose that this is because of the many binary variables in this dataset. For these reasons, we incorporated these techniques in further research. Table 2 shows the numbers of undetected irrelevant variables that correspond to the methods of feature selection. The classification results are reported in Table 3. To assess the significance of the differences between the means of AUC (or TPR) we conducted the Friedman rank sum test, separately for each technique of handling unbalance and for each dataset. The null hypothesis was rejected in all cases, thus a post hoc analysis was carried out with the use of the Nemenyi test. Due to the large number of results, we present them in a summarized form. We scored the feature selection methods according to the following principle: a method is given a point if its mean AUC (or TPR) does not differ significantly from the best result. The level of significance was set to 0.1.
The results summed over the three datasets are presented in Table 4. The winners in this scoring are SU, STEP and RLR. The first two methods are simultaneously among the top three from the point of view of identifying irrelevant variables (see Table 2). The third method in that ranking, FCBF, has failed in the comparisons of AUC and TPR. This is because it discards features too radically, removing important information in this way. Regularised logistic regression is among the winners, although it has not eliminated irrelevant variables perfectly. We suppose that this is because of the low values of the coefficients that correspond to the irrelevant variables. It is noteworthy that TEST failed only once, namely for TL-Cut in the Polish bankruptcy dataset. This method is simple and fast, thus it deserves a closer look. Experiments with a more restrictive significance level would be welcome in future work. The RFE algorithm yielded great results for the Bank marketing dataset but failed for the other datasets. In the next stage of the empirical study we compared the results obtained with random forests (RF). We ran this algorithm on the original, unbalanced data as well as on data cleaned with Tomek links (TL RF). In addition, balanced random forests (BRF) were also compared.
This method is the adaptation of random under-sampling to ensemble learning. It means that resampling is performed separately for each base model (Chen et al., 2004). The results are presented in
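The per-model balanced resampling behind BRF can be sketched as follows; the base learner is abstracted away, and the helper name is our own illustration of the idea in (Chen et al., 2004):

```python
import random

def balanced_bootstrap_samples(X, y, n_models, minority=1, seed=0):
    """Training sets for balanced random forests: each base model gets a bootstrap
    of the minority class plus an equally sized random draw from the majority class."""
    rng = random.Random(seed)
    min_idx = [i for i, c in enumerate(y) if c == minority]
    maj_idx = [i for i, c in enumerate(y) if c != minority]
    samples = []
    for _ in range(n_models):
        boot_min = [rng.choice(min_idx) for _ in min_idx]   # bootstrap the minority class
        draw_maj = [rng.choice(maj_idx) for _ in min_idx]   # same-size draw from the majority
        idx = boot_min + draw_maj
        samples.append(([X[i] for i in idx], [y[i] for i in idx]))
    return samples

X = [[i] for i in range(8)]
y = [1, 1, 0, 0, 0, 0, 0, 0]
for Xs, ys in balanced_bootstrap_samples(X, y, n_models=3):
    print(sorted(ys))    # each base-model sample is balanced: [0, 0, 1, 1]
```

Because each tree sees a different random slice of the majority class, the ensemble as a whole still exploits all majority examples while every individual tree learns from balanced data.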

Conclusions
Unbalanced data as well as irrelevant variables constitute a serious problem in modelling useful classifiers. Both problems have been taken into account in our empirical study, and the results obtained allow us to draw some conclusions. Over-sampling does not work in the context of feature selection. It is meaningful that even "intelligent" versions, which use information from the data, have failed in the presence of irrelevant variables. Besides, we note that the lower the fraction of the minority class, the more difficult the task of feature selection becomes. While in churn analysis the fraction of resigning customers is often at the level of a dozen or so per cent, in bankruptcy forecasting it does not reach 5 per cent. Depending on the prediction horizon (one or two years), it ranges from 0.6 to 4.6% in the Polish economy (Pociecha, Pawełek, Baryła, Augustyn, 2014). Note that the fraction of the minority class in credit card fraud detection is barely 0.1%.
The good performance of random under-sampling is particularly beneficial when processing large datasets. The learning of classifiers with the use of the original (or cleaned by Tomek links) datasets may also be recommended, since these approaches at least do not increase the sizes of the training sets. Among the most promising feature selection methods for unbalanced data we can list: filtering by the symmetrical uncertainty measure, forward stepwise regression and regularised logistic regression. The first method is model-free, fast and can be especially recommended for large datasets. Note that this filter could also be applied in multiclass classification problems. In turn, the forward stepwise procedure, which has turned out to be excellent in the recognition of irrelevant features, may be computationally expensive in high-dimensional problems. Again, combining the stepwise procedure with under-sampling can be advantageous. It seems that, due to the computational cost, the subset of features filtered by symmetrical uncertainty can be set as the starting point of the stepwise search. We consider this to be one of the directions of future work. Using tests of statistical significance is also worthy of further research; in our study this yielded nearly optimal results. Nevertheless, if interpretation is not very important, using random forests is definitely recommended.
This method performs automatic feature selection and does not require the transformation of qualitative features into dummy variables in the pre-processing step. Let us also note that the results obtained after applying Tomek links were usually comparable with those obtained for the original data. As this cleaning technique is not very time-consuming, it is worth taking into consideration during analysis.