A Comparison of Variable Selection Methods and their Sequential Application: A Case Study of the Bankruptcy of Polish Companies

Abstract

Research background: Although many new techniques have been developed in recent decades, there is still a lack of studies comparing the performance of variable selection methods. Bankruptcy prediction is a prime example of a conservative research field with a tendency to use classical approaches. Although the results of studies in this field are directly applied in banks and other financial institutions, the variables selected for these models can be biased by an author's preference for one technique.

Purpose: This work aims to compare different variable selection approaches and to introduce a new methodology of sequential variable selection that can be applied when a low-dimensional model is preferred.

Research methodology: This study was conducted on Polish companies' insolvency data from the period 2007–2013. The risk was modeled with logistic regression; hence, variables were selected with approaches suitable for linear models.

Results: The one-step methods did not lead to sufficient dimensionality reduction, while the sequential approach provided compact models that kept a high performance level. This method also allowed us to identify the main financial determinants of insolvency for the studied companies: the volume of total assets and the ratio of profit to total assets.

Novelty: This paper compares different variable selection methods and demonstrates the effectiveness of their sequential application for dimensionality reduction.


Introduction
Research on bankruptcy estimation does not have a clearly defined standard of variable selection. Some authors use a pool of variables selected based on theoretical studies, which can cause a lack of empirical quality of the model. In other studies, authors start from a large number of variables and decrease the dimensionality of the model using only one preferred technique. However, this approach can lead to a suboptimal model specification. This paper investigates the performance of different variable selection methods in modeling the risk of companies' bankruptcy.
Model predictive power is strongly dependent on data availability. The most complete source of data available to researchers is the financial statements of joint-stock companies. The obligation to publish statements makes it possible to obtain relatively large sets of diverse financial data. This leads to a high number of potential determinants and results in the problem of variable selection. Among modeling methods, logistic regression and discriminant analysis are two of the most commonly used, and both require independence of explanatory variables; otherwise, there is no theoretical guarantee that estimators will be unbiased. However, most financial indicators are highly correlated, which creates an additional challenge in using these two methods.
This study was conducted on data containing financial indicators of Polish companies. Even in the case of only Polish data, the studies conducted over the last 20 years differ in terms of the variables used, and hence it is hard to select a specific set of determinants based on them.
In addition to variable selection based on theoretical literature, there are two common quantitative approaches: selection with the t-statistic and stepwise regression, which were compared in this study with other approaches. In practice, the specifications obtained by researchers may vary even in the case of models estimated with an identical approach. As a result, it is hard to predict which set of variables will form the optimal specification. However, some typical variables appear in different models, therefore this work compares all obtained models as well as the results of other authors.
Modern studies of bankruptcy risk estimation vary considerably in terms of estimation approaches. Some researchers still use the discriminant analysis proposed by E.I. Altman (1968), while recent works experiment with newer modeling methods such as neural networks (Iturriaga, Sanz, 2015) and random forest (RF) classifiers (Barboza, Kimura, Altman, 2017). These are characterized by better predictive power than linear or decision tree models. However, higher complexity makes it harder to infer the effects of explanatory variables, which directly limits the possibility of analyzing the obtained specifications from a theoretical perspective. In this work, we focus on the theoretical and quantitative interpretation of the results. Therefore, variables were selected for the logistic regression model, as it provides an interpretable estimation of variable effects for a binary target. An alternative approach is methods based on decision trees (DTs). For example, L. Obermann and S. Waack (2015) used DTs to demonstrate that easily interpretable methods are comparable to more complex black-box methods for insolvency prediction. Since we decided to model insolvency with logistic regression, methods based on more complex models, such as random forests or neural networks, are omitted: they capture non-linear dependencies, which cannot be fitted with logistic regression without additional feature engineering. Therefore, while selecting different approaches, we had to choose the ones suitable for linear models.
The main purpose of this paper is to empirically compare prevalent variable selection methods with newer and less common ones. The choice of the bankruptcy prediction is motivated by the fact that this is a well-studied problem both from a theoretical and modeling perspective.
This work is structured as follows: section 2 presents the results of the studies of bankruptcy risk estimation, section 3 contains the results of the empirical study, while section 4 concludes these results.

Bankruptcy risk estimation studies
The fact that differences in financial indicators characterize the threat of bankruptcy was noticed in the studies of R. Smith and A. Winakor (1935) and C. Merwin (1942). This observation became a fundamental theoretical basis for modeling the risk of bankruptcy. Next, Altman's Z-score and J. Ohlson's (1980) O-score formed standard approaches to bankruptcy risk measurement and proved that financial data contains this information. Later studies, such as that of J. Sun, built models out of 36- and 72-variable sets. In this study, the main focus is quantitative selection, so all these techniques, together with simulated annealing (Rutenbar, 1989) and a genetic algorithm (Shin, Lee, 2002), were compared on a single data set.
Most of the variables in bankruptcy prediction models are related to liquidity, debt service capability, and profitability. For example, we can compare the classical Altman model, the first insolvency logit model created by Ohlson, and Hołda's logit model created for Polish companies. All of them use variables related to profit and the volume of assets. Both of the logit models have total liabilities in their specifications. In his empirical study, J. Traczynski (2017) concluded, using the BMA approach, that the ratio of total liabilities to total assets and the volatility of market returns are the only significant variables in all industry groups and overall samples. Even though the author used a non-classical approach, this result is consistent with classical models. However, these models do not contain exactly the same variables: each author represents the same indicators differently. This leads to a situation in which the authors of empirical studies need to create and select variables in their work. This study is dedicated to the variable selection part of this process, especially to the case where many financial indicators are available and an author needs to select the best predictors among them.
As was shown, researchers of insolvency use various methods, which leads to the following research question: what is the comparative performance of these methods on the same data set? Also, taking into account the fact that different studies include variables describing the same features of a company, we can hypothesize that it is possible to highlight the several most important determinants of insolvency, consistent with previous studies. The second hypothesis is that the sequential application of different variable selection methods can allow a greater reduction in dimensionality than a single-model approach. A reduced number of variables will also help in highlighting key indicators of bankruptcy risk.

Data pre-processing
After the data cleaning process, 56 financial indicators and 5,538 companies remained, 388 of which declared bankruptcy. The data was sampled with stratification using a 0.7 train-test split ratio; the separate test subset allowed estimating the predictive power of the model on new data.
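The sampling step described above can be sketched as follows: a stratified 70/30 train-test split that preserves the bankruptcy rate in both subsets. The data below is a synthetic stand-in with the same dimensions as in the study (5,538 companies, 56 indicators, 388 bankruptcies); the original financial indicators are not reproduced here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5538, 56))   # stand-in for the 56 financial indicators
y = np.zeros(5538, dtype=int)
y[:388] = 1                       # 388 companies that declared bankruptcy

# stratify=y keeps the (rare) bankruptcy class proportionally
# represented in both the 70% train and 30% test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0
)

print(y_train.mean(), y_test.mean())  # both close to 388 / 5538
```

Without stratification, a random split of a sample with only about 7% positives could leave the test set with too few bankrupt companies for a reliable AUC estimate.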

Removal of correlated variables
First, the removal of correlated variables was applied to the dataset as one of the variable selection methods. Pairs of variables with a correlation higher than 0.7 were selected, and from each pair the variable with the higher average correlation in the correlation matrix was removed. This approach reduced dimensionality from 56 to 23 variables. Logistic regression with these variables was compared with the model containing the full set. We can notice the Area Under the Receiver Operating Characteristic Curve (ROC-AUC, hereinafter AUC) drop from 0.638 to 0.5167 for the new model. However, the t-test shows the statistical significance of all coefficients. The new model's AUC is close to 0.5, which indicates the poor quality of this variable selection approach.
At the same time, the dimensionality reduction is not sufficient to distinguish the main bankruptcy determinants. Therefore, of these two models, the one containing all indicators will be used as the baseline at the first stage of variable selection.
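A minimal sketch of the correlation filter described above: for each pair with absolute correlation above 0.7, drop the member of the pair whose average absolute correlation with the whole set is higher. The column names and data are hypothetical; the exact tie-breaking order in the study may differ.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    corr = df.corr().abs()
    mean_corr = corr.mean()  # average |correlation| per variable
    to_drop = set()
    cols = corr.columns
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue
            if corr.loc[a, b] > threshold:
                # remove the variable that is, on average, more
                # correlated with the rest of the data set
                to_drop.add(a if mean_corr[a] > mean_corr[b] else b)
    return df.drop(columns=sorted(to_drop))

rng = np.random.default_rng(1)
base = rng.normal(size=(500, 3))
df = pd.DataFrame({
    "x1": base[:, 0],
    "x2": base[:, 0] + 0.1 * rng.normal(size=500),  # near-duplicate of x1
    "x3": base[:, 1],
    "x4": base[:, 2],
})
reduced = drop_correlated(df)
print(list(reduced.columns))  # one of x1/x2 is removed, x3 and x4 survive
```

The diagonal of the correlation matrix (all ones) contributes the same constant to every variable's average, so it does not affect which member of a pair is dropped.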

First stage of variable selection
At the first stage, the following variable selection approaches were used: LASSO, elastic net, stepwise regression, BMA, GA, and SA. Also, ridge regression and regression on variables selected with principal component analysis (PCA) were used to illustrate the performance achievable by less interpretable linear models (Jolliffe, 2011). For the latter model, 16 principal components were selected based on the 90% explained variance threshold. Then statistically non-significant variables were rejected, which led to a model with 3 principal components.
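The component-selection rule used for the PCA model can be sketched as: keep the smallest number of components whose cumulative explained variance reaches 90%. The data here is synthetic with a built-in low-rank structure; in the study this rule retained 16 of 56 components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# low-rank structure: 5 latent factors plus small noise, so a few
# components carry most of the variance
latent = rng.normal(size=(300, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(300, 20))

Xs = StandardScaler().fit_transform(X)
pca = PCA().fit(Xs)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# first index at which cumulative explained variance reaches 90%
n_components = int(np.searchsorted(cumvar, 0.90)) + 1
scores = PCA(n_components=n_components).fit_transform(Xs)
print(n_components, scores.shape)
```

The component scores would then serve as regressors in the logistic model, with non-significant components rejected afterwards, as described above.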
Specifications obtained by the different algorithms were compared with AUC and the Akaike information criterion (AIC) (Friedman, Hastie, Tibshirani, 2013). Performance metrics and the number of variables for the models obtained at this stage are shown in Table 1. The number of variables for models built on principal components is marked with an asterisk to highlight that these models require information from all original variables. The AIC cells for models with regularization are empty, as information criteria do not have an interpretation for them. The model with the best fit to the training data also had a relatively small number of variables; nevertheless, its performance on the test data is below average, which means that this model overfits the most. The BMA results can be considered optimal from the perspective of further theoretical interpretation. Figure 1 illustrates the probabilities of the 35 most likely specifications: the vertical axis lists the variables which appeared in them, and the horizontal axis shows the cumulative probabilities of these variable sets. Cell color indicates the sign of the parameter: red is positive, blue is negative. The figure shows the same set of variables for the 4 most probable models, so BMA clearly indicates a smaller dimensionality than the other approaches. Also, the posterior probability of the selected model is equal to 76%, which indicates a confident choice of this specification over the others.
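As an illustration of one first-stage selector, the sketch below fits an L1-penalised (LASSO) logistic regression, whose zeroed coefficients drop variables, and evaluates it with ROC-AUC on held-out data. The data, the penalty strength C, and the liblinear solver are illustrative choices, not the study's exact settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 20))
# only the first 3 variables actually drive the outcome
logit = X[:, 0] - 1.5 * X[:, 1] + 0.8 * X[:, 2]
y = (logit + rng.logistic(size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0
)

# small C = strong L1 penalty = more coefficients forced to zero
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
lasso.fit(X_tr, y_tr)

selected = np.flatnonzero(lasso.coef_[0])  # indices of surviving variables
auc = roc_auc_score(y_te, lasso.predict_proba(X_te)[:, 1])
print(len(selected), round(auc, 3))
```

Sweeping C traces the trade-off reported in Table 1: weaker penalties keep more variables and raise training fit, at the risk of the overfitting noted above.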

Second stage of variable selection
At the next stage, only BMA and LASSO regression were used, applied to the variable sets selected at the first stage. The first model is the LASSO regression on the genetic algorithm set, which converged to the same specification as the LASSO regression at the first stage. Therefore, these two models will not be considered in the further part of this work. The last LASSO regression was performed on the simulated annealing variable set and reduced dimensionality from 21 variables to 10, which makes it one of the biggest specifications. However, the performance of this model is below the average level, with the lowest testing AUC (0.7902) among all of the compared models.
The second half of this stage is the BMA approach for variable selection, applied in the same order as the first stage models. In the case of BMA on the backward stepwise regression results, the model converged to a specification with 8 variables. In Figure 2 we can see the posterior probabilities of the two most probable models, which have close values and differ only by one variable (Attr14 versus Attr7). The similar probability values can be explained by the correlation between these variables, which almost equals 1: Attr14 is (gross profit + interest) / total assets, whereas Attr7 is gross profit / total assets. So, the influence of interest on the difference between these indicators is not significant and makes them interchangeable from the perspective of the bankruptcy prediction model. Similar behavior can be observed for BMA on the simulated annealing variable set. The resulting model contains 7 variables and has average performance according to all metrics. The posterior probability of this model is 0.25, whereas the second most probable model has 0.23. These models also differ by two variables, which are also highly correlated and represent the proportion of operating profit divided by sales revenues; in the less probable case, the operational depreciation of fixed assets was added to the numerator. The models obtained at this stage are compared in (Table 2). The results indicate that BMA leads to the selection of models with better performance than LASSO regression. The model with the best fit to the training set as well as to the test set was obtained by the Bayesian approach. However, LASSO regression converged to a much smaller specification than the others and provided a well-performing model with only 3 variables. It is also worth noting that the first-stage BMA already provided a model with performance and variable reduction close to the level of the second stage.
Attr10   Equity / total assets                                            -   4
Attr32   (Current liabilities × 365) / production cost of products sold   +   4
Attr48   (Operating profit - depreciation) / total assets                 +   4
Attr22   Operating profit / total assets                                  +   3

Source: own elaboration.

Table 3 presents the financial indicators occurring 3 or more times in the models shown in (Table 2). The signs of the parameters are also presented in this table, and they are identical among models. All but the last two shown variables are consistent with intuition and theory.
The logarithm of total assets occurs in all models except BMA estimated on GA specification.
4 out of the 6 remaining variables are normalized by the total assets value; for example, Attr35, which is profit on sales / total assets, occurs in all second-stage models. Variables from this group capture situations with a change in profit and no change in profit from sales. Such a change occurs when a company receives a higher profit not by increasing sales but by reducing administrative costs, which is typical behavior of management in difficult financial situations.

Conclusions
This work compares different variable selection approaches, including non-classical methods like BMA, GA, and SA. The best set of variables from the perspective of performance was obtained by the backward stepwise regression. However, this model contains 33 variables, which is one of the largest specifications in this study. The multistage approach used in this study consisted of two stages. The results of the first stage indicated a lack of sufficient dimensionality reduction among the most commonly used models, whereas the second stage provided more compact models without a significant performance drop. At the first stage, BMA was the best single variable selection method, yielding a model with 7 variables.
Our results demonstrate the efficiency of a multi-stage approach, the second step of which was the estimation of the LASSO and BMA models on variable sets from the previous stage.
As expected, it allowed further dimensionality reduction. For instance, the specification obtained with the LASSO regression estimated on the stepwise regression set from the previous stage contained 3 variables with a relatively good performance level. The other models have higher dimensionality; however, their performance is better than that of the first-stage BMA model. The best-fitted specification was obtained by the application of BMA to the backward stepwise specification, which had performed the best at the first stage. Therefore, the sequential application of different variable selection methods, specifically the application of BMA to the best-performing model from the first step, allowed achieving better results from both perspectives.
The frequency analysis of the second-stage models' variables shows a tendency to converge to financial variables describing the same aspects of companies' condition: profit on sales, total assets, inventory, equity, current liabilities, and operating profit. The pool of variables, as well as their signs, is consistent with the results of other empirical and theoretical studies. The volume of total assets is present in 7 out of 8 models, and all of them have the company's profit normalized by total assets as a variable. This result indicates that these two indicators are the main financial variables in insolvency prediction.