Impact of the Regularization of Regression Models on the Results of the Mass Valuation of Real Estate

Abstract
Research background: Mass appraisal is a process in which multiple properties are appraised simultaneously using a uniform approach. Multiple regression models are one of the tools that can be used in this area. In real estate valuation, features are often described on an ordinal or nominal scale. Replacing them with dummy variables when the number of observations is insufficient leads to multicollinearity and, in addition, creates a risk of overfitting the model. One way to eliminate or weaken these phenomena is to introduce regularization, which penalizes a model for high values of its weights.
Purpose: The aim of the study is to verify the hypothesis that regularized regression reduces the errors of property valuation, and to determine which of the analyzed methods is the most effective in this context.
Research methodology: The article presents a study in which two types of regularization, ridge and lasso regression, are applied and assessed in terms of their impact on the errors of property valuation. The analyzed data set includes over 300 land properties valued by property appraisers. The key aspects of the study are the selection of optimal values of the regularization parameter and its influence on the model's errors for different numbers of observations in the training sets.
Results: The study showed that regularization improves valuation results and, more specifically, yields lower mean absolute percentage errors. The improvement in model effectiveness was more pronounced in the case of ridge regression. Another important result is that, for smaller training sets, regularization provided higher valuation accuracy than plain multiple regression models.
Novelty: The article confirms the effectiveness of regularization as a way to reduce the problem of multicollinearity or overfitting of the model. The results showed that ridge regression can be an effective way of modelling the value of real estate, especially when only a small amount of market data is available, which is an important conclusion in the context of the real estate market.
Keywords: property valuation, market analysis, regularization
JEL classification: C10, R30


Introduction
In the practice of real property valuation, two main approaches can be distinguished: individual and mass appraisal. In an individual appraisal, the entity valuing real estate focuses on one real property or on a small number of properties. In mass appraisal, by contrast, the subject of the appraisal is a large number of real properties of one type (e.g. Hozer, Kokot, Kuźmiński, 2002). In mass appraisal, the choice of method depends on the objectives and conditions regarding the real properties. For example, in Poland the legislator introduced three main objectives of mass property valuation: general property taxation, updating of perpetual usufruct fees and assessment of the economic effects of adopting and amending zoning plans. Mass valuations can also be useful to, for example, banks, which periodically update the value of the real estate that serves as mortgage collateral. Mass appraisal methods can also support investment decisions. In the practice and theory of real property mass appraisal, many models and algorithms can be distinguished (Jahanshiri, Buyong, Shariff, 2011).
The most frequently used models of property valuation are multiple regression models. Their popularity in property valuation, and in other areas, results mainly from their simplicity and ease of interpretation. However, these advantages come at a price. In order to build a good model, a number of conditions need to be met (e.g. Doszyń, 2012). Two problems that may occur when estimating multiple regression models are the multicollinearity of variables and model overfitting due to an insufficient number of observations. One way to reduce these undesirable effects is to regularize the model, which is typically achieved by constraining its weights (structural parameters). The article presents a study in which two types of regularization (ridge regression and LASSO) are applied, and their influence on the errors of real estate valuation is presented and discussed.
The aim of the study is to verify the hypothesis that regularization reduces errors in property valuations, and to determine which of the analyzed methods is more effective in this context. The study also considers the size of the data set (training set) on which the models are estimated. It is investigated whether the influence of regularization on valuation errors is stronger when the training set contains fewer observations. The subject of the study was 318 properties located in Szczecin, one of the largest Polish cities. These properties were subject to valuation due to the revaluation of perpetual usufruct fees.

Literature review
A general review of quantitative methods used in mass appraisal can be found in (Pagourtzi, Assimakopoulos, Hatzichristos, French, 2003). In that article the methods are divided into traditional (multiple regression, comparable, cost, income, profit and contractor's methods) and advanced, such as ANN (Artificial Neural Networks), hedonic pricing methods, spatial analysis, fuzzy logic, and ARIMA models.
An interesting comparison of modern approaches to mass appraisal is presented in (McCluskey, McCord, Davis, Haran, McIllhatton, 2013). That survey compares modelling approaches such as multiple regression (MLR), spatial autoregression (SAR), geographically weighted regression (GWR), and ANN. Neural networks are widely used in real estate market research. The most common applications concern valuation, but market rents are also modelled (Muczyński, Walacik, 2017).
T. Kauko and M. d'Amato (2008) classify appraisal methods into four groups: model driven methods, data driven methods, methods based on machine learning and expert methods.
Econometric methods are sometimes also used not directly in appraisal but, for example, to identify outlier transactions (e.g. Doszyń, Gnat, 2017).
From the perspective of this study, the most important publications are those on the use of regularized models in the real estate market. It should be stated here that the literature in this area is not broad. The paper (Kubus, 2016) presents the possibility of using the local regularization of regression models. The proposed procedure has proved to be effective. However, this study concerns modelling on the basis of a large dataset. In the real estate market, it is not uncommon that the number of available transactions is very limited. Therefore, the presented application of regularization in the case of a small set of data on real estate is a novelty in the scope of the mass valuation of real estate.

The Szczecin Algorithm of Real Estate Mass Appraisal
As was previously mentioned, there are a number of methods of mass appraisal. One example of such a method is the Szczecin Algorithm of Real Estate Mass Appraisal (SAREMA).
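In this study, an econometric form of this algorithm serves as a point of reference for the regularized models:

$$\ln(w_i) = \alpha_0 + \sum_{k=1}^{K} \sum_{p=2}^{k_p} \alpha_{kp} x_{kpi} + \sum_{j=2}^{J} \alpha_j \, laz_{ji} + u_i \tag{1}$$

where: $w_i$ is the unit value of the i-th real estate, $\alpha_0$ is the constant term, $K$ is the number of attributes, $k_p$ is the number of states of attribute k, $\alpha_{kp}$ is the coefficient for the p-th state of attribute k, $x_{kpi}$ is a zero-one variable for the p-th state of attribute k, $J$ is the number of location attractiveness zones, $\alpha_j$ is the market value coefficient for the j-th location attractiveness zone, $laz_{ji}$ is a dummy variable equal to one for the j-th location attractiveness zone, and $u_i$ is the random component.
The explained variable is the natural logarithm of the real estate unit value. Real estate values are determined by certified appraisers in individual appraisals. Real estate attributes are qualitative characteristics measured on an ordinal scale, so they are introduced into model (1) through dummy variables for each state of an attribute.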
Model (1) contains a constant term. In order to avoid strict multicollinearity of the explanatory variables, the dummy variable for the worst state of each attribute is omitted, hence the summation over $p = 2, ..., k_p$ in formula (1). In the interpretation, the omitted state of an attribute serves as a point of reference for the remaining states.
Model (1) also contains market value coefficients ($\alpha_j$), which can be treated as a proxy for location. They are estimated by introducing a dummy variable for each location attractiveness zone. Location attractiveness zones are defined by experts as areas within which the impact of location is homogeneous.
For the same reason, to avoid strict collinearity of the explanatory variables, the dummy variable for the worst (cheapest) location attractiveness zone is omitted. The omitted zone serves as a point of reference.
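The coding of an ordinal attribute described above can be illustrated with a minimal sketch. The function name `dummy_encode` and the attribute states are hypothetical and are used only for illustration; the only assumption is that states are listed from worst to best, so that the first one becomes the omitted reference state.

```python
def dummy_encode(values, states):
    """Encode an ordinal attribute as 0/1 columns, omitting the first (worst) state.

    `states` lists the attribute states from worst to best; the first one is
    the reference state and gets no column of its own.
    """
    columns = states[1:]  # the reference (worst) state is dropped
    return [[1 if v == s else 0 for s in columns] for v in values]


# Hypothetical attribute with three states, ordered from worst to best.
states = ["poor", "average", "good"]
rows = dummy_encode(["poor", "good", "average"], states)
print(rows)  # [[0, 0], [0, 1], [1, 0]]
```

A property in the reference state is represented by a row of zeros, so its effect is absorbed by the constant term and the coefficients of the remaining states are read relative to it.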

Linear models regularization
Regularization is achieved by setting constraints on the weights of the model. Different algorithms implement those constraints in different ways. Two types of regularization are used in this article: ridge regression and lasso regression. Both add a penalty on the size of the weights to the least-squares criterion: ridge regression penalizes the sum of the squared weights, and lasso regression the sum of their absolute values. The strength of the penalty is controlled by a regularization parameter, denoted β in the remainder of the article. The value of this parameter can be selected in a number of ways, from an arbitrary choice to the use of machine learning techniques that optimize it. In this study, the aim of adopting the appropriate strength of regularization is to obtain the best possible valuation results. The accuracy of the valuations obtained with the estimated models is determined by comparing them with the valuations carried out by certified property valuers.
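For reference, the two penalized criteria can be written in a common textbook form. The notation is an assumption made here for illustration: $\theta_j$ denotes the model weights (the constant term is usually left unpenalized), $n$ the number of observations, $m$ the number of weights, $y_i$ and $\hat{y}_i$ the observed and fitted values of the explained variable, and $\beta$ the regularization strength. Scaling conventions (for example a factor of $1/n$ or $1/2$ in front of the error term) differ between textbooks and software packages.

```latex
J_{\text{ridge}}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \beta \sum_{j=1}^{m}\theta_j^2
\qquad
J_{\text{lasso}}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \beta \sum_{j=1}^{m}\left|\theta_j\right|
```

For $\beta = 0$ both criteria reduce to ordinary least squares. As $\beta$ grows, the weights are shrunk towards zero, and in the lasso case some of them can become exactly zero.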

Empirical study
The described SAREMA procedure is used for the appraisal of 318 land plots located in the northern part of Szczecin, which is the capital of the West Pomeranian voivodeship, one of 16 Polish voivodeships. The real properties constitute a set for which an update of annual perpetual usufruct charges was conducted. The real properties were located in three clusters (referred to as location attractiveness zones, LAZs) containing different numbers of real properties. The area within which the appraised real properties lie is shown in Figure 2.
The attributes describing the properties and their states are presented in Table 1. All attributes except plot area are qualitative variables. They are introduced into the econometric model (1) as dummy variables, one for each state of an attribute (excluding the first, worst state). Plot area is a quantitative variable, but it is treated as a qualitative one, because market participants often treat it in this way. This conclusion comes from appraisers. With respect to real estate unit value, it is assumed that a small area is better than an average one and an average area is better than a large one.
The accuracy of the valuations is assessed on the basis of the absolute percentage error (APE):

$$APE_i = \frac{\left|w_i - \hat{w}_i\right|}{w_i} \cdot 100\%$$

where: $w_i$ is the actual unit value of the property determined by the property valuer and $\hat{w}_i$ is the theoretical unit value of the property determined from the model.
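The error measure can be sketched in a few lines. The function names and the unit values below are hypothetical; the mean of the individual APE values over a test set is the MAPE used later in the comparison of the models.

```python
def ape(actual, predicted):
    """Absolute percentage error of a single valuation, in percent."""
    return abs(actual - predicted) * 100.0 / actual


def mape(actuals, predictions):
    """Mean of the absolute percentage errors over a set of properties."""
    errors = [ape(a, p) for a, p in zip(actuals, predictions)]
    return sum(errors) / len(errors)


# Hypothetical unit values: appraiser's values vs. values from a model.
appraised = [200.0, 250.0, 400.0]
modelled = [210.0, 225.0, 400.0]
print(mape(appraised, modelled))  # 5.0
```

In this made-up example the individual errors are 5%, 10% and 0%, which gives a MAPE of 5%.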
The empirical study was carried out according to the following scheme. The set of 318 properties was divided 500 times into a test set of 68 properties and a training set of 250 properties. For each of the 500 training sets, the SAREMA model and its variants with ridge and LASSO regularization were estimated. In order to select the optimal strength of regularization, a 3-fold cross-validation procedure was carried out. Seventy different values of the β coefficient, ranging from 0.0001 to 1,000, were tested. The models with the best β were used to estimate the value of the properties in the test sets.
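The selection of β by cross-validation can be sketched as follows for the ridge case. This is an illustration under stated assumptions, not the procedure used in the study: the grid is assumed to be logarithmically spaced, the validation criterion is assumed to be the mean squared error, the penalty is added to the sum of squared errors, and the data are synthetic. All function names are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ridge(xs, ys, beta):
    """Ridge fit; returns [intercept, w1, ...]. The intercept is not penalized."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    for j in range(1, k):
        xtx[j][j] += beta  # penalty on the squared weights
    return solve(xtx, xty)


def predict(w, x):
    """Fitted value for one observation."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))


def cv_error(xs, ys, beta, folds=3):
    """Mean squared validation error over interleaved folds."""
    n = len(ys)
    total = 0.0
    for f in range(folds):
        test_idx = set(range(f, n, folds))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        w = fit_ridge(train_x, train_y, beta)
        total += sum((ys[i] - predict(w, xs[i])) ** 2 for i in test_idx)
    return total / n


# 70 candidate strengths from 0.0001 to 1,000; log spacing is an assumption.
betas = [10 ** (-4 + 7 * i / 69) for i in range(70)]

# Synthetic data: 30 observations with 4 dummy variables (illustrative only).
random.seed(0)
xs = [[random.randint(0, 1) for _ in range(4)] for _ in range(30)]
ys = [5.0 + 0.3 * x[0] - 0.2 * x[1] + random.gauss(0, 0.1) for x in xs]

best_beta = min(betas, key=lambda b: cv_error(xs, ys, b))
print(best_beta)
```

The same loop could be run with a lasso fit in place of `fit_ridge`; lasso has no closed-form solution, so it would need an iterative solver such as coordinate descent.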
The same procedure for estimating and testing the models was repeated with a reduced number of properties in the training sets. This time they consisted of 50 properties, with the size of the test sets unchanged at 68 properties. This scheme resulted in the estimation of more than 400,000 models (including those estimated for cross-validation). The final result of the estimation was 3,000 models.
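The reported numbers of models can be checked with simple arithmetic. The breakdown below is one plausible way to arrive at them, assuming one model fit per fold and per β value for each regularized variant, and three final models (plain, ridge, lasso) per division and training-set size.

```python
splits = 500      # divisions into training and test sets per training-set size
train_sizes = 2   # training sets of 250 and of 50 properties
regularized = 2   # ridge and lasso
n_betas = 70      # candidate regularization strengths
folds = 3         # cross-validation folds

# Fits performed only to select beta by cross-validation.
cv_fits = splits * train_sizes * regularized * n_betas * folds

# Final models: plain SAREMA plus the two regularized variants.
final_models = splits * train_sizes * (regularized + 1)

print(cv_fits)       # 420000
print(final_models)  # 3000
```

Under these assumptions the cross-validation fits alone account for 420,000 models, which is consistent with the figure of more than 400,000 given in the text.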
The target variable in the models was the value of 1 square meter of a property. The models reflect valuers' estimates rather than the market as such, so the results show how well the econometric models imitate the results of real estate appraisals conducted by valuers. Figures 3-5 show the collective results of the comparison of non-regularized models with models to which regularization was applied.
Figure 3 shows the coefficients of determination (R²) of the estimated models. Two observations stand out. Firstly, the R² of the plain SAREMA model is on average higher than that of the regularized models. This is an expected effect of regularization, which tends to flatten the results. Secondly, the coefficients of determination for smaller training sets point to another feature of multiple regression models (MLR), namely their tendency to overfit. With a smaller training set, the difference between the R² coefficients of the classic and regularized models is greater. In general, the values of these coefficients for models based on 50 properties are on average higher than for models based on 250 properties. Whether this is actually evidence of overfitting is revealed by the analysis of valuation errors in the test sets.
To compare the MLR, ridge and lasso models, the relative differences between the mean absolute percentage error (MAPE) of the MLR and ridge models, and of the MLR and lasso models, were determined for each test set and for both sizes of training sets.
Negative differences indicate that a lower MAPE occurred in a given test set for the classic SAREMA form; positive differences mean lower errors for the regularized models. Figures 4 and 5 show the distributions of these differences. In each comparison, positive differences outnumbered negative ones, which means that in most cases the models with regularization produced lower valuation errors than the models without it. For models estimated on the larger training sets, ridge regression gave lower valuation errors in 299 cases out of 500, and lasso regression in 269 cases. With small training sets, the regularized models had the advantage more often: 340 and 311 times, respectively. This supports the hypothesis that MLR models tend to overfit more often than regularized ones, especially when the number of properties in the training set is small. Interestingly, although lower errors were obtained more often for regularized models, in extreme cases the non-regularized models had a greater advantage over the regularized ones than vice versa. This is visible in the longer left tails of the distributions (especially for smaller training sets).
The shares of regularized models with lower and higher valuation errors than plain multiple regression are presented in Table 2. The use of regularization was particularly beneficial when the training sets were smaller. It helps to avoid overfitting and reduces the problems resulting from the multicollinearity of variables.
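The comparison described above can be sketched as follows. The sign convention follows the text (a positive difference means that the regularized model has the lower MAPE); dividing by the MAPE of the plain model is an assumption, since the exact definition of the relative difference is not given here. The MAPE values are made up for illustration.

```python
def relative_difference(mape_mlr, mape_reg):
    """Relative MAPE difference; positive when the regularized model is better."""
    return (mape_mlr - mape_reg) / mape_mlr


def share_positive(diffs):
    """Share of test sets in which the regularized model had the lower MAPE."""
    return sum(1 for d in diffs if d > 0) / len(diffs)


# Hypothetical MAPE values (in percent) for four test sets.
mlr = [10.0, 12.0, 8.0, 9.0]
ridge = [9.0, 12.6, 7.6, 8.1]

diffs = [relative_difference(m, r) for m, r in zip(mlr, ridge)]
print(share_positive(diffs))  # 0.75
```

Applied to the counts reported in the text, 299 positive differences out of 500 correspond to a share of 0.598 for ridge regression with the larger training sets.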

Conclusions
As indicated in the literature, the linear model can be improved by replacing plain least squares fitting with alternative fitting procedures (James, Witten, Hastie, Tibshirani, 2017, p. 203). Such techniques make it possible to maintain the advantages of linear regression models while improving the accuracy of prediction, reducing the problem of the collinearity of variables or their low variability (which may occur in the real estate market in particular) and increasing the interpretability of models. The article presents an example of using the regularization of multiple regression models in order to improve the results of the mass valuation of real estate. In 500 repetitions, the valuations of 68 properties obtained with the econometric form of the Szczecin Algorithm of Real Estate Mass Appraisal (SAREMA) were compared with those obtained from models supplemented with a regularization component. The results obtained from 3,000 models show that regularization in most cases reduced valuation errors in the test sets. The improvement in performance was more pronounced in the case of small training sets. For both the 250- and 50-element training sets, better results were obtained using ridge regression than lasso regression, so for this particular set of properties ridge regression proved to be the more effective way to minimize valuation errors. For small training sets the differences in MAPE were much larger than for larger sets (both in cases indicating an advantage of regularization and in cases indicating an advantage of plain MLR). This means that with a larger training set, regularization improves (or worsens) the results of valuations to a lesser extent. It can be concluded that the less data is available on the real estate market, the more worthwhile it is to apply regularization.
Further planned research will be aimed at verifying the results obtained in other markets as well as for other types of real estate.