APPLICATION OF MULTIVARIATE TIME SERIES CLUSTER ANALYSIS TO REGIONAL SOCIO-ECONOMIC INDICATORS OF MUNICIPALITIES

The socio-economic development of municipalities is defined by a set of indicators in a period of interest and can be analyzed as a multivariate time series. It is important to know which municipalities have similar socio-economic development trends when recommendations for policy makers are provided or datasets for real estate and insurance price evaluations are expanded. Usually, key indicators are derived from expert experience; however, this publication implements a statistical approach to identify key trends. Unsupervised machine learning was performed by employing K-means clusterization and principal component analysis for a dataset of multivariate time series. After 100 runs, the result with the minimal total error was taken as the final clusterization. The dataset represented various socio-economic indicators in municipalities of Lithuania in the period from 2006 to 2018. Significant differences were observed for the indicators of municipalities in the cluster which contained the 4 largest cities of Lithuania, and in another cluster containing 3 districts of the 3 largest cities. This article proposes a robust approach for identifying socio-economic differences between regions where real estate is located. For example, the evaluated distance matrix can be used for adjustment coefficients when applying the comparative method for real estate valuation.


Introduction
The socio-economic development of a country, municipality or region can be defined by a set of indicators. Although the socio-economic situation is usually evaluated annually according to the indicator values in the respective year, the variation of indicators can also be presented as multivariate time series. In recent research, the amount of analyzed data has increased, as it covers indicator values, territorial units and periods. That is why machine learning methods are employed in the latest research. The multivariate time series clustering of socio-economic indicators enables municipalities with similar development to be identified based on the relationships between the indicators and their trends.
Market trend monitoring and the quantification of a region's performance is essential in order to provide evidence-based policy recommendations or to apply adjustment coefficients when valuating real estate. The authors of this publication previously published a methodology for the comparative method for real estate valuation (Gružauskas, et al. 2020). The methodology provided in this publication provides more detailed information for the calculation of the location adjustment coefficient by using the distance matrix obtained from the clustering results. However, the proposed methodology can be applied for other reasons to analyze the real estate market. Belej and Kulesza (2014) analyzed real estate prices in Poland and quantified the high inertia of real estate markets, which manifested during rapid structural changes. Kokot (2020) analyzed the influence of socioeconomic factors on housing prices in Poland and proposed a City Wealth Synthetic Measure. Usman et al. (2020) conducted an extensive literature analysis of property market segmentation into submarkets to improve the analysis of real estate prices. The publication mainly describes 3 types of methods to segment the market, i.e. predefined regions, statistically obtained regions and their combination. The reviewed methods focus mainly on spatial clustering approaches, and only briefly mention the importance of temporal analysis, thus the proposed methodological approach of our publication could provide a tool for market segmentation. Nugroho et al. (2020) analyzed regional economic growth to form regional clustering based on the speed of house price growth, so that the monetary policy in the housing sector achieves the target. However, real estate analysis is not only limited to price analysis. Asamoah et al. (2019) conducted an extensive literature analysis to determine the influence of economic indicators which are important for players in the construction industry including policy makers. 
The determination of economic factors of the construction industry could reduce the incidence of the high failure rate of construction firms. Kazak et al. (2017) analyzed the ageing society to determine which segments of real estate should focus on older people. Thus, the proposed methodological approach in the publication could be applied to determine patterns in time series data with robust insights. The proposed methodological approach in our publication could also help to identify other trends and form indicators of the regions. One of the major concerns in today's world is sustainability, and thus various indexes have been developed to analyze this phenomenon in the regions. For example, Manzhynski et al. (2016) measured the sustainability performance level of the Baltic region by 4 types of indicators, such as Adjusted Net National Income, Adjusted Net Savings, Environmental Performance Index, Human Development Index and Sustainable Value. The Vilnius Institute of Policy Analysis developed a municipality welfare index, which combines 5 categories, i.e.: evaluations of social security, physical safety, economic level, education level and demography (Vilnius Institute of Policy Analysis, 2019). Salvati and Carlucci (2014) proposed a sustainable development indicator for Italy; the research paper combined a wide range of variables and used a statistical tool to determine the composite indicator weights. A similar technique was used by Seidel et al. (2019), who developed an indicator to measure organic agriculture. Also, Senna et al.
(2019) used a statistical tool to develop a water poverty index. The main difference between the proposed indicators is the approach which was used to determine the weights of the composite indicators. One category of indicators is based on the experts' decision as to what weights and indicators should be used in the composite indicator. Another category employs statistical tools, such as principal component analysis, factor analysis, clustering analysis and others. A detailed description of the available tools has been provided by the Organisation for Economic Co-operation and Development (Mattes & Sloane, 2015). A multivariate statistical analysis was used to analyze the spatial and socio-environmental consequences of applying general spatial planning in the municipalities of Catalonia (Serra, et al. 2014). Greco et al. (2019) provided a comprehensive comparison of the main existing approaches. The goal of these indicators is to quantify the performance of regions and to provide recommendations for policy makers. The described composite indicator approach uses statistical tools to derive the indicators for measuring the region's performance. Usually, it is experts who propose approaches to measure performance. Nowadays a large amount of data is used to describe the performance of the region and machine learning can be applied to obtain insights regarding the economic situation in the region. Einav and Levin (2013) analyze the value of big data and predictive modelling tools. They stated that large-scale administrative datasets and proprietary private sector data can greatly improve the measurement, tracking and description of economic activity. Athey and Luca (2019) indicated that technology companies employ economists more often than ever. This also results in developing machine learning approaches for public policy.
www.degruyter.com/view/j/remav vol. 29, no. 3, 2021
In the publication by Athey and Luca (2019), an example was provided to demonstrate how economists used statistical data to help explain gentrification in neighborhoods. Athey (2019) stated that unsupervised learning can be used to create new variables without human judgement in economic analyses.
When analyzing data on a regional level, the unsupervised learning approach could be used to identify similar regions and obtain insights which explain their performance levels. Municipality clusterization can be applied to predict the demand of economic stimulation, to identify similar social problems, to monitor social and economic development, to plan logistics, to extend the datasets of similar objects in realty price evaluation, insurance costs, etc. Cluster analysis of municipalities based on social and economic development indicators can be applied to identify the regions which are in the highest need of economic stimulation (Brauksa, 2013). Brauksa (2013) stated that one of the principal factors of the same group is the region the municipality is located in. A factorial and component analysis with hierarchical clustering was used to group the municipalities of Lithuania based on their socio-economic situation, in order to identify municipalities which are the most attractive for the foreign investment (Burinskienė & Rudzkiene, 2004). The K-means algorithm was applied to group municipalities of Slovenia in order to examine social-economic differences among municipalities (Rovan & Sambt, 2003). The clusterization indicated significant differences in socio-economic development and its result was one of the criteria for the approval of project funds. The comparison of regions in Visegrad Group Plus countries was performed according to the Human Development Index in (Majerova & Nevima, 2017). A combination of regression analysis, Ward and K-means methods was performed to cluster the regional labor and vocational training market in Germany (Kleinert, et al., 2018;Blien, et al. 2010). Rezankova (2014) applied a clustering algorithm on data of enterprise, macroeconomic and economic activity by age groups to European countries, including Lithuania. 
Majerova and Nevima (2017) applied a hierarchical clustering method, measuring the distance between the clusters as the squared Euclidean distance and using the Ward method, to the human development index after z-score normalization of the data. Augustysnski and Laskos-Grabowski (2018) compared various clustering algorithms and determined that the best results were achieved on their dataset by applying a compression-based dissimilarity measure. In most research, clusterization was performed for the indicators of a specific year. In the second step, the change between the clusters is usually analyzed (Burinskienė & Rudzkiene, 2004; Majerova & Nevima, 2017). The proposed approach enables clusters of similar municipalities to be determined with respect to the time series of socio-economic indicators and the relationships between the indicators.
In cluster analysis, the result depends on the applied method and the parameters that are considered in the model. Different methods often give different results, and even the same method can give different results in different runs. For example, the results of clusterization using K-means depend on the random initial distribution. However, in the clustering of municipalities, the aim is to find similar development trends between municipalities, and a slightly different clusterization result is not essential. Although some research analyzes the development of the region, clusterization is usually performed for the annual data, and then the clustering results of different years are compared. In this paper, we consider the development of different municipalities based on the change of demographic and economic indicators provided as time series.
The aim of this paper is to demonstrate an application of the multi-variate time series clustering algorithm for real-life data. The data consisted of time series of social and economic indicators of the municipalities of Lithuania in the period between 2006 and 2018. The algorithm for multi-variate time series clusterization was presented in (Li, 2019) and includes common principal component analysis, multivariate time series analysis and k-means clusterization.

Data
The multivariate time series are analyzed in various fields of the economy, including forecasting stock prices and interest rates, the development of regions, etc. Therefore, the general concept of multivariate time series is discussed in this section. The dataset consists of $N$ multivariate time series:

$$X = \{X_1, X_2, \ldots, X_N\}, \tag{1}$$

here the element $X_k$ represents the k-th object and defines the change of its parameters in time. $X_k$ is a matrix of size $n_k \times m$, here $n_k$ is the length of $X_k$ and $m$ is the number of variables:

$$X_k = \begin{pmatrix} x^{(k)}_{11} & \cdots & x^{(k)}_{1m} \\ \vdots & \ddots & \vdots \\ x^{(k)}_{n_k 1} & \cdots & x^{(k)}_{n_k m} \end{pmatrix}, \tag{2}$$

here $x^{(k)}_{ij}$ is the i-th element of the j-th variable of the k-th object. As the objects of the same type are analyzed, they usually have the same variables which describe the development of the object. The length of the time series can be different for the object as objects can be observed in different intervals.
When the variables in time series represent values of various origins, they can obtain significantly different values. In order to use different variables with the same weight in the calculations, values of variables are standardized according to the following formula:

$$z^{(k)}_{ij} = \frac{x^{(k)}_{ij} - \mu_j}{\sigma_j}, \tag{3}$$

here $z^{(k)}_{ij}$ is the i-th standardized item of the j-th variable of the k-th object, $x^{(k)}_{ij}$ is the respective original value, $\mu_j$ is the mean value of the j-th variable for the set of all objects, and $\sigma_j$ is the standard deviation of the j-th variable for the set of all objects. The dataset of standardized multivariate time series is considered in the time series analysis:

$$Z = \{Z_1, Z_2, \ldots, Z_N\}. \tag{4}$$

This makes it possible to analyze data of different variables with the same weight, as all values of one variable are standardized to the distribution with the mean value equal to 0 and standard deviation equal to 1.
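The pooled standardization described above can be illustrated with a minimal sketch. It assumes that each object is stored as a NumPy array with time points in rows and variables in columns, and that the population standard deviation is used; the function name `standardize` and the toy data are hypothetical and not part of the original study.

```python
import numpy as np

def standardize(series_list):
    """Z-score every variable with the mean and standard deviation pooled
    over all objects and all time points (assumed reading of Eq. 3)."""
    stacked = np.vstack(series_list)      # shape: (sum of series lengths, m)
    mu = stacked.mean(axis=0)             # mean of each variable
    sigma = stacked.std(axis=0)           # population std of each variable (assumption)
    return [(x - mu) / sigma for x in series_list]

# Toy data: 3 objects, 2 variables on very different scales,
# observed for 4, 5 and 3 time points respectively.
rng = np.random.default_rng(0)
raw = [rng.normal(loc=[1000.0, 0.5], scale=[200.0, 0.1], size=(n, 2))
       for n in (4, 5, 3)]

z = standardize(raw)
pooled = np.vstack(z)
# Pooled over all objects, each variable now has mean ~0 and std ~1.
print(pooled.mean(axis=0), pooled.std(axis=0))
```

Because the mean and standard deviation are shared by all objects, differences in level and trend between objects are preserved, which is what the clustering step relies on.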

Common Principal Component Analysis
The algorithm for multivariate time-series clusterization was originally designed for the field of engineering (Li, 2019). In this publication, the algorithm was adapted to clustering time series of economic data. The algorithm consists of a combination of common principal component analysis and k-means clusterization. The common principal component analysis is broadly explained in (Li, 2019). Firstly, the multivariate time series is transformed to a covariance matrix:

$$\Sigma_k = \operatorname{cov}(Z_k), \tag{5}$$

here $Z_k$ is the standardized time series of the k-th object and $\Sigma_k$ is the covariance matrix of the k-th object. Although the calculation of the covariance matrix itself uses normalized (centered) values of the time series, it should be noted that, in this case, normalization is performed only with respect to the analyzed time series. Moreover, the deviation is not changed in this normalization. In general, variables can obtain values of significantly different magnitudes. That is why the standardization of time series should be performed before common principal component analysis.
Secondly, the average covariance matrix is calculated according to the following formula:

$$\bar{\Sigma} = \frac{1}{N} \sum_{k=1}^{N} \Sigma_k, \tag{6}$$

here $N$ is the number of objects and $\Sigma_k$ is the covariance matrix of the k-th object. This matrix generalizes the covariance of the variables for all objects of the dataset. The averaged covariance matrix is decomposed using singular value decomposition (SVD). The result of the decomposition is a sorted vector $(\lambda_1, \lambda_2, \ldots, \lambda_m)$ of eigenvalues and a matrix $U = (u_1, u_2, \ldots, u_m)$ of the respective eigenvectors, where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$. The importance of the information stored in an eigenvector is defined by its respective eigenvalue. The most valuable information is provided in the first components and, therefore, they are used in the analysis. This enables the dimensionality of the data to be reduced and the most significant information to be retained. Although there are no general criteria to define the number of principal components, it is usually chosen based on the magnitude of eigenvalues. The common space is constructed of the first $p$ eigenvectors as $S = (u_1, u_2, \ldots, u_p)$. Each multivariate time series can be transformed to the common space by the following transformation:

$$V_k = Z_k S, \tag{7}$$

here $Z_k$ is the standardized time series of the k-th object and $V_k$ is its projection to the common space.
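The construction of the common space can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function name `common_space` and the random toy data are hypothetical, and `np.cov` is used with its default sample normalization, which is an assumption because the normalization of the covariance matrix is not stated in the text.

```python
import numpy as np

def common_space(series_list, p):
    """Return the first p common principal axes (an m x p matrix)
    and all eigenvalues of the averaged covariance matrix."""
    # Covariance matrix of each object; rows are time points, columns are variables.
    covs = [np.cov(z, rowvar=False) for z in series_list]
    avg_cov = np.mean(covs, axis=0)
    # For a symmetric positive semi-definite matrix, SVD returns the eigenvectors
    # (columns of u) and the eigenvalues (s), sorted in descending order.
    u, s, _ = np.linalg.svd(avg_cov)
    return u[:, :p], s

# Toy data: 5 objects, 13 time points, 4 (already standardized) variables.
rng = np.random.default_rng(1)
series = [rng.normal(size=(13, 4)) for _ in range(5)]

S, eigenvalues = common_space(series, p=2)
projected = series[0] @ S           # projection of the first object to the common space
print(S.shape, projected.shape)     # (4, 2) (13, 2)
```

Averaging the covariance matrices before the decomposition is what makes the axes common to all objects, so that every projected series has the same reduced number of columns.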

K-means Clusterization
The clusterization of multivariate time series is based on the error which is obtained after the projection of the multivariate time series to the common space $S$ and its reconstruction to the initial space. The resulting time series is obtained by the formula (Li, 2019):

$$Y_k = Z_k S S^{T}, \tag{8}$$

here $Z_k$ is the time series of the k-th object after standardization and $S$ is the common space defined by the projection axes. As only the first $p$ components are used in the analysis, the initial and resulting time series are not equal and the reconstruction error $E_k$ of the k-th object is calculated as the Frobenius norm of the difference between the time series $Z_k$ and $Y_k$:

$$E_k = \lVert Z_k - Y_k \rVert_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} \left( z^{(k)}_{ij} - y^{(k)}_{ij} \right)^2}. \tag{9}$$

Here, $n$ is the length of $Z_k$ and $m$ is the number of variables in the time series. The magnitude of the error depends on the number of principal components. If all the components are used in the analysis ($p = m$), no information about the object is lost during transformation and the error is generated only because of numerical calculations; it is therefore close to 0. Similarly, if some variables are collinear in the multivariate time series, the respective eigenvalues can be close to 0. The elimination of such eigenvalues results in an insignificant loss of information and a small error after transformation and reconstruction. The error increases as the number of principal components decreases because more information is eliminated from the analysis.
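The behavior of the reconstruction error can be checked numerically with a short sketch. The function name `reconstruction_error` and the random toy series are hypothetical; the projection axes here are taken from the covariance matrix of a single series purely for illustration.

```python
import numpy as np

def reconstruction_error(z, space):
    """Frobenius norm of the difference between a series and its
    reconstruction after projection to the given axes (columns of space)."""
    reconstructed = z @ space @ space.T
    return np.linalg.norm(z - reconstructed)   # Frobenius norm for 2-D input

# Toy series: 13 time points, 4 variables.
rng = np.random.default_rng(2)
z = rng.normal(size=(13, 4))

# Orthonormal axes sorted by decreasing eigenvalue.
u, _, _ = np.linalg.svd(np.cov(z, rowvar=False))

# Error for p = 1, 2, 3, 4 retained components.
errors = [reconstruction_error(z, u[:, :p]) for p in range(1, 5)]
print(errors)
```

The printed errors do not increase as more components are retained, and the last value (all components used) is zero up to rounding, in line with the discussion above.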
The number of clusters is chosen empirically based on the total error generated by the clusterization using different numbers of clusters. In the initial step, the set $C_1, \ldots, C_K$ of clusters is formed by assigning the objects to clusters randomly. The common space $S_i$ of the i-th cluster is constructed with respect to all objects assigned to the i-th cluster and defines the centroid of the cluster. The construction of this common space is described in the previous section. For all objects, the reconstruction error is calculated for all centroids of clusters. The object is assigned to the cluster for which the minimal reconstruction error is determined. Calculations are repeated until the clusters no longer change or the maximum number of iterations is reached. In order to reduce the effect of random initialization, clusterization is performed for the determined number of runs and the result with the smallest total error for all clusters is taken as the final result. It should be noted that a small number $p$ of the principal components used in the analysis results in a relatively large error after projection to the common space and reconstruction. This may lead to situations where the only object of a cluster has a smaller reconstruction error with respect to the centroid of another cluster and is therefore reassigned to that cluster in the following iteration.
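The iterative procedure above can be sketched in a compact form. This is a simplified illustration and not the implementation used in the study: all function names are hypothetical, the toy data are synthetic, and the handling of an empty cluster (reseeding it with one randomly chosen object) is an assumption, since the text does not describe this case.

```python
import numpy as np

def common_space(members, p):
    """First p eigenvectors of the averaged covariance matrix of the members."""
    covs = [np.cov(z, rowvar=False) for z in members]
    u, _, _ = np.linalg.svd(np.mean(covs, axis=0))
    return u[:, :p]

def reconstruction_error(z, space):
    """Frobenius norm of the residual after projection and reconstruction."""
    return np.linalg.norm(z - z @ space @ space.T)

def cluster_once(series, n_clusters, p, max_iter, rng):
    labels = rng.integers(n_clusters, size=len(series))   # random initial assignment
    for _ in range(max_iter):
        spaces = []
        for c in range(n_clusters):
            members = [z for z, lab in zip(series, labels) if lab == c]
            if not members:   # assumption: reseed an empty cluster with a random object
                members = [series[rng.integers(len(series))]]
            spaces.append(common_space(members, p))
        # Error of every object with respect to every cluster centroid.
        errors = np.array([[reconstruction_error(z, s) for s in spaces]
                           for z in series])
        new_labels = errors.argmin(axis=1)
        converged = np.array_equal(new_labels, labels)
        labels = new_labels
        if converged:
            break
    return labels, errors.min(axis=1).sum()

def cluster(series, n_clusters, p, n_runs=20, max_iter=100, seed=0):
    """Repeat the clusterization and keep the run with the smallest total error."""
    rng = np.random.default_rng(seed)
    runs = [cluster_once(series, n_clusters, p, max_iter, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[1])

# Toy data: two groups of series whose variables co-move in different directions.
data_rng = np.random.default_rng(42)

def make_series(direction):
    trend = data_rng.normal(size=(13, 1))
    return trend * np.array(direction) + 0.05 * data_rng.normal(size=(13, 3))

series = ([make_series([1.0, 1.0, 0.0]) for _ in range(4)]
          + [make_series([1.0, -1.0, 0.0]) for _ in range(4)])

labels, total_error = cluster(series, n_clusters=2, p=1)
print(labels, total_error)
```

In this toy setting the first four and the last four series should end up in separate clusters, because each group is reconstructed well only by its own common axis. The cluster numbering itself is arbitrary and can differ between seeds.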

Data
Lithuania is divided into 60 municipalities. Some of them cover cities (Vilnius, Kaunas, Klaipeda, etc.), with their main characteristic being a high population density. These municipalities are surrounded by regional municipalities including the suburbs of the cities. The development of the cities has an impact on the development of the regional municipalities around them. In other cases, municipalities cover average-sized towns, with respect to their population and surroundings. In total, annual values of 27 indicators covering the period from 2006 to 2018 were used in this research. The indicators can be classified into 3 categories, namely demographic, housing and economic indicators. The main demographic indicators consist of population size, population density, ageing index, births and deaths. Additionally, indicators which describe the movement of residents are added. The additional indicators consist of newcomers, emigrants, immigrants, departures, net migration, net inside migration, and their combination. These indicators are important to describe the population tendency and needs. For example, people of a specific age group have a tendency to go to university, to rent flats, to purchase housing, to travel and so on. Thus, housing indicators are important when analyzing the change of demographic indicators in the region. The housing indicators consist of number of dwellings, number of houses, number of non-residential estate, residential fund, useful area per person and average dwelling size. The last category is economic indicators, which consist of the number of unemployed people, number of employees, employment rate, ratio of unemployed to those of an active working age, monthly wage, direct foreign investments, municipality income and municipality expenditures. The economic indicators are important to describe the potential of income and employment in the region. 
The integration of population, housing and economic indicators provides a better understanding of the region's socio-economic tendencies.
In order to use different indicators with the same weight in the calculations, values of indicators are standardized according to Eq. 3. To maintain the tendencies of indicators for the analyzed period, the mean value and standard deviation are calculated for all municipalities and years of the analyzed period. This standardization enables the change of indicators which have values of significantly different magnitudes or are given in percentages to be compared, as the values are standardized to a distribution with the mean value of 0 and standard deviation of 1.

Determining Number of Common Principal Components for the Analysis
Clusterization results depend on the number of the common principal components used in the analysis. To extract the most important features of the time series, only the first few principal components should be used. The eigenvalue analysis has been used to determine the number of the components which should be used in the analysis. The eigenvalues of the covariance matrix for all municipalities in one cluster are provided in Fig. 1. The eigenvector of the largest eigenvalue defines the axis along which the data are most widely spread. Small eigenvalues indicate that the data vary little along the respective axis. In the numerical example, the first 3 principal components are employed. These components can represent different features in different clusters. However, this is one of the advantages of the common principal component analysis, as it combines objects with features which are typical for the same cluster.
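One possible way to turn the eigenvalue inspection into a rule is sketched below. The text does not state a numerical criterion, so the cumulative-share rule, the 90% threshold, the function name `n_components` and the eigenvalues in the example are all assumptions made for illustration; they are not the values shown in Fig. 1.

```python
import numpy as np

def n_components(eigenvalues, threshold=0.9):
    """Smallest number of leading eigenvalues whose cumulative share of the
    total reaches the threshold (eigenvalues must be sorted descending)."""
    share = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    p = int(np.searchsorted(share, threshold)) + 1
    return min(p, len(eigenvalues))     # guard against rounding at threshold = 1

# Hypothetical eigenvalues, sorted in descending order.
eigenvalues = np.array([5.0, 3.0, 1.5, 0.3, 0.2])
print(n_components(eigenvalues))        # 3: the first three eigenvalues cover 95%
```

A fixed threshold on the individual eigenvalues would be an equally simple alternative; either rule should be checked against a plot of the eigenvalues before it is relied on.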

Clusterization Results
To determine the number of clusters which should be used to obtain distinct clusters, clusterization was performed for various numbers of clusters. The total error was calculated as the sum of the distances of the objects to their centroids. It should be noted that, contrary to the standard K-means method, if a cluster consists of only one municipality, the distance to the cluster center (error) can be greater than zero. This is due to the fact that only a specific number of principal components is used in the transformation and reconstruction of the multivariate time series, and some information can be lost, resulting in the error. The total error was calculated after 100 iterations or once the cluster assignments stopped changing. The dependency of the total error on the number of clusters can change if another initial distribution is used in the initialization step of K-means. That is why 10 runs were performed to determine the trend of dependency of total error on the number of clusters. The errors of individual runs are provided in blue (Fig. 2); the black curve represents the mean value of the error which was calculated for the respective number of clusters. Five clusters were used in successive calculations, as the value of the averaged total error demonstrates a steep fall up to this number (Fig. 2). [...] 50)) are assigned to one cluster. If municipalities are grouped to three clusters, the fifth largest city (Panevezys c. (57)) is also assigned to the same cluster. Similarly, the districts of the three largest cities are assigned to another cluster as they demonstrate similar development of macro-economic indicators. The lists of municipalities assigned to clusters are presented in Table 1 if the number of clusters is equal to 5. The numbers in the brackets next to the title refer to the municipality identity number in Fig. 3.
Table 1 Municipality groups after clusterization to 5 clusters

The distances between the centroids of 5 clusters are given as the distance matrix in Fig. 4. The centroids of the first two clusters (cluster of largest cities and cluster of their districts) significantly differ from the remaining three clusters which represent the remaining municipalities. It is also worth noticing that the distances between clusters 3, 4 and 5 are smaller than the distance between clusters 1 and 2. In order to analyze the influence of the different groups of macro-indicators, clusterization to 5 clusters using separate groups of economic, housing and demographic indicators was performed. As in the previous clusterization, 3 principal components were used in the analysis. The results of the clusterization have been presented in Fig. 5. All results show that the two largest cities (Vilnius c. (37) and Kaunas c. (50)) are grouped to the same cluster. For clusterization based on economic indicators, the third largest city (Klaipėda c. (13)) is also included in this cluster. Clustering based on the demographic indicators groups the four largest cities into the same cluster. It is worth noticing that clusterization based on the demographic indicators groups the municipalities into 4 clusters, although 5 clusters were used as the initial number of clusters. Besides the cluster of the four largest cities, there is also a cluster of two smaller cities (Panevezys c. (57) and Alytus c. (29)). The remaining two clusters represent rural areas and smaller cities. Similar tendencies of differentiating between cities and rural areas were observed for all groups of indicators. As an example to represent the dynamics of indicators, mean, minimum and maximum values of monthly wages in clusters obtained for economic indicators have been presented in Table 2. The colors correspond to the cluster colors in Fig. 5 b.
In order to represent the dynamics of monthly wages in different clusters, the dynamics of mean values are provided in Fig. 6. Although all curves show similar tendencies (growth until 2008, plateau or small decrement in the period of 2008-2011 and growth again in the later years), this also demonstrates the significant gap between the first cluster of large cities and other municipalities. Similar research based on the capability to attract foreign investment in municipalities of Lithuania for the period from 1996 to 2001 was provided in (Burinskienė & Rudzkiene, 2004). The results showed that the development of municipalities depends on the location, that is, on the distance to the largest cities. Moreover, the socio-economic situation is similar in most municipalities of Lithuania. Although the periods and indicators differed from the ones used in this research, the results are found to correspond, due to the rather slow economic processes in the region and the relationships between the indicators which define the economic situation. The results obtained in both studies show similar trends to those of the welfare index of municipalities for 2019 published in (Vilnius Institute of Policy Analysis, 2019). However, the approach proposed in this article enables municipalities to be grouped based not only on the indicators of one year, but also on the change in indicators.

Discussion and conclusions
The time series clustering analysis requires specific trends to be analyzed in a more comprehensive manner. It can, however, be used to quickly identify similarities and differences between the regions without specific expertise. Afterwards, the insights can be used to deepen the analysis for policy makers or new investors to quickly receive basic knowledge of a new market. Additionally, the obtained distance matrix between the regions can be used as an adjustment coefficient when applying the comparative method for real estate valuation.
The proposed methodological approach could be applied in various fields to improve the decision making process. For instance, Asamoah et al. (2019) conducted an extensive literature analysis of the economic indicators influencing the construction industry. The research identified 59 indicators based on literature analysis. The methodological approach provided in our publication could be applied to the dataset to help identify dependency between the time series data. Nugroho et al. (2020) analyzed housing prices of Indonesia and focused on multiple regions and the growth rate of housing prices. Their research focused mainly on the analysis of 4 growth types of housing price; however, based on the presented methodology, it seems that the types were determined based on the analytical and/or expert approach, and not statistical analysis. The presented approach of time-series clustering may be used to provide a more robust analysis approach, which would help to quantify the housing price growth more precisely. Kokot (2020) analyzed the influence of socio-economic factors on the housing market in Poland. Firstly, a correlation analysis was carried out in the publication; afterwards, the variables were normalized and the k-means clustering method was used. The applied clustering analysis focused on correlation indicators and not the actual indicators as in a time series. Thus, the application of the proposed methodological approach in our publication could provide a more in-depth analysis of socio-economic factors.
In the case of our publication, the obtained clusterization results showed that there is a distinct cluster of the largest cities of Lithuania. This cluster is clearly defined in all results obtained for economic, housing and demographic indicators. Another cluster is extracted for regions which are close to the largest cities. It should be noted that differences between other clusters are not significant. However, some tendencies can also be identified; for example, whether the economy of municipalities in the same cluster is based on agriculture, resort activities, etc. The main limitation of the proposed method is sensitivity to the number of clusters and number of principal components used in the analysis. As these parameters must be selected in advance, careful analysis must be performed on what values are appropriate for the analysis. Several future research areas can be considered in order to improve the practical application of the proposed algorithm. The calculation of the common space for each cluster enables the most important information for the objects of the cluster to be obtained and the respective variables combined. For different clusters, the most significant variables can differ. For future research, the number of principal components can be chosen for each cluster individually as the number of eigenvalues which were calculated for the covariance matrix of the cluster objects and exceed a determined threshold value.
In the future, additional types of indicators, such as parameters for agriculture, sustainable energy consumption and other areas, can be integrated in the time series clustering process. The sustainability aspects of the region could better explain the performance of socio-economic change in the regions. The publication proposed a way to cluster municipalities and to derive initial information about them without the involvement of experts. The proposed approach can be used by different institutes, which can integrate other types of variables and derive a precise indicator to quantify the performance level of the regions. Afterwards, a classification algorithm may be used to estimate the tendency of the region based on the new time series data. The proposed methodology can be used by policy makers to improve the decision-making process and to improve the performance measurement process of the recommended policies. Alternatively, the proposed approach can be used to quantify differences between regions when applying the comparative method for real estate valuation. In this case, the data set consisted of socio-economic indicators, though it can be supplemented with more precise risk indicators related to the business or real estate sector.
In conclusion, the approach proposed in this article may serve as an example of multivariate time series analysis in economics. It makes it possible to construct clusters which are based not only on the current situation but also on the changes of the socio-economic indicators in the analyzed period. The approach of clustering municipalities with respect to the series of socio-economic indicators can be applied in various areas, including but not limited to developing pilot projects in order to stimulate social welfare, sharing good practices between the municipalities of the same cluster, and expanding the dataset which is used to evaluate real estate or insurance prices.