Model Based Clustering for Regency/City Grouping Based on Community Welfare Indicators in North Sumatra



Introduction
Every individual, in Indonesia and around the world, in urban and rural areas alike, wants to achieve a prosperous life in both its physical and spiritual aspects. Prosperity includes social, material, and spiritual elements accompanied by security, moral values, and inner peace [1].
Essentially, achieving community welfare is the main goal of every economic development effort. In Indonesia, welfare is also one of the state's goals, as expressed in the preamble of the 1945 Constitution, which aims to "protect all Indonesian people and all descendants of Indonesia and to improve the general welfare, improve the nation's knowledge of life." Development projects in a region bring about social change in society. New potential to improve the economic welfare of residents develops along with the positive and negative impacts of these initiatives. Development therefore not only drives economic growth but also changes the social and cultural life of the community, including lifestyles and the emergence of various social issues. To improve community welfare, the development process must proceed consistently, with the community as an active stakeholder. In line with the spirit of regional autonomy, which emphasizes community participation, this demands seriousness from the government through thorough planning that involves all interested parties [2].
Model-based clustering (MBC) is a clustering method that uses probability models to group data. One distribution commonly used in MBC is the normal distribution. However, not all data fit the normal distribution, especially when outliers are present. Therefore, in 2012, Andrews and McNicholas developed a more robust model for data containing outliers by adopting the t-distribution.
This model was initially used to cluster objects in a population. The basic assumption in MBC is that a population contains identifiable subpopulations, each following a certain probability distribution with its own parameters. The population as a whole follows a mixture distribution in which each subpopulation has its own proportion. This assumption leads to the mathematical probability model of model-based clustering. Finite mixture models in clustering have developed rapidly and become one of the popular clustering methods [3].

Descriptive Statistics
Descriptive statistics is a statistical analysis commonly used to organize and present data. Typically, descriptive statistics are used as a preliminary step to organize data before conducting further analysis. Research results can be extrapolated if the null hypothesis ($H_0$) is accepted. Descriptive analysis involves one or more variables, each analyzed separately, so no comparison or correlation between variables is made in this analysis [4][5].

Cluster Analysis
Cluster analysis is a statistical analysis technique used to group objects into two or more clusters based on the similarity of their characteristics. Cluster analysis aims to maximize the similarity of objects within clusters while maximizing the differences between clusters [6][7].
There are two categories of cluster formation processes: hierarchical and non-hierarchical. Hierarchical methods proceed step by step, forming stages that resemble a tree structure and can be displayed as a dendrogram. Non-hierarchical methods, of which k-means is the best-known example, differ in that they start by fixing the desired number of clusters, after which the observations are assigned to form those clusters [8][9][10].

Multivariate Outlier Detection
Multivariate outlier detection is carried out to test the initial hypothesis that the research data contain values that differ significantly from the rest. When such values are present, finite mixture model-based clustering with the multivariate t-distribution is more appropriate, because it gives robust clustering results and can handle extreme values [11].
One method used to assess the presence of multivariate outliers is the Mahalanobis distance, which is defined as follows:

$$d_i^2 = (x_i - \bar{x})^{T} S^{-1} (x_i - \bar{x}), \quad i = 1, 2, \ldots, n$$

where $x_i$ is the $i$-th observation vector, $\bar{x}$ is the sample mean vector, and $S$ is the sample covariance matrix. An observation is considered an outlier if its Mahalanobis distance is greater than $\chi^2_{p;\,1-\alpha/2}$, where $p$ is the degrees of freedom and $\alpha$ is the predetermined significance level [12].
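As an illustration of the outlier rule described above, the following minimal sketch computes squared Mahalanobis distances and flags observations above a chi-square cut-off. It is not the computation used in this study: the data are hypothetical, only two variables are used so that the matrix inverse and the chi-square quantile have closed forms, and the 0.95 quantile level is an assumption made for the example.

```python
import math

def mean_vector(rows):
    """Column means of a list of equal-length observations."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def covariance_2x2(rows, mean):
    """Sample covariance matrix (divisor n - 1) for two variables."""
    n = len(rows)
    sxx = sum((r[0] - mean[0]) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - mean[1]) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mean[0]) * (r[1] - mean[1]) for r in rows) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def mahalanobis_sq(x, mean, cov):
    """Squared distance (x - mean)^T S^{-1} (x - mean) for a 2x2 matrix S."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def chi2_quantile_2df(q):
    """Quantile of the chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - q)

# Hypothetical two-indicator data; the last row is an obvious outlier.
data = [
    (1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (1.1, 2.2), (1.0, 1.9),
    (1.3, 2.0), (0.8, 2.1), (1.1, 1.9), (1.0, 2.0), (5.0, 9.0),
]

x_bar = mean_vector(data)
S = covariance_2x2(data, x_bar)
cutoff = chi2_quantile_2df(0.95)  # assumed quantile level for this example
distances = [mahalanobis_sq(x, x_bar, S) for x in data]
flagged = [i for i, d2 in enumerate(distances) if d2 > cutoff]
print(flagged)
```

With five variables, as in this study, the same logic applies, but the inverse of $S$ and the chi-square quantile would normally come from a numerical library rather than from closed forms.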

Multivariate t Distribution
The multivariate t-distribution is an alternative distribution used when the data contain many outliers, which give the data heavier tails than a multivariate normal distribution allows. The multivariate t-distribution is an extension of the univariate t-distribution.
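If a random vector $X = (X_1, X_2, \ldots, X_p)^{T}$ has a multivariate t-distribution with $\nu$ degrees of freedom, mean vector $\mu = (\mu_1, \mu_2, \ldots, \mu_p)^{T}$, and covariance matrix $\Sigma$, then its probability density function is as follows:

$$f(x \mid \mu, \Sigma, \nu) = \frac{\Gamma\left(\frac{\nu + p}{2}\right) |\Sigma|^{-1/2}}{(\pi\nu)^{p/2}\, \Gamma\left(\frac{\nu}{2}\right) \left[1 + \frac{\delta(x, \mu \mid \Sigma)}{\nu}\right]^{(\nu + p)/2}}$$

where $\delta(x, \mu \mid \Sigma) = (x - \mu)^{T} \Sigma^{-1} (x - \mu)$ is the squared Mahalanobis distance between $x$ and $\mu$. The parameter $\nu$ is also known as the shape parameter because its value affects the shape of the distribution. The multivariate t-distribution handles outliers better than the multivariate normal distribution and is therefore often used in model-based clustering [13].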

Model-Based Clustering
Although model-based clustering has advantages in characterizing groups with few parameters and meeting statistical assumptions, it also has drawbacks. One is long computation time, especially with many groups or large datasets. Model-based clustering also faces challenges in estimating the correct number of groups [14].
Banfield and Raftery developed a framework for model-based clustering using the eigenvalue decomposition of the group covariance matrix, $\Sigma_g = \lambda_g D_g A_g D_g^{T}$. Here $A_g$ is a diagonal matrix whose elements are proportional to the eigenvalues and which indicates the contours of the density function [12].
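The decomposition $\Sigma_g = \lambda_g D_g A_g D_g^{T}$ can be illustrated on a single 2x2 covariance matrix. The sketch below is a minimal example under stated assumptions: it uses the common normalization $\lambda = |\Sigma|^{1/p}$ so that $\det(A) = 1$, it relies on the closed-form eigen-decomposition available only for symmetric 2x2 matrices, and the matrix values and function names are hypothetical.

```python
import math

def decompose_2x2(sigma):
    """Split a symmetric 2x2 covariance matrix into (lam, D, A).

    lam : scalar volume term, here |Sigma|^(1/2)
    D   : orthogonal matrix whose columns are the eigenvectors
    A   : diagonal entries proportional to the eigenvalues, with product 1
    """
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    half_gap = math.hypot((a - d) / 2.0, b)
    e1 = (a + d) / 2.0 + half_gap          # larger eigenvalue
    e2 = (a + d) / 2.0 - half_gap          # smaller eigenvalue
    theta = 0.5 * math.atan2(2.0 * b, a - d)   # angle of the major axis
    c, s = math.cos(theta), math.sin(theta)
    D = [[c, -s], [s, c]]
    lam = math.sqrt(e1 * e2)
    A = [e1 / lam, e2 / lam]
    return lam, D, A

def recompose_2x2(lam, D, A):
    """Rebuild Sigma = lam * D * diag(A) * D^T."""
    return [[lam * sum(D[i][k] * A[k] * D[j][k] for k in range(2))
             for j in range(2)] for i in range(2)]

sigma = [[4.0, 1.2], [1.2, 1.0]]   # hypothetical covariance matrix
lam, D, A = decompose_2x2(sigma)
rebuilt = recompose_2x2(lam, D, A)
print(lam, A)
print(rebuilt)
```

Constraining $\lambda_g$, $D_g$, or $A_g$ to be equal across groups, or fixing them to simple forms, is what generates the family of candidate covariance structures compared during model selection.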

Finite Mixture Model
Finite mixture models provide significant flexibility in modeling data with multiple modes, skewness, and nonstandard distribution characteristics. However, this flexibility comes at the cost of a growing number of parameters as the number of components increases [15].
Assume a random vector $x$ of dimension $p$ comes from a finite mixture distribution with probability density function:

$$f(x \mid \vartheta) = \sum_{g=1}^{G} \pi_g f_g(x \mid \theta_g)$$

The ICL (Integrated Completed Likelihood) criterion is a popular approach in model-based clustering, as it selects the number of clusters in a mixture model automatically. It maximizes the likelihood of the complete data, so the allocation of observations to clusters is part of the model selection criterion [16].
The best model-based clustering solution can be selected with the ICL criterion, whose principle is to maximize the likelihood function of the complete data. The ICL is the maximized complete-data log-likelihood, $\ln L_c(\hat{\theta})$, penalized by a term in $p \ln(n)$, where $L_c(\theta)$ is the likelihood function of the complete data, $p$ is the total number of parameters, and $n$ is the number of observations [11].
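The effect of the ICL on model choice can be sketched numerically. The example below assumes one common convention, which is not confirmed as the exact form used in this study: $\mathrm{BIC} = 2\ln L(\hat{\theta}) - p\ln(n)$ and $\mathrm{ICL} = \mathrm{BIC} + 2\sum_i \ln \max_g \hat{z}_{ig}$, where $\hat{z}_{ig}$ is the estimated probability that observation $i$ belongs to group $g$. Under this convention larger values are better. The log-likelihoods, parameter counts, and membership probabilities are all hypothetical.

```python
import math

def bic(loglik, n_params, n_obs):
    """BIC in the 'larger is better' form: 2 ln L - p ln n (assumed convention)."""
    return 2.0 * loglik - n_params * math.log(n_obs)

def icl(loglik, n_params, posteriors):
    """BIC plus a penalty for uncertain assignments (assumed convention).

    `posteriors` holds one row per observation with its group membership
    probabilities; the penalty is zero when every assignment is certain.
    """
    n_obs = len(posteriors)
    uncertainty_penalty = 2.0 * sum(math.log(max(row)) for row in posteriors)
    return bic(loglik, n_params, n_obs) + uncertainty_penalty

# Hypothetical candidate fits on four observations.
candidates = {
    "two groups": {
        "loglik": -50.0, "n_params": 5,
        "posteriors": [[0.99, 0.01], [0.98, 0.02], [0.03, 0.97], [0.01, 0.99]],
    },
    "three groups": {
        "loglik": -47.0, "n_params": 8,
        "posteriors": [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1],
                       [0.2, 0.5, 0.3], [0.1, 0.3, 0.6]],
    },
}

bic_scores = {name: bic(c["loglik"], c["n_params"], len(c["posteriors"]))
              for name, c in candidates.items()}
icl_scores = {name: icl(c["loglik"], c["n_params"], c["posteriors"])
              for name, c in candidates.items()}
best_by_bic = max(bic_scores, key=bic_scores.get)
best_by_icl = max(icl_scores, key=icl_scores.get)
print(best_by_bic, best_by_icl)
```

In this constructed example the three-group fit has the higher BIC, but its assignments are uncertain, so the ICL prefers the two-group fit. This is the sense in which the criterion takes the allocation of observations to clusters into account.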

Community Welfare
Social welfare is the condition in which all basic needs are met, especially fundamental ones such as food, clothing, housing, education, and health care. Several indicators of welfare are described below [17].
HDI - The United Nations Development Programme (UNDP) has used the Human Development Index (HDI) since 1990 to assess a country's human development achievements and publishes it in an annual report known as the Human Development Report (HDR) [18].
Poor Population - Poverty, unemployment, and social exclusion are social issues that require immediate attention and belong on the main agenda of every country. The Copenhagen Social Development Action Programme of 1995, a worldwide high-level meeting, is evidence of this [19].
Open Unemployment Rate - Dealing with unemployment is one of the most difficult problems. Despite a decrease, there are still many unemployed people in Indonesia. Human development is the key to a country's ability to create job opportunities and reduce the unemployment rate [20].
Gross Regional Domestic Product (GRDP) - GRDP is an important metric in this assessment because it describes the economic condition of a region in a specific period and gives a comprehensive overview of the area's economic contribution during that period [21].
Health - The reasons behind the decline in human quality of life, individually and collectively, remain a matter of debate. Part of the problem is the difficulty of conducting research on humans that can identify causal relationships. This issue is highly complex and influenced by many factors [22].

Data Description
Data description is an effort to present data in a way that is easy to understand and interpret. In this study, the data consist of five independent variables, namely: Human Development Index (HDI) ($X_1$), Poor Population ($X_2$), Open Unemployment Rate ($X_3$), Gross Regional Domestic Product ($X_4$), and Health ($X_5$).
The data used in this study are secondary data obtained from the Central Bureau of Statistics of North Sumatra for 2018 to 2022. The research area is as follows:

Multivariate Outlier Detection
The Mahalanobis Distance for each observation can be calculated and will indicate the distance of an observation from the mean of all variables in a multidimensional space [23].
The outlier detection method is carried out by calculating Mahalanobis and robust distances. These results are then compared with the cut-off value from the $\chi^2_{p;0.05}$ distribution. Because this study uses 5 variables, the degrees of freedom are $k = 5$, and the significance level is 0.05. The cut-off values generated each year will vary. Points outside this boundary are considered outliers and marked with a special symbol.
To determine whether regencies/cities are detected as outliers, outlier tests are conducted on the five variables: HDI, poor population, LFPR, GRDP, and health. The mean vector ($\bar{x}$) is obtained first, and the mean difference vector for Nias Regency is:

$$(x - \bar{x})^{T} = (-0.400,\ 5.044,\ -2.973,\ -2.560,\ 7.885)$$

The covariance matrix for 2018 was calculated using Excel. From these matrices, the Mahalanobis distance for Nias Regency is calculated using the formula $d^2 = (x - \bar{x})^{T} S^{-1} (x - \bar{x})$. The same method is used to calculate the Mahalanobis distance for the next regency, up to North Padang Lawas Regency, so that all 33 regencies/cities are covered. Because this study uses 5 variables, the cut-off value is taken from the chi-square distribution with degrees of freedom $k = 5$ at the 0.05 significance level.

The plot shows Mahalanobis distance against robust distance for regencies/cities in North Sumatra in 2018. In the visualization, four points lie far from the intersection of the cut-off lines, indicating that four regencies/cities are outliers in the multivariate data. The regencies/cities detected as outliers through this analysis are Sibolga, Pematang Siantar, and Padangsidempuan.
The same procedure was then carried out for the remaining years, 2019 to 2022, and the results are as follows.

Model-Based Clustering with Integrated Completed Likelihood Criterion
Model-based clustering (MBC) was carried out with the teigen package in R, which considers 28 candidate models with a maximum of 9 groups. The optimal model and number of groups are selected by the largest ICL value [24].

Clustering of Regencies/Cities in 2018
For MBC with a multivariate t mixture and the ICL criterion, the teigen package in R considers 28 possible models with up to 9 groups.
The 2018 analysis gave a highest ICL value of -186.1064, achieved with two groups under the CICC model.

Clustering of Regencies/Cities in 2019
The data analysis results for 2019 show that the highest ICL value is -172.8015. This value was achieved when the data was clustered into 3 groups using the CICU model.
The marginal contour plot of the Community Welfare Indicator data in 2019 is shown in the following figure, and the clustering results of regencies/cities in 2019 are detailed in the following table.

Clustering of Regencies/Cities in 2020
The data analysis results for 2020 show that the highest ICL value is -175.2595. This value was achieved when the data was clustered into 2 groups using the CIUC model. From this model, it can be seen that a clustering pattern with the following characteristics emerged.

Clustering of Regencies/Cities in 2021
The clustering of regencies/cities in North Sumatra in 2021 can be visualized through a marginal contour plot, as shown in the following image.

Clustering of Regencies/Cities in 2022
The data analysis results for 2022 show that the highest ICL value is -191.6845. This value was achieved when the data was clustered into 2 groups using the CICC model. From this model, it can be seen that a clustering pattern with the following characteristics emerged.

Cluster Similarity Test
Cluster similarity testing aims to identify significant differences between the formed groups. This process tests for mean differences using MANOVA, a multivariate analysis that assesses whether the population mean vectors of the groups are equal.

The null hypothesis $H_0$ states that the group mean vectors are equal. The results of the cluster similarity test with Pillai's Trace statistic show that the p-value is smaller than the significance level $\alpha$ (0.05) for each year [25]. This indicates significant differences between the mean vectors of the groups in each year. Therefore, it can be concluded that Group 1 significantly differs from the other groups each year. With the rejection of the null hypothesis $H_0$, cluster analysis for each regency/city in North Sumatra can be performed.
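The Pillai's Trace statistic used in the test above can be sketched as $V = \mathrm{tr}\left[H (H + E)^{-1}\right]$, where $H$ is the between-group and $E$ the within-group matrix of sums of squares and cross-products. The following minimal example is an illustration only: it uses two variables so that the matrix inverse has a closed form, the group data are hypothetical, and the conversion of $V$ into a p-value is omitted.

```python
def mean_vector(rows):
    """Column means of a list of two-variable observations."""
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def sscp(rows, center):
    """2x2 sums of squares and cross-products of rows around `center`."""
    sxx = sum((r[0] - center[0]) ** 2 for r in rows)
    syy = sum((r[1] - center[1]) ** 2 for r in rows)
    sxy = sum((r[0] - center[0]) * (r[1] - center[1]) for r in rows)
    return [[sxx, sxy], [sxy, syy]]

def pillai_trace(groups):
    """Pillai's trace tr(H (H + E)^{-1}) for groups of two-variable rows."""
    all_rows = [r for g in groups for r in g]
    T = sscp(all_rows, mean_vector(all_rows))      # total SSCP, equals H + E
    E = [[0.0, 0.0], [0.0, 0.0]]                   # within-group SSCP
    for g in groups:
        W = sscp(g, mean_vector(g))
        for i in range(2):
            for j in range(2):
                E[i][j] += W[i][j]
    H = [[T[i][j] - E[i][j] for j in range(2)] for i in range(2)]
    det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
    T_inv = [[T[1][1] / det, -T[0][1] / det],
             [-T[1][0] / det, T[0][0] / det]]
    return sum(H[i][j] * T_inv[j][i] for i in range(2) for j in range(2))

# Hypothetical groups: the first pair shares a mean, the second pair does not.
group_a = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
group_b_same_mean = [(1.0, 0.0), (1.0, 2.0), (0.0, 1.0), (2.0, 1.0)]
group_b_shifted = [(11.0, 10.0), (11.0, 12.0), (10.0, 11.0), (12.0, 11.0)]

v_same = pillai_trace([group_a, group_b_same_mean])
v_separated = pillai_trace([group_a, group_b_shifted])
print(v_same, v_separated)
```

Values of $V$ near zero are consistent with equal group mean vectors, while values near the maximum (1 for two groups) indicate well-separated groups, which is the situation in which $H_0$ is rejected.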

Conclusion
Based on the analysis and discussion, it can be concluded that model-based clustering can help group regencies/cities by community welfare indicators. Based on the distance-distance plot for outlier detection, it was found that at the 90th percentile of the data used in this study, there were 3 outliers each year. In 2018, 2020, 2021, and 2022, 2 clusters were formed each year, which is the ideal number to use. In 2019, however, 3 clusters were formed. The regencies/cities of Nias, South Nias, North Nias, and West Nias were consistently in Cluster I. The clustering results from 2018 to 2022 show that Cluster I represents regencies/cities with low HDI and GRDP compared to those in Cluster II and III.


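In the decomposition $\Sigma_g = \lambda_g D_g A_g D_g^{T}$, $\lambda_g$ is a scalar value that indicates the volume of the ellipses, and $D_g$ is an orthogonal matrix of eigenvectors that represents the orientation of the principal components.

In the finite mixture density, $f_g(x \mid \theta_g)$ is called the probability density function of $x$ with group parameter $\theta_g$, $G$ is the number of groups, and $\pi_g$ is the weight or mixing proportion of group $g$, subject to the following constraints [13]: $0 \le \pi_g \le 1$ and $\sum_{g=1}^{G} \pi_g = 1$.

The mean of each variable is computed over the 33 regencies/cities; for the fifth variable, (….04 + 14.3 + 8.46 + … + 14.49)/33 = 12.154. After obtaining the mean ($\bar{x}$), the next step is determining the mean difference ($x - \bar{x}$) for Nias Regency:

$$x - \bar{x} = (2.63 - 3.03,\ 16.37 - 11.325,\ 1.62 - 4.593,\ 0.47 - 3.03,\ 20.04 - 12.154)^{T}$$

The cut-off value is read from the chi-square table with the stated degrees of freedom and a significance of 0.95. The results indicating whether each regency is an outlier are presented in the following table.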
The marginal contour plot results for the Community Welfare Indicators in 2018 are as follows:

Figure 2. Marginal Contour Plot Data for the Year 2018

Figure 2 displays the visualization of the formed group members. Further information regarding the clustering results of regencies/cities in North Sumatra in 2018 can be found in the following table:

Figure 3. Marginal Contour Plot Data for the Year 2019

Figure 4. Marginal Contour Plot Data for the Year 2020

Figure 4 shows the visual representation of the formed group members, with differences between group members indicated by color. Further details regarding the clustering results of regencies/cities in North Sumatra in 2020 can be found in the following table.

Figure 5. Marginal Contour Plot Data for the Year 2021

The visualization of the formed group members in 2021 is shown in Figure 5. Each group is represented in the contour plot. Further information regarding the clustering results of regencies/cities in North Sumatra in 2021 can be found in the following table.

Figure 6. Marginal Contour Plot Data for the Year 2022

Figure 6 shows the visualization of the formed group members in 2022. Theoretically, each group's contour plots should have similar shapes and volumes. However, differences in the results can be caused by the perspective of contour extraction and different variable combinations. More detailed information regarding the clustering results of regencies/cities in North Sumatra in 2022 can be found in the following table.

Table 1. List of Regencies/Cities in North Sumatra

Table 2. Mahalanobis Distance and Robust Distance of Regencies/Cities in 2018.

Figure 1. Plot of Mahalanobis Distance against Robust Distance for 2019.

Table 3. List of Regencies/Cities Detected as Outliers.

Table 4. Clustering Results of Regencies/Cities in North Sumatra in 2018.

Table 5. Clustering Results of Regencies/Cities in North Sumatra in 2019.

Table 6. Clustering Results of Regencies/Cities in North Sumatra in 2020.

The clustering of regencies/cities in North Sumatra in 2021 yielded 2 groups, because the largest ICL value, -174.0472, was obtained at G=2. The appropriate framework model for the covariance matrix in this case is the UIUC model.

Table 7. Clustering Results of Regencies/Cities in North Sumatra in 2021.

Table 8. Clustering Results of Regencies/Cities in North Sumatra in 2022.

Table 9. Results of Group Equality Test with Manova Test.

If the p-value < $\alpha$ at a significance level of 0.05, the decision from the test is to reject the null hypothesis $H_0$.