MODELING THE PROPORTION OF MEASLES CASES USING SPARSE LEAST TRIMMED SQUARES



INTRODUCTION
Measles is a disease that can be prevented by immunization [1]. Also called morbilli, it is a highly contagious disease caused by a virus of the genus Morbillivirus, which belongs to the RNA virus group (Ministry of Health RI, 2018) [2]. Measles occurs worldwide and is among the ten leading infectious diseases in several developing countries, including Indonesia (Ministry of Health RI, 2022) [3]. The Indonesian Child Protection Commission (KPAI, 2023) reported 55 extraordinary events (outbreaks) in 34 districts/cities across 12 provinces. According to the Ministry of Health of the Republic of Indonesia (Kemenkes RI), the number of measles cases in 2022 reached 3,341, spread across 223 districts/cities, a 32-fold increase over 2021 [4].
In 2022, North Sumatra became one of the provinces that designated measles as an extraordinary event. According to data from the North Sumatra Provincial Health Service, 127 positive cases of measles were recorded in 2022. The city of Medan had the highest number of positive cases, namely 66, followed by Deliserdang with 14 cases and Batubara with 8 cases. Serdangbedagai, Langkat, and Sibolga each had 6 cases, and Tebingtinggi had 4. Central Tapanuli and Binjai City each had 3 cases, and North Labuhanbatu and Simalungun each had 2. South Tapanuli, Labuhanbatu, Nias, Samosir, Padang Lawas, and Gunung Sitoli City each had 1 case. This uneven distribution suggests that the data contain outliers [5]. Therefore, more research is required to determine the variables that affect the proportion of measles cases in North Sumatra [6].
The variables that affect the proportion of measles cases can be identified using regression analysis. Ordinary least squares (OLS) is one way to estimate the regression coefficients. OLS is not suited to high-dimensional data, however, because the resulting estimates are unreliable. Because LASSO regression can select variables by shrinking regression coefficients to zero, it can be used to address the problem of high-dimensional data [7]. However, LASSO is sensitive to outliers, and a robust approach is advised when the data contain them. A technique that combines robust regression with LASSO regression is therefore required to address both issues simultaneously and obtain a simplified model. According to some research, Sparse Least Trimmed Squares (Sparse LTS) is the suggested approach [8].
This research aims to address the problems of high-dimensional data and outliers. Using the Sparse LTS approach, we reduced the 34 independent variables to the 14 included in the Sparse LTS model. The $R^2$ values calculated with RStudio are 93.75% for Sparse LTS and -62.4% for LASSO, which shows that Sparse LTS performs better than LASSO on these data. The findings of this study can be a useful resource for governments to focus on the factors that significantly influence the proportion of measles cases.

Sparse Least Trimmed Squares (Sparse LTS)
Sparse least trimmed squares combines a robust method with a sparse estimation method. Sparse LTS can handle data that are high-dimensional and contain outliers. High-dimensional data have more explanatory variables than observations [8]. Consider a regression of the response on a design matrix, assuming a linear relationship between the explanatory variables $X \in \mathbb{R}^{n \times p}$ and the response variable $y \in \mathbb{R}^{n \times 1}$:

$$y = X\beta + \varepsilon$$

where $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is the vector of regression coefficients and $\varepsilon$ is an error term with zero expectation. With the penalty parameter $\lambda$, the LASSO estimate of $\beta$ is as follows:

$$\hat{\beta}_{LASSO} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^T\beta)^2 + n\lambda \sum_{j=1}^{p} |\beta_j|$$

LASSO regression uses an $L_1$ penalty to estimate the regression coefficients, which shrinks the coefficients of variables that are highly correlated with the error toward zero or exactly to zero. The LASSO method can therefore perform variable selection while overcoming multicollinearity [7]. However, LASSO regression is not resistant to outliers. An outlier is an observation that deviates from the pattern of the data, and the presence of outliers can cause large residuals. A robust regression method is therefore needed to handle this case [10].

The Least Trimmed Squares (LTS) method is a high-breakdown-value method that serves as an alternative to overcome the weaknesses of Ordinary Least Squares (OLS). LTS is a robust estimator of the regression parameters. It is the most often used robust regression estimator, has a straightforward definition, and can be computed quickly [11]:

$$\hat{\beta}_{LTS} = \arg\min_{\beta} \sum_{i=1}^{h} r^2_{(i)}$$

where $r^2_{(1)} \le r^2_{(2)} \le \ldots \le r^2_{(n)}$ are the ordered squared residuals, with $r_i = y_i - x_i^T\beta$, $i = 1, \ldots, n$. LTS follows the same principle as OLS in estimating the regression parameters, namely minimizing the sum of squared residuals. However, LTS does not use all observations in its calculation; it minimizes the sum of squared residuals over only the subset of $h < n$ observations with the smallest squared residuals, and it is applicable only when $n > p$. Thus, for data with $n < p$, it is proposed to extend the fast-LTS algorithm to sparse data by adding an $L_1$ penalty with parameter $\lambda$ to the LTS objective, which leads to the sparse LTS estimator [8]. The form of the sparse least trimmed squares equation is as follows:

$$\hat{\beta}_{sparseLTS} = \arg\min_{\beta} \sum_{i=1}^{h} r^2_{(i)} + h\lambda \sum_{j=1}^{p} |\beta_j|$$

for $h \le n$ and tuning parameter $\lambda \ge 0$.
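The trimmed and penalized objective described above can be made concrete with a small sketch. The function name `sparse_lts_objective` and the toy data are illustrative assumptions, not part of the original analysis (which was carried out in R); the sketch only evaluates the objective for a given coefficient vector and does not perform the minimization.

```python
def sparse_lts_objective(X, y, beta, h, lam):
    """Evaluate the sparse LTS objective for a fixed coefficient vector.

    Sum of the h smallest squared residuals plus h * lam * ||beta||_1.
    X is a list of rows, y a list of responses (hypothetical toy inputs).
    """
    squared_residuals = [
        (yi - sum(xij * bj for xij, bj in zip(xi, beta))) ** 2
        for xi, yi in zip(X, y)
    ]
    # Trimming: keep only the h smallest squared residuals.
    trimmed_sum = sum(sorted(squared_residuals)[:h])
    l1_norm = sum(abs(bj) for bj in beta)
    return trimmed_sum + h * lam * l1_norm


# Toy data: four points on the line y = x and one gross outlier.
X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1.0, 2.0, 3.0, 4.0, 50.0]
beta = [1.0]

untrimmed = sparse_lts_objective(X, y, beta, h=5, lam=0.0)  # outlier included
trimmed = sparse_lts_objective(X, y, beta, h=4, lam=0.0)    # outlier trimmed away
penalized = sparse_lts_objective(X, y, beta, h=4, lam=0.5)  # adds the L1 penalty
```

With h = n and no penalty the single outlier dominates the objective, whereas with h = 4 it is excluded, which is the mechanism that makes the estimator robust.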

Criteria for Types of Data Analysis Techniques
The coefficient of determination ($R^2$) and the Root Mean Square Error (RMSE) are used as the comparison criteria for choosing the best model. The RMSE and $R^2$ are defined as follows:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

where $n$ is the number of observations, $\hat{y}_i$ is the prediction of the $i$th response, $y_i$ is the value of the $i$th response variable, and $\bar{y}$ is the average value of the response variable. The best model is the one with the larger coefficient of determination ($R^2$) and the smaller RMSE.
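As a minimal sketch of these two criteria, the following Python functions compute RMSE and $R^2$ from observed and predicted values. The function names and the toy numbers are illustrative assumptions. The second example shows that $R^2$ becomes negative when a model predicts worse than the mean of the response, which is how a negative value can arise for a poorly fitting model.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error of the predictions."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    sst = sum((a - mean_y) ** 2 for a in y_true)
    return 1 - sse / sst


# Hypothetical toy values, not the measles data.
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]
reversed_fit = [4.0, 3.0, 2.0, 1.0]

r2_perfect = r_squared(y_true, perfect)        # perfect fit
r2_reversed = r_squared(y_true, reversed_fit)  # worse than predicting the mean
rmse_reversed = rmse(y_true, reversed_fit)
```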

RESEARCH METHODS
This research begins with a literature study and the collection of data on the proportion of measles cases in North Sumatra. LASSO is then applied and compared with sparse LTS. R and SPSS software were used to analyze the data. The first analysis stage is outlier detection using box plots. Next, 5-fold cross-validation is used to select the penalty parameter of the LASSO model fitted to the data on the proportion of measles cases, and the LASSO regression analysis is carried out. The sparse LTS model parameters are then estimated, with the λ value for sparse LTS chosen by 3-fold cross-validation. The final stage compares the LASSO and sparse LTS estimation results by calculating $R^2$ and RMSE for each estimate and then drawing conclusions from the analysis. The steps in this research can be explained as follows.
The data analyzed are the percentages of measles cases in North Sumatra in 2022. The dataset consists of one dependent variable, 34 independent variables, and 33 observations. It combines measles statistics with economic, environmental, human resource, and health statistics. The dataset was obtained through a documentation study using official documents from the North Sumatra Health Office and the official website of the North Sumatra Central Statistics Agency (BPS).

RESULTS AND DISCUSSION

Descriptive Statistics
All of the variables used in this study are shown in Figure 2, along with descriptive statistics and the Pearson correlation matrix. The highlighted values are correlations that are significant at the 0.05 level. Numerous pairs of explanatory variables have correlations with a p-value below 0.05, indicating that the relationship between these variables is significant. As a result, the data set has a multicollinearity issue.

Outliers Detection
Box plot analysis is a graphical way to detect outliers in numerical data. The box plots were produced with SPSS software. The output in Figure 3 shows the existence of outliers and extreme values. Observation 19, marked with a circle (°), is an outlier, while observations 12 and 30, marked with an asterisk (*), are extreme values (Figure 3).
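The distinction between outliers and extreme values follows the usual box plot convention: points more than 1.5 box lengths (interquartile ranges) from the edge of the box are outliers, and points more than 3 box lengths away are extreme. A minimal sketch of that rule is given below. The function name, the toy data, and the use of Python's `statistics.quantiles` with the `inclusive` method are assumptions for illustration; SPSS computes the box edges from Tukey's hinges, so borderline points may be classified slightly differently.

```python
import statistics


def classify_boxplot_points(values):
    """Label each value as 'normal', 'outlier' (> 1.5 IQR) or 'extreme' (> 3 IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    labels = []
    for v in values:
        # Distance beyond the box; zero for points inside the box.
        distance = max(q1 - v, v - q3, 0.0)
        if distance > 3 * iqr:
            labels.append("extreme")
        elif distance > 1.5 * iqr:
            labels.append("outlier")
        else:
            labels.append("normal")
    return labels


# Hypothetical toy values, not the measles data.
values = [10, 11, 12, 13, 14, 15, 16, 17, 18, 27, 60]
labels = classify_boxplot_points(values)
```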

Analysis of LASSO
Before estimating the LASSO regression coefficients, λ is determined using K-fold cross-validation. Cross-validation, also known as rotation estimation, is a model validation technique that assesses how well the results of a statistical analysis generalize to different data sets. One such method is K-fold cross-validation, which divides the data into K equal-sized parts (Rahayu & Husein, 2023). The K-fold cross-validation formula is:

$$CV_{(K)} = \frac{1}{K}\sum_{i=1}^{K} E_i$$

where $E_i$ is the evaluation value at iteration $i$; here the MSE is used as the evaluation value. With the help of RStudio, a plot is displayed depicting the mean squared error (MSE) against Log(λ) in the LASSO estimation. 5-fold cross-validation was carried out to select the λ with the lowest MSE, namely λ = 0.3031107. Next, the model coefficients are calculated and visualized for each λ value evaluated during cross-validation. A regularization path analysis with the LASSO (L1) penalty gives the coefficients of the linear regression model at various levels of the regularization parameter λ. Figure 5 shows how the L1-norm tends to drop as the LASSO regression shrinks the regression coefficients toward zero. The L1-norm is the sum of the absolute values of the coefficients.
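The selection of λ by K-fold cross-validation can be sketched as follows. To keep the example self-contained, it uses a one-variable LASSO without intercept, for which the minimizer of (1/(2n)) Σ (y_i − x_i β)² + λ|β| has a closed form through soft-thresholding. The function names, the interleaved fold assignment, the λ grid, and the toy data are all assumptions for illustration; the study itself used RStudio on 34 explanatory variables.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator used by the LASSO."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_lasso_1d(xs, ys, lam):
    """Closed-form one-variable LASSO (no intercept)."""
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n
    scale = sum(x * x for x in xs) / n
    return soft_threshold(rho, lam) / scale


def k_fold_cv_mse(xs, ys, lam, k):
    """Average test MSE over k folds (fold i holds indices with index % k == i)."""
    n = len(xs)
    fold_mse = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        beta = fit_lasso_1d([xs[i] for i in train_idx],
                            [ys[i] for i in train_idx], lam)
        errors = [(ys[i] - beta * xs[i]) ** 2 for i in test_idx]
        fold_mse.append(sum(errors) / len(errors))
    return sum(fold_mse) / k


# Hypothetical toy data on the exact line y = 2x.
xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x for x in xs]

lambdas = [0.0, 1.0, 10.0, 100.0]
cv = {lam: k_fold_cv_mse(xs, ys, lam, k=5) for lam in lambdas}
best_lambda = min(cv, key=cv.get)  # lambda with the lowest cross-validated MSE
```

In practice the folds are usually assigned at random rather than by index, so repeated runs can select slightly different λ values.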

Figure 5. Regression coefficients of LASSO plot
LASSO regression then shrinks regression coefficients to zero while selecting explanatory variables, so that only influential explanatory variables are retained in the regression model. The LASSO analysis identifies the variables that influence the percentage of measles cases in North Sumatra through their coefficients: variables with non-zero coefficients are the ones that influence the proportion of measles cases. Table 2 shows the coefficient of each explanatory variable as determined by the LASSO analysis in RStudio.

Analysis of Sparse Least Trimmed Squares
To handle high-dimensional data and outliers simultaneously, LASSO regression is paired with one of the most popular robust regression estimators, Least Trimmed Squares (LTS), to form the robust LASSO estimator known as Sparse Least Trimmed Squares. The sparse LTS model was fitted using RStudio, and the results are shown in Table 3. Some of the regression coefficients shrink to zero. An explanatory variable with a coefficient of zero does not contribute to explaining the response variable. Fourteen variables in the sparse LTS model can be used to explain the percentage of measles cases in North Sumatra. Figure 6 shows the relationship between the log10 value of the λ parameter and the RMSE (Root Mean Square Error) for the sparse LTS method under 3-fold cross-validation. Nine λ values are evaluated to select the optimal value that gives the minimum prediction error. The selected value is λ = 10, that is, log10(λ) = 1, with the lowest RMSE of 6.954277.
Next, a residual analysis of the sparse LTS model produces a plot of the standardized residuals against the fitted values. Based on the plot, observations 2, 7, 13, 18, 19, and 20 were identified as potential outliers (Figure 7).
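A common way to turn such a residual plot into a rule is to standardize the residuals by a robust scale estimate and flag observations whose absolute standardized residual exceeds a cutoff. The sketch below is only an illustration under stated assumptions: the scale is estimated by the normalized median absolute deviation (MAD), the cutoff of 2.5 is a conventional choice, and the function name and residual values are hypothetical. The source does not state which scale estimate or cutoff was used for Figure 7.

```python
import statistics


def flag_potential_outliers(residuals, cutoff=2.5):
    """Return 1-based indices whose standardized residual exceeds the cutoff.

    Assumption: scale is 1.4826 * MAD, a standard robust scale estimate.
    """
    center = statistics.median(residuals)
    mad = statistics.median([abs(r - center) for r in residuals])
    scale = 1.4826 * mad
    return [
        i + 1
        for i, r in enumerate(residuals)
        if abs(r / scale) > cutoff
    ]


# Hypothetical residuals, not taken from the fitted model.
residuals = [0.1, -0.2, 0.05, 0.15, -0.1, 5.0, -0.05, 0.2]
flagged = flag_potential_outliers(residuals)
```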

Evaluation of the Goodness of Fit Model
The KPI (Key Performance Indicator) values, in the form of $R^2$ and RMSE, from the LASSO and sparse LTS models are used to select the best model. The KPIs of the two models are shown in Table 5.

Figure 6. Cross-Validation Estimation of Sparse LTS Prediction Error

Figure 7. Standardized Residuals vs Fitted Values of the Sparse LTS Model

CONCLUSION
This research used the sparse LTS approach to model the percentage of measles cases in North Sumatra in 2022. Sparse LTS aims to provide a more robust and simpler model that can make predictions efficiently with fewer explanatory variables. Compared with the traditional LASSO estimator on the basis of $R^2$ and RMSE values, the sparse LTS model successfully chose 14 out of 34 variables, reducing the number of explanatory variables required while maintaining the explanatory power of the model. Meanwhile, the remaining 20 variables are not included in the model, namely Measles Immunization Coverage Percentage ($X_1$), Number of Children Getting Coverage of Vitamin A Aged 12-59 Months ($X_4$), Percentage of Province Area ($X_5$), Number of Babies Born ($X_6$), Low Weight Baby (LWB) ($X_7$), Number of General Hospitals ($X_8$), Number of Integrated Healthcare Center ($X_{12}$), Number of Doctors ($X_{13}$), Number of Midwives ($X_{14}$), Number of Nutritionists ($X_{15}$), Number of Nurses ($X_{17}$), Long School Expectations ($X_{21}$), GRDP Growth Rate ($X_{23}$), Percentage of Poor Population ($X_{24}$), Diphtheria, Pertussis and Tetanus (DPT) Immunization Percentage ($X_{25}$), Labor Force Participation Rate ($X_{27}$), Human Development Index ($X_{28}$), Number of Villages/Subdistricts ($X_{29}$), Population Density ($X_{31}$), and Percentage of Households that Have Access to an Improper Source of Drinking Water ($X_{34}$). These variables have high Pearson correlations with other variables, which is why they are not included in the model. The government may use these findings as a guide to reduce the number of measles cases in North Sumatra.

Table 1. Data Description
$X_3$: Number of Children Getting Coverage of Vitamin A Aged 6-11 Months (People)
$X_4$: Number of Children Getting Coverage of Vitamin A Aged 12-59 Months

Table 2. Regression Coefficient LASSO Estimation Results

Table 3. Sparse LTS Estimation Coefficient of Regression

Furthermore, Table 4 shows the selection of λ values in the sparse LTS model through 3-fold cross-validation. Nine values of λ are evaluated to choose the optimal value that gives the minimum prediction error.

Table 4. Results via 3-fold cross-validation of λ and RMSE values

Table 5. Results of Evaluation of the LASSO and Sparse LTS Models

From Table 5, it can be seen that the sparse LTS model has a coefficient of determination ($R^2$) greater than that of the LASSO model, namely 93.75%. Thus, judged by the coefficient of determination ($R^2$), the sparse LTS model is preferable to the LASSO model. Furthermore, the sparse LTS model has the smaller RMSE, with a value of 0.2933. Based on the evaluation results, the most effective model for predicting the percentage of measles cases in North Sumatra in 2022 is the sparse LTS model.