Diabetes Detection System by Mixing Supervised and Unsupervised Algorithms

Gestational diabetes mellitus occurs when a woman develops high blood sugar during pregnancy. It can arise at any point in the pregnancy and cause complications for the mother and baby during or after the pregnancy. If the risks are identified and addressed early, there is a chance they can be reduced. The healthcare system is one of many areas of daily life being reshaped by intelligent systems built with machine learning algorithms. In this article, a hybrid prediction model is proposed to determine whether a woman has gestational diabetes. The proposed model reduces the amount of data using the K-means clustering method, and predictions are made with several classification methods: decision trees, random forests, SVM, KNN, logistic regression, and naive Bayes. The results show that accuracy increases when clustering and classification are combined.


Introduction
Diabetes mellitus during pregnancy, also known as gestational diabetes, is one of the most frequent complications that may arise, affecting about one in every six live births globally [1,2]. The condition is defined as any degree of glucose intolerance with onset or first recognition during pregnancy.
Alehegn, Joshi, and Alehegn [4] discussed that GD may strike at any point during pregnancy, producing complications for the mother and baby throughout pregnancy and after the baby is born. The potential consequences can be reduced if the risks are recognized and mitigated early. Machine learning algorithms are being used to create intelligent systems, which are bringing about changes in every facet of our lives.
Artificial intelligence makes the lives of patients, clinicians, and hospital managers simpler by automating tasks humans normally carry out [3]. Current goals for AI include reducing the number of incorrect diagnoses of diabetes and developing more effective medical treatment methods [8]. Healthcare systems now generate vast volumes of data, and more complex systems rely on this data to construct more precise models. The purpose of this work is to construct a model that can examine both current and historical instances of diabetes to forecast and diagnose future cases using machine learning techniques, which may be of use to patients and hospital administrators [5]. Moreover, we sought to determine, via machine learning algorithms, which model achieves the highest precision with the lowest error rate. To that end, we worked with a dataset obtained from governmental and private laboratories in the Iraqi Kurdistan Region.
The information we gathered consists of the subjects' ages, weights, heights, gestational numbers, family medical history, and the results of their diabetes tests. This information reveals at what ages and under which circumstances pregnant women are more likely to develop gestational diabetes. The remainder of this paper is structured as follows: Section 2 explains the background and several studies of related work. Section 3 describes the methodology of the proposed model, and Section 4 discusses the results. Finally, Section 5 presents the conclusion.

Background and Literature Review
This section explains the backgrounds of machine learning methods used in this research and several studies of related works.

Machine learning (ML)
Smarter processing, operation, and optimization of future communication networks will be required as the scope of applications grows [7,8]. ML, a branch of artificial intelligence, must be integrated into the design, planning, and optimization of future communication networks to accomplish this intelligent processing vision.

Supervised learning
Supervised learning techniques are used to create a mathematical model from a dataset that contains both the data inputs and the expected results. The relevant data is known as training data and consists of training instances. Each training sample comprises one or more inputs and the desired output (also known as a supervisory signal). The training data as a whole is described by a matrix, and each training example is represented by a vector or array, sometimes referred to as a "feature vector."
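As a minimal sketch of this setup (the numbers below are made up, not the study's data), the training data can be held as a feature matrix with a label vector, and even a trivial supervised learner can fit a single threshold on one feature:

```python
# Hypothetical toy data: each row is a feature vector [age, BMI];
# each label is 1 (diabetic) or 0 (non-diabetic).
X = [[25, 21.0], [31, 24.5], [38, 29.0], [42, 31.5]]
y = [0, 0, 1, 1]

def fit_threshold(X, y, feature):
    """Learn the midpoint threshold on one feature that separates the two classes."""
    pos = [row[feature] for row, label in zip(X, y) if label == 1]
    neg = [row[feature] for row, label in zip(X, y) if label == 0]
    return (min(pos) + max(neg)) / 2  # midpoint between the classes

def predict(threshold, x, feature):
    return 1 if x[feature] >= threshold else 0

t = fit_threshold(X, y, feature=1)  # threshold on the BMI column
print(t, predict(t, [35, 30.0], feature=1))
```

The learner "supervises" itself against the known labels in `y`; a real model would of course use all features, not one threshold.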

Clustering Algorithm
Clustering is a machine learning technique that divides a collection of data points into groups such that points in the same group are more similar to one another than to points in other groups. In other words, the goal is to form clusters whose members share common characteristics [2]. Clustering is a procedure with wide applicability in data science. The approach [18] analyzes the degree of dissimilarity between clusters and the degree of similarity within each cluster, aiming to establish whether a data collection has a cluster structure. To reduce the amount of data we needed to analyze in this study, we used the k-means approach, currently the best-known and most widely used clustering algorithm. The k-means algorithm has been extended in several ways in the literature, and the method and its extensions that fix a minimum number of clusters remain sensitive, to some degree, to the initializations performed.

Unsupervised learning
Unsupervised learning algorithms are given an unlabeled dataset and tasked with discovering patterns and relationships within it by grouping or clustering individual data points. The algorithms thus learn from data that has not been preprocessed by labeling or categorizing; rather than responding to feedback, they actively seek patterns in the data and adjust their behavior accordingly. One of the most useful applications of unsupervised learning is estimating the probability density function. Although unsupervised learning may be used in various contexts, including summarizing and analyzing data, density estimation in statistics is a prominent example.

Decision Tree
A decision tree is a supervised learning algorithm that may be used to segment data according to certain data properties. The goal is to learn decision rules from the data that reliably predict the value of a target variable. The rules are typically formulated as if-then-else clauses, and the decision tree is structured to correspond to them [19]. Decision trees do not require a significant amount of processing capacity, and they are a handy tool for handling continuous data [5].
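As a sketch of such if-then-else rules, a small hand-written tree might look like the following (the split thresholds are illustrative only, not values learned from the study's data):

```python
def classify(glucose, bmi):
    """A two-level decision tree written as nested if-then-else rules.
    The thresholds are hypothetical, chosen only to illustrate the structure."""
    if glucose >= 140:       # root split: glucose level (mg/dL)
        return "diabetic"
    else:
        if bmi >= 30:        # second split: body mass index
            return "at risk"
        else:
            return "healthy"

print(classify(glucose=150, bmi=22))  # -> diabetic
print(classify(glucose=110, bmi=32))  # -> at risk
```

A learned tree differs only in that the split features and thresholds are chosen automatically to best separate the training labels.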

Random Forest
Random Forest is an ensemble method that constructs a large number of decision trees and combines their predictions to produce a classification; it may be used for classification and other problems [8]. It can also be used to rank variables by their significance [7].
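A minimal sketch of the ensemble idea (with made-up data, and depth-1 "stumps" standing in for full trees): each tree is trained on a bootstrap sample with a randomly chosen feature, and the forest predicts by majority vote.

```python
import random

def fit_stump(X, y, feature):
    """One 'tree' of the forest: a depth-1 stump thresholding a single feature
    at the midpoint of the two class means."""
    pos = [row[feature] for row, label in zip(X, y) if label == 1]
    neg = [row[feature] for row, label in zip(X, y) if label == 0]
    if not pos or not neg:
        return feature, None, int(bool(pos))  # one-class sample: constant stump
    thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return feature, thr, None

def predict_stump(stump, x):
    feature, thr, const = stump
    if thr is None:
        return const
    return 1 if x[feature] >= thr else 0

def fit_forest(X, y, n_trees=25, seed=0):
    """Each stump sees a bootstrap sample of the rows and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # bootstrap sample (with replacement)
        feature = rng.randrange(len(X[0]))         # random feature choice
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict_forest(forest, x):
    votes = sum(predict_stump(s, x) for s in forest)
    return 1 if votes * 2 >= len(forest) else 0   # majority vote

X = [[25, 21.0], [31, 24.5], [38, 29.0], [42, 31.5]]  # hypothetical [age, BMI]
y = [0, 0, 1, 1]
forest = fit_forest(X, y)
print(predict_forest(forest, [40, 30.0]))
```

A real Random Forest grows full trees and samples a feature subset at every split; the bootstrap-plus-vote structure is the same.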

Support Vector Machine
Support vector machines (SVMs) are used to locate a hyperplane that separates the positive and negative samples by the largest feasible margin [6]. The objective of the SVM is to find, in the multidimensional feature space, a hyperplane capable of correctly classifying the data points. In an SVM model, the instances are treated as points in space, mapped so that examples of the various categories are separated by as wide a gap as possible. New instances are then mapped into the same space and assigned a category based on which side of the gap they fall.
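As a sketch of the hyperplane view (the weights below are made up; a real SVM learns `w` and `b` by maximizing the margin on the training data):

```python
import math

# Hypothetical separating hyperplane w.x + b = 0 in a 2-D feature space.
w = [0.5, 1.0]
b = -30.0

def decision(x):
    """Signed score: positive side vs negative side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x):
    return 1 if decision(x) >= 0 else 0

def margin(x):
    """Geometric distance of a point from the hyperplane; an SVM maximizes
    this distance for the closest training points (the support vectors)."""
    return abs(decision(x)) / math.hypot(*w)

print(classify([40, 30.0]))  # 0.5*40 + 1.0*30 - 30 = 20 >= 0 -> class 1
```

The classification rule itself is just the sign of `decision(x)`; the training procedure (quadratic optimization, kernels) is what makes SVMs distinctive.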

K-Nearest Neighbors
KNN is a lazy learner: only a local approximation of the target function is computed, and all computation is deferred until classification time. The KNN algorithm [11] is considered one of the most basic machine learning algorithms available.
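A minimal sketch of this lazy behavior (toy data, not the study's dataset): there is no training step at all; the query is compared against every stored sample at prediction time.

```python
import math
from collections import Counter

def knn_predict(X, y, query, k=3):
    """Lazy k-nearest-neighbours: no model is fit; all work happens at query time."""
    dists = sorted((math.dist(row, query), label) for row, label in zip(X, y))
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]  # majority label among k nearest

# Hypothetical [age, BMI] samples with diabetic (1) / non-diabetic (0) labels:
X = [[25, 21.0], [31, 24.5], [38, 29.0], [42, 31.5], [29, 23.0]]
y = [0, 0, 1, 1, 0]
print(knn_predict(X, y, [40, 30.0], k=3))  # nearest neighbours are diabetic -> 1
```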

Logistic Regression
Logistic regression is a generalized linear model: it combines a linear component with a nonlinear link function. The link function transforms the linear output of the model into a classification score. In logistic regression, the logistic function is applied to the linear output, so the result never falls outside the range 0 to 1 [1].
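A minimal sketch of the link function (the weights are illustrative, not fitted coefficients): the linear part produces a real number z, and the logistic function squashes it into (0, 1).

```python
import math

def sigmoid(z):
    """Logistic link: maps any real-valued linear output into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# The linear component of the model computes z = w.x + b; the logistic
# function then turns z into a probability-like score. Made-up weights:
w, b = [0.05, 0.1], -4.0
x = [40, 30.0]  # hypothetical [age, BMI]
z = sum(wi * xi for wi, xi in zip(w, x)) + b
p = sigmoid(z)
print(round(p, 3))  # -> 0.731, strictly between 0 and 1
```

In training, the weights `w` and `b` are chosen to maximize the likelihood of the observed labels; the link function itself never changes.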

Naïve Bayes
The Naive Bayes method is a classification strategy that can handle missing values during classification. It is a supervised learning approach. The central premise of this statistical methodology is that Naive Bayes works on conditional probability [20]. The method is also robust against noise points when calculating conditional probabilities.
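A minimal sketch of the conditional-probability premise on categorical features (the data is made up): class priors and per-feature conditionals P(feature = value | class) are estimated by counting, and prediction multiplies them together under the "naive" independence assumption.

```python
from collections import defaultdict

def fit_nb(rows, labels):
    """Count class priors and per-feature value counts per class."""
    prior = defaultdict(int)
    cond = defaultdict(int)
    for row, label in zip(rows, labels):
        prior[label] += 1
        for i, value in enumerate(row):
            cond[(label, i, value)] += 1
    return prior, cond, len(rows)

def predict_nb(model, row):
    """Pick the class maximizing P(class) * prod_i P(value_i | class)."""
    prior, cond, n = model
    best, best_score = None, -1.0
    for label, count in prior.items():
        score = count / n
        for i, value in enumerate(row):
            # Laplace smoothing keeps unseen values from zeroing the product
            score *= (cond[(label, i, value)] + 1) / (count + 2)
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical categorical data: (family_history, overweight) -> diabetic?
rows = [("yes", "yes"), ("yes", "no"), ("no", "no"), ("no", "no")]
labels = [1, 1, 0, 0]
model = fit_nb(rows, labels)
print(predict_nb(model, ("yes", "yes")))  # -> 1
```

The smoothing step is also what lets the method tolerate sparse or missing evidence gracefully.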

Literature Review
This section briefly discusses several works on intelligent systems and machine learning methods for modelling and predicting different types of diabetes diagnoses.
For diabetes prediction, Al-Zebari and Sengur [1] analyzed and compared the outcomes of different machine learning strategies. Their study used the MATLAB classification learner tool with several machine learning techniques, including decision trees, support vector machines, K-nearest neighbors, and logistic regression. The outcomes were judged according to classification accuracy.
Jader and Aminifar [2] proposed a fast and accurate artificial neural network for predicting new cases of diabetes mellitus. The design criterion was to minimize the error function during neural network training. With the designed ANN model, the error rate decreased during training and prediction accuracy increased to 91%, showing that the network design improves prediction accuracy.
Alapati and Sindhu [3] and Sonar and Malini [20] discussed the link between high blood sugar levels and the development of diabetes. They also presented several distinct intelligent systems that use classifiers to predict diabetes with machine learning methods such as decision trees, SVMs, Naive Bayes, and ANN algorithms. Diabetes risk could be estimated based on these groupings.

Methodology
The proposed model combines machine learning techniques. According to the flow chart shown in Figure 1, the model employs an unsupervised approach (K-means clustering) for data reduction. For prediction, the model incorporates supervised classification algorithms: decision tree, random forest, SVM, KNN, logistic regression, and naive Bayes.

Data Collection
Upgraded healthcare applications rely on the large amounts of data gathered by healthcare systems and laboratories worldwide [12], which allows them to work more effectively and deliver more accurate results. Our models were trained using data from a range of governmental and commercial laboratories in the Iraqi Kurdistan Region. The dataset contains 730 instances with seven attributes: age, weight, height, number of gestations, family medical history, and the results of diabetes tests.

Feature Extraction
Feature extraction is a useful technique when the number of features required for processing must be reduced without losing crucial or relevant information. Feature extraction can also remove redundant information from the dataset [19]. From the weight and height attributes in our data, the body mass index (BMI) can be computed. As shown in Equation 1, BMI correlated significantly and positively with weight and significantly and negatively with height.
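The derived BMI feature follows the standard formula, weight in kilograms divided by height in metres squared; a sketch of deriving the column (with made-up values, not the study's data):

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

# Derive a BMI column from existing weight and height columns:
weights = [58.0, 70.0, 85.0]
heights = [1.60, 1.75, 1.68]
bmis = [round(bmi(w, h), 1) for w, h in zip(weights, heights)]
print(bmis)  # -> [22.7, 22.9, 30.1]
```

Replacing two correlated columns (weight, height) with one derived column is exactly the kind of redundancy reduction feature extraction aims for.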

Data Pre-Processing
Data normalization is one method of data pre-processing [15]. It converts raw data into an understandable form; normalization is a scaling or mapping method for transforming non-normalized data into normalized data [16]. Our model uses the Z-score method, which normalizes every value in the dataset. The formula for z-score normalization is given in Equation 2, and the formula for the standard deviation required to produce a z-score is given in Equation 3.
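A sketch of z-score normalization on one column (made-up values; this uses the population standard deviation, which may differ from the paper's Equation 3 if a sample formula was used):

```python
import math

def z_scores(values):
    """Z-score normalization: subtract the column mean, divide by its
    standard deviation, giving a column with mean 0 and std 1."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

ages = [22, 28, 31, 35, 44]  # hypothetical column
normalized = z_scores(ages)
print([round(z, 2) for z in normalized])
```

After this transform, features measured on different scales (age, weight, height) contribute comparably to distance-based methods such as K-means and KNN.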

Clustering Algorithm
As described in Section 2, clustering divides a collection of data points into groups such that points within the same group are more similar to one another than to points in other groups [2]. The method analyzes the dissimilarity between clusters and the similarity within each cluster [18], and aims to determine whether a data collection has a cluster structure. To cut down on the quantity of data we needed to evaluate, we used the k-means technique, the best-known and most widely used clustering algorithm currently available. Clustering can be performed by unsupervised learning [13], which is used in pattern recognition and machine learning. With an unsupervised approach, choosing an adequate number of clusters into which the data may be divided is essential, and k-means and its extensions remain sensitive to initialization. To determine the value of k, i.e., the number of clusters (K), for the K-means algorithm, we used the Elbow Method, shown in Figure 2, calculating the WCSS (Within-Cluster Sum of Squares) for each value of K.
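A sketch of the Elbow Method on made-up one-dimensional values: run k-means for several values of k, compute the WCSS for each, and look for the k where the curve flattens.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 1-D points: assign each point to its nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialization (k-means is sensitive to this)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

def wcss(centroids, clusters):
    """Within-Cluster Sum of Squares: squared distances to own centroid."""
    return sum((p - c) ** 2 for c, cl in zip(centroids, clusters) for p in cl)

# Hypothetical glucose-like values with two obvious groups:
points = [88, 90, 92, 95, 150, 155, 158, 160]
for k in (1, 2, 3, 4):
    c, cl = kmeans(points, k)
    print(k, round(wcss(c, cl), 1))
# the WCSS drops sharply from k=1 to k=2, then flattens: the elbow is k=2
```

The "elbow" is the value of K beyond which adding clusters yields only marginal WCSS reduction.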

Classification Algorithm
The outputs of the clustering step, in which the data is reduced and organized into groups, serve as input for the classification stage. In this stage, a new input sample is assigned to the cluster of similar data produced in the previous stage and then classified. The proposed system incorporates several classification strategies: decision tree, random forest, support vector machine (SVM), K-Nearest Neighbors (KNN), logistic regression, and naive Bayes.
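A minimal sketch of this hybrid hand-off (with made-up numbers; the paper's exact pipeline may differ in detail): a new sample is routed to the nearest cluster centroid, and a classifier, here k-NN, runs only on that cluster's training points.

```python
import math
from collections import Counter

# Hypothetical output of the clustering stage: each cluster's centroid plus
# the training points (and their labels) that fell into it.
clusters = {
    0: {"centroid": [91.0, 23.0],
        "points": [[88, 21.0], [90, 24.5], [95, 23.5]], "labels": [0, 0, 0]},
    1: {"centroid": [155.0, 30.0],
        "points": [[150, 29.0], [158, 31.5], [160, 30.0]], "labels": [1, 1, 1]},
}

def hybrid_predict(clusters, query, k=3):
    """Stage 1: nearest centroid selects the cluster (the data reduction).
    Stage 2: k-NN classification restricted to that cluster's points."""
    cid = min(clusters, key=lambda c: math.dist(clusters[c]["centroid"], query))
    cluster = clusters[cid]
    dists = sorted(
        (math.dist(p, query), label)
        for p, label in zip(cluster["points"], cluster["labels"])
    )
    top = [label for _, label in dists[:k]]
    return Counter(top).most_common(1)[0][0]

print(hybrid_predict(clusters, [152, 28.0]))  # falls in cluster 1 -> class 1
```

Restricting the classifier to one cluster's points is what makes the clustering step act as data reduction for the classification step.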

Prediction System
The proposed model was implemented in the Python programming language, and a confusion matrix was used to evaluate and compare the classification techniques against one another.
In our investigation, we used several comparison criteria, including accuracy and precision.
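These criteria follow directly from the confusion matrix counts; a sketch with made-up labels and predictions:

```python
def confusion(y_true, y_pred):
    """2x2 confusion matrix counts for a binary classifier."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # illustrative ground truth
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]   # illustrative model output
tp, tn, fp, fn = confusion(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # fraction of all samples classified correctly
precision = tp / (tp + fp)           # fraction of predicted positives that are true
print(accuracy, precision)
```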

Result Analysis
By combining supervised and unsupervised algorithms, this research presents a reliable prediction model for recognizing gestational diabetes mellitus. We first used classification techniques alone; we then combined the unsupervised K-means clustering algorithm with the classification methods in the proposed model. According to Table 2, the proposed model provides more accurate results for pregnant women with diabetes than the classification methods alone. We also compare the proposed model with accuracies previously obtained by intelligent models in existing research. Table 9 shows that, in most cases, the proposed model (shown in Table 2) has better accuracy.

Conclusions
The dataset used to build our proposed model came from public and private laboratories in the Kurdistan Region of Iraq and contains 730 instances with 7 attributes. To recognize gestational diabetes, a hybrid prediction model has been created within the proposed framework. The K-means clustering algorithm is employed to reduce the amount of data, and several classification methods are used for classification and prediction: decision trees, random forests, SVM, KNN, logistic regression, and naive Bayes. Combining supervised and unsupervised algorithms improved the accuracy achieved compared to previous works.
According to the data, combining clustering with classification results in an improvement in accuracy.

Declaration of Competing Interest:
The authors declare they have no known competing interests.