Abstract
Optimization in machine learning operates on two levels: during training, a model's weights are iteratively updated to minimize a loss function, while hyperparameter tuning searches over settings such as the learning rate and step size that govern this process, guiding the model toward a better minimum and improved performance. This study aims to enhance heart disease prediction accuracy using various hyperparameter optimization algorithms.
Heart disease classification involves multiple interdependent features, requiring advanced ensemble techniques and hyperparameter optimization for accurate results. The research evaluates model performance using metrics such as AUC, precision, recall, F1-score, and the confusion matrix, comparing default and optimized models.
Bayesian optimization (via Hyperopt) and genetic algorithm-based tuning (via TPOT, run for 5 to 10 generations) were applied. Additionally, Optuna was used in conjunction with the Random Forest algorithm for fine-tuned optimization. Among the models tested, the Bayesian-optimized Support Vector Machine (SVM) achieved the highest accuracy (90%), followed by Random Forest tuned with Bayesian optimization (89%) and the default Random Forest (86.6%). Genetic Algorithm SearchCV achieved 88.5% accuracy with 10 generations, while TPOT with 5 generations reached 86.8%. In contrast, the Optuna-optimized SVM had the lowest accuracy (84%).
This study also compares execution time and performance across the optimized models using a comprehensive evaluation of accuracy, precision, recall, F1-score, macro average, and the confusion matrix. Exploratory data analysis and preprocessing, including one-hot encoding and standard scaling, were applied to both the 13-feature and the extended 31-feature datasets. Among the default machine learning models, Gaussian Naive Bayes (84%) and logistic regression (83%) substantially outperformed the dummy classifier baseline (54%).
Data Preprocessing and Normalization
The heart disease dataset used in this study is the open-source Cleveland dataset donated by David W. Aha. Although the full Cleveland database contains 76 raw attributes, the standard 14-feature subset was taken as the starting point; one-hot encoding later expanded the 13 predictor features to 31, as described below. Categorical features such as sex were encoded (1 = male, 0 = female), and chest pain type was classified into four categories: typical angina, atypical angina, non-anginal pain, and asymptomatic.
Other encoded features include:
- trestbps: resting blood pressure (mm Hg)
- chol: serum cholesterol (mg/dL)
- smoke: smoking status (1 = yes, 0 = no)
- restecg: resting ECG results (0 = normal, 1 = abnormal, 2 = left ventricular hypertrophy)
- exang: exercise-induced angina (1 = yes, 0 = no)
- slope: slope of the peak exercise ST segment (0 = upsloping, 1 = flat, 2 = downsloping)
- target: presence of heart disease (1 = disease present, 0 = no disease)
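As a minimal sketch of this encoding step (assuming a pandas DataFrame, a hypothetical file name, and the conventional Cleveland column names cp, restecg, slope, and thal for the multi-category fields), the expansion can be done with pandas.get_dummies:

```python
import pandas as pd

# File name and column names are assumptions for illustration;
# adjust to the actual Cleveland file used in the study.
df = pd.read_csv("heart.csv")

# One-hot encode the multi-category columns; binary fields such as
# sex, exang, and smoke are already 0/1 and need no expansion.
categorical_cols = ["cp", "restecg", "slope", "thal"]  # assumed names
encoded = pd.get_dummies(df, columns=categorical_cols, dtype=int)

print(encoded.shape)  # each category becomes its own 0/1 column
```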
For preprocessing, duplicate records were removed, and missing values were handled using mean imputation. Two scaling techniques were considered: Min-Max normalization and standard scaling. Standard scaling was preferred and applied using the formula

z = (x − μ) / σ

where μ is the feature mean and σ is its standard deviation.
This standardization centers each feature at a mean of 0 with a standard deviation of 1, making it suitable for data measured in different units, such as blood pressure and hemoglobin levels. Unlike Min-Max normalization, which compresses values into the [0, 1] range, standard scaling better preserves the shape of the original distribution.
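A minimal sketch of this scaling step with scikit-learn's StandardScaler follows (the 80/20 train/test split and the variable names, including the `encoded` DataFrame from the previous sketch, are assumptions; the paper does not specify the split). Fitting the scaler on the training portion only avoids leaking test-set statistics:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Assumed: `encoded` is the one-hot-encoded DataFrame from the
# previous sketch, with the label stored in a `target` column.
X = encoded.drop(columns=["target"])
y = encoded["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit on the training split only, then apply the same mean/std
# to the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # z = (x - mean) / std
X_test_scaled = scaler.transform(X_test)
```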
This preprocessing pipeline transformed the dataset from 13 predictor features to 31, ensuring comprehensive model training and evaluation; the full sequence is summarized in Figure 1 and sketched in code after it.
Figure 1: Data Preparation Flowchart
Download Dataset (13 features) → Remove Duplicates → Apply One-Hot Encoding (Categorical Columns) → Concatenate New Encoded Features → Set Target Column → Final Feature Set (31 features)
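Taken together, the flowchart maps onto a short pandas/scikit-learn routine. The sketch below strings the steps together, including the duplicate removal and mean imputation described above; the file name and categorical column names are assumptions, and imputing before encoding is a simplification (the codes are cast back to int afterward):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

def prepare_features(path: str) -> pd.DataFrame:
    """Mirror Figure 1: load, deduplicate, impute, encode, reassemble."""
    df = pd.read_csv(path)                            # download/load (13 features + target)
    df = df.drop_duplicates().reset_index(drop=True)  # remove duplicates

    target = df.pop("target")  # set the target column aside so it is not imputed

    # Mean-impute missing values; the imputer returns a plain array,
    # so the DataFrame is rebuilt with the original column names.
    imputer = SimpleImputer(strategy="mean")
    df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

    # One-hot encode the multi-category columns (assumed names),
    # casting the codes back to int for clean dummy-column names.
    cat_cols = ["cp", "restecg", "slope", "thal"]
    df[cat_cols] = df[cat_cols].astype(int)
    df = pd.get_dummies(df, columns=cat_cols, dtype=int)

    # Concatenate the target back on: final feature set plus label.
    return pd.concat([df, target], axis=1)

features = prepare_features("heart.csv")  # hypothetical file name
print(features.shape)
```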