End-To-End ML Model

    This post walks through building and deploying a machine learning model for heart disease prediction. It covers the essential steps in the machine learning lifecycle, from dataset exploration and feature engineering to model training, evaluation, and Docker deployment, with practical insights and code examples along the way.

    Step 1: Load & Explore Dataset

    The Python code below imports the necessary libraries and proceeds to load the heart disease dataset, which will serve as the foundation for the content in this blog post.

    Step 2: Exploratory Data Analysis (EDA)

    Exploratory Data Analysis is a preliminary phase in data analysis where the focus is on understanding the main characteristics of a dataset, uncovering patterns, and identifying trends using statistical and visual methods.

    The provided Python code initially displays all column names along with the count of non-null values and data types. Subsequently, it utilizes the Seaborn library to create visualizations illustrating the distribution of age, the correlation between age and serum cholesterol level, the maximum heart rate categorized by gender, and the distribution of chest pain types.

    The visualizations above illustrate the age distribution, the correlation between age and serum cholesterol level, the maximum heart rate categorized by gender, and the distribution of chest pain types. A notable insight from the bottom-right chart is that patients whose chest pain type is greater than 0 are more likely to have heart disease than not.

    Step 3: Data Preparation

    Data preparation is a crucial step in the data analysis process, involving the cleaning, transforming, and organizing of raw data into a format suitable for analysis.

    The Python script presented below drops the ‘oldpeak’ column from the data frame because it contains no values. It then checks which columns contain null values and finds that cp, chol, and restecg have gaps. Because cp and restecg are categorical, the script removes the rows with null values in these columns. For the numeric chol column, it fills null values with the column mean.
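
    The cleaning steps described above can be sketched with pandas as follows. This is a minimal illustration on a tiny synthetic frame, not the real dataset: the values and the target column name are assumptions made up for the example, and only the column names oldpeak, cp, chol, and restecg come from the post.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame (not the real dataset) with the same kinds of gaps.
df = pd.DataFrame({
    "age": [63, 45, 58, 50, 61],
    "cp": [1.0, np.nan, 2.0, 0.0, 3.0],
    "chol": [233.0, np.nan, 250.0, 204.0, np.nan],
    "restecg": [0.0, 1.0, np.nan, 1.0, 0.0],
    "oldpeak": [np.nan] * 5,
    "target": [1, 0, 1, 0, 1],  # hypothetical label column name
})

# Drop the column that holds no values at all.
df = df.drop(columns=["oldpeak"])

# Report which columns still contain nulls.
null_counts = df.isnull().sum()
print(null_counts[null_counts > 0])

# cp and restecg are categorical, so drop rows where either is missing.
df = df.dropna(subset=["cp", "restecg"])

# chol is numeric, so fill its gaps with the column mean.
df["chol"] = df["chol"].fillna(df["chol"].mean())
print(df)
```

    Note that the mean here is computed after the rows with missing categorical values are removed. Computing it before the drop is an equally reasonable choice and gives a slightly different fill value.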

    Step 4: Feature Engineering

    Feature engineering is the process of creating new, meaningful features or transforming existing ones in a dataset to enhance the performance and interpretability of machine learning models.

    Normalization:

    • One common feature engineering technique is normalization. Normalization rescales numeric features to the range 0 to 1 so that no feature dominates the model merely because of its scale. This helps when features have different ranges and the algorithm is sensitive to the scale of its inputs, as K-Nearest Neighbors (KNN) and neural networks are.

    Standardization:

    • Another common feature engineering technique is standardization. Standardization rescales each feature to have a mean of zero and a variance of one. It benefits algorithms that assume features are centered around zero with variances of the same order, such as Support Vector Machines (SVMs) and Linear Regression with L1 or L2 regularization.

    In the Python code below, we will be using standardization for the numerical fields to scale features to have a mean of zero and a variance of one.
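
    To make the difference between the two techniques concrete, here is a small sketch of both formulas in plain Python. The cholesterol readings are illustrative values, not taken from the dataset, and the helper names are hypothetical. scikit-learn's StandardScaler applies the same standardization formula (using the population standard deviation) column by column.

```python
from statistics import fmean, pstdev

def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values to mean 0 and variance 1 (population std)."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Illustrative serum cholesterol readings, not taken from the dataset.
chol = [204.0, 233.0, 250.0, 286.0, 354.0]

chol_normalized = min_max_normalize(chol)
chol_standardized = standardize(chol)

print([round(v, 2) for v in chol_normalized])
print([round(v, 2) for v in chol_standardized])
```

    Both helpers divide by the spread of the data, so a constant column would need special handling before either one is applied.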

    Step 5: Feature Selection

    Feature selection is a process in machine learning where a subset of relevant and significant features (variables) is chosen from the original set of features. The goal is to improve model performance, reduce overfitting, and enhance interpretability by focusing on the most important attributes.

    RandomForestClassifier supports feature selection as a by-product of how it builds decision trees. Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions. While each tree is built, a random subset of features is considered at each split. The forest then scores each feature by how much its splits reduce impurity, averaged over all trees, and exposes these scores as feature_importances_. Features with low scores are candidates for removal.

    As seen in the chart above, the feature that has the highest importance to heart disease is cp (chest pain type), followed by ca (number of major vessels colored by fluoroscopy) and thal (thalassemia).
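
    A ranking like the one described can be produced along these lines. This is a sketch on synthetic data, so the feature names and scores below are placeholders and say nothing about the heart disease dataset. The data is built so that only the first column determines the label, which makes the expected ranking easy to check.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: only the first column determines the label.
feature_names = ["informative", "noise_a", "noise_b", "noise_c"]
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# feature_importances_ holds the mean impurity decrease per feature,
# normalized so that the values sum to 1.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```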

    Step 6: Model Training & Prediction

    In this section, we will train a logistic regression model and use it to predict whether a patient has heart disease. Logistic regression is a supervised machine learning algorithm for binary classification problems: it predicts which of two classes an instance belongs to.

    The Python code below imports the necessary classes and functions from scikit-learn. LogisticRegression creates the logistic regression model, and the accuracy score, classification report, and confusion matrix are computed with functions from sklearn.metrics.

    An instance of the logistic regression model (log_reg_model) is initialized. The model is then trained with the fit method, where it learns the patterns and relationships between the transformed training features (X_train_transformed) and their corresponding labels (y_train).

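    Once the model is trained, predictions are generated on the test features. The test features must pass through the same fitted transformation as the training features, so the model should predict on X_test_transformed rather than on the raw X_test. Note that the code calls model.transform; scikit-learn’s LogisticRegression has no transform method, so model here presumably refers to the fitted transformer from the earlier steps and not to log_reg_model. The accuracy score, confusion matrix, and a detailed classification report are then computed from y_test and the predictions to evaluate the trained model on the test set.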
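
    The training and evaluation flow described above might look like the following sketch. It runs on synthetic data from make_classification, so its scores will differ from the ones reported in this post. Using StandardScaler as the transformation is an assumption made for illustration; the variable names follow the ones mentioned in the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data standing in for the heart disease features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit the transformation on the training split only, then reuse it on the test split.
scaler = StandardScaler()
X_train_transformed = scaler.fit_transform(X_train)
X_test_transformed = scaler.transform(X_test)

log_reg_model = LogisticRegression(max_iter=1000)
log_reg_model.fit(X_train_transformed, y_train)

# Predict on the transformed test features, never on the raw ones.
y_pred = log_reg_model.predict(X_test_transformed)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(conf_matrix)
print(classification_report(y_test, y_pred))
```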

    The model’s performance metrics are reported as follows:

    1. Accuracy: 0.78
      • The accuracy metric indicates that the model correctly predicted the class labels for approximately 78% of the instances in the test set.
    2. Confusion Matrix:
      • The confusion matrix provides a detailed breakdown of the model’s predictions. It reveals that:
        • True Positive (TP): 82 instances were correctly predicted as Class 1.
        • True Negative (TN): 70 instances were correctly predicted as Class 0.
        • False Positive (FP): 23 instances were incorrectly predicted as Class 1.
        • False Negative (FN): 21 instances were incorrectly predicted as Class 0.
    3. Classification Report:
      • Precision: Precision measures the accuracy of positive predictions. For Class 0, precision is 0.77, and for Class 1, precision is 0.78.
      • Recall: Recall (Sensitivity or True Positive Rate) assesses the model’s ability to identify all relevant instances. Class 0 has a recall of 0.75, and Class 1 has a recall of 0.80.
      • F1-Score: The F1-score is the harmonic mean of precision and recall. Class 0 has an F1-score of 0.76, and Class 1 has an F1-score of 0.79.
      • Support: The number of actual instances of each class in the test set, here 93 for Class 0 (TN + FP) and 103 for Class 1 (TP + FN).
    4. Overall Summary:
      • The overall accuracy is 0.78.
      • The macro average, which gives equal weight to both classes, is approximately 0.77 for precision, recall, and F1-score.
      • The weighted average, which weights each class by its support, is 0.78 for precision, recall, and F1-score.
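
    The figures in the list above can be re-derived from the four confusion-matrix counts alone. The short sketch below does that arithmetic by hand, treating Class 1 (heart disease) as the positive class; the variable names are chosen for this example.

```python
# Counts taken from the confusion matrix reported above.
tp, tn, fp, fn = 82, 70, 23, 21
total = tp + tn + fp + fn

accuracy = (tp + tn) / total

# Class 1 (heart disease) is the positive class.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# For Class 0 the roles of the counts swap.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy  {accuracy:.2f}")
print(f"class 0   precision {precision_0:.2f}  recall {recall_0:.2f}  f1 {f1_0:.2f}")
print(f"class 1   precision {precision_1:.2f}  recall {recall_1:.2f}  f1 {f1_1:.2f}")
```

    Rounded to two decimals, the results agree with the accuracy, precision, recall, and F1-score values listed above.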

    Step 7: MLflow

    MLflow is an open-source platform designed to manage the end-to-end machine learning lifecycle. It provides tools and components to help with various stages of the machine learning process, including experimentation, reproducibility, and deployment. MLflow was originally developed by Databricks and is an open-source project with a community-driven development model.

    Logging Experiments

    MLflow logging is the process of recording and tracking the details of machine learning experiments. It captures information such as parameters, metrics, artifacts, and model details while the machine learning code runs. This logged information is crucial for reproducibility, collaboration, and model management.

    Here are the key components of MLflow logging:

    1. Parameters:
      • MLflow allows you to log the parameters used in your machine learning experiments. Parameters are the configuration settings or hyperparameters that influence the behavior of your models.
      • Example: mlflow.log_param("model_type", "Logistic Regression")
    2. Metrics:
      • Metrics represent quantitative measurements of the performance of your models. Common metrics include accuracy, precision, recall, F1 score, etc.
      • Example: mlflow.log_metric("accuracy", accuracy)
    3. Artifacts:
      • Artifacts are the output files or results generated during an experiment. This can include model files, plots, images, and any other relevant outputs.
    4. Model Logging:
      • MLflow provides functions to log and save machine learning models in a standardized format. This allows you to easily reproduce and deploy models later.
      • Example: mlflow.sklearn.log_model(log_reg_model, "log_reg_model")
    5. Start and End Runs:
      • MLflow uses the concept of runs to represent individual executions of your machine learning code. You can start a run using mlflow.start_run() and end it using mlflow.end_run().

    The following Python code demonstrates the integration of logistic regression with MLflow logging.

    Step 8: K-Fold Cross Validation

    K-fold cross-validation is a technique used in machine learning to assess the performance of a predictive model. The dataset is split into K folds of approximately equal size, and the model is trained and evaluated K times. In each iteration, a different fold serves as the test set and the remaining K-1 folds form the training set, so every sample is used for testing exactly once.
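
    The splitting scheme itself is simple enough to write out by hand. The helper below is a hypothetical illustration of how the folds are formed, not the scikit-learn implementation, and it skips the shuffling that is usually applied first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=12, k=5))
for i, (train, test) in enumerate(folds, start=1):
    print(f"fold {i}: test={test} train_size={len(train)}")
```

    With 12 samples and 5 folds, the test folds have sizes 3, 3, 2, 2, and 2, and together they cover every index exactly once.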

    The Python code below uses scikit-learn’s cross_val_score to perform 5-fold cross-validation on a logistic regression model (log_reg_model_cv) trained with transformed features (X_train_transformed), calculating accuracy scores for each fold and displaying both the individual scores and the mean cross-validation accuracy.

    The output shows the accuracy scores from 5-fold cross-validation of the logistic regression model. The individual scores are displayed as an array, [0.764, 0.801, 0.808, 0.763, 0.827], where each value is the model’s accuracy on one held-out fold after training on the other four.

    The mean cross-validation accuracy, calculated by averaging these individual scores, is presented as 0.79. This mean accuracy provides an overall assessment of the model’s performance across different training and testing subsets, serving as a more reliable indicator of its generalization capabilities compared to a single train-test split. In this context, an accuracy of 0.79 suggests that, on average, the model correctly predicted the class labels for approximately 79% of the instances during the cross-validation process.
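
    A cross-validation run of this kind can be sketched with cross_val_score as shown here. The data is synthetic, so the scores will not match the ones reported above; the variable names mirror those mentioned in the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for X_train_transformed and y_train.
X_train_transformed, y_train = make_classification(
    n_samples=500, n_features=8, random_state=0
)

log_reg_model_cv = LogisticRegression(max_iter=1000)

# cv=5 gives five folds; scoring="accuracy" returns one accuracy per fold.
cv_scores = cross_val_score(
    log_reg_model_cv, X_train_transformed, y_train, cv=5, scoring="accuracy"
)

print("Cross-validation scores:", cv_scores)
print(f"Mean CV accuracy: {cv_scores.mean():.2f}")
```

    One design consideration: if the features are scaled before cross-validation, the scaler has already seen every fold. Wrapping the scaler and the model in a scikit-learn Pipeline and passing the pipeline to cross_val_score avoids that leakage.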

    Step 9: Pre-Deployment

    In this phase, the model is tested before deployment. This testing phase is crucial for identifying and addressing potential issues, bugs, or performance bottlenecks that could impact the reliability and functionality of the model in a live production environment.

    unittest is a testing framework in Python that is part of the Python Standard Library. It provides a set of conventions and methods for writing and running tests to verify the correctness of code. The unittest module, inspired by Java’s JUnit, allows developers to create and execute test cases, organize tests into test suites, and perform various types of assertions to check whether the expected behavior of the code under test matches the actual behavior.

    In the Python code below, we use the unittest module to check that every output of the model is either 0 (no heart disease) or 1 (heart disease).

    The output indicates that the unit test ran successfully and that the test case (TestLogisticRegression) passed without any errors or failures.
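
    A test of this kind might be written as in the sketch below. The model here is fitted on synthetic data inside setUp purely so the example is self-contained; a real pre-deployment test would load the trained model and held-out features instead. The test method name is an assumption.

```python
import unittest

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


class TestLogisticRegression(unittest.TestCase):
    def setUp(self):
        # Synthetic data; a real test would load the trained model instead.
        X, y = make_classification(n_samples=200, n_features=6, random_state=0)
        self.X = X
        self.model = LogisticRegression(max_iter=1000).fit(X, y)

    def test_predictions_are_binary(self):
        predictions = self.model.predict(self.X)
        for value in predictions:
            self.assertIn(value, (0, 1))


# Run the suite programmatically so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestLogisticRegression)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

    Running the suite through TextTestRunner instead of unittest.main() keeps the example usable inside a notebook, where unittest.main() would try to exit the process.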

    Step 10: Deployment

    Docker is a platform that enables developers to package and distribute applications, along with their dependencies, in a consistent and reproducible manner. In the context of machine learning (ML) deployment, Docker provides a way to encapsulate an ML model, its dependencies, and the runtime environment into a container.

    To package the model using Docker, you’ll need to follow these general steps:

    • Step 1: Create a Dockerfile: This file contains instructions for building a Docker image. In your case, it would include the necessary dependencies and setup for running your model. Create a file named Dockerfile (no file extension) in the same directory as your Python script. Add the following content:
    • Step 2: Create a requirements.txt file: If you have external dependencies (e.g., specific versions of libraries), list them in a requirements.txt file. Create a file named requirements.txt and list the required packages:
    • Step 3: Build the Docker image: Open a terminal, navigate to the directory containing your Dockerfile, and run the following command to build the Docker image:
    • Step 4: Run the Docker container: After the image is built, you can run a container based on that image. This command maps port 4000 on your host machine to port 80 in the Docker container. Adjust the ports as needed.
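
    As one way to make the four steps concrete, the Python helper below writes a minimal Dockerfile and requirements.txt and prints the build and run commands. Everything specific in it is an assumption for illustration: the python:3.11-slim base image, the app.py script name, the package list, and the heart-disease-model image tag. Only the 4000-to-80 port mapping comes from the steps above.

```python
import tempfile
from pathlib import Path

# Assumed names: python:3.11-slim as the base image, app.py as the serving script.
DOCKERFILE = """\
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 80
CMD ["python", "app.py"]
"""

# Assumed package list; pin the versions used during training in a real project.
REQUIREMENTS = "pandas\nscikit-learn\nmlflow\n"


def write_docker_files(project_dir):
    """Write Dockerfile and requirements.txt into project_dir and return their paths."""
    project_dir = Path(project_dir)
    dockerfile = project_dir / "Dockerfile"
    requirements = project_dir / "requirements.txt"
    dockerfile.write_text(DOCKERFILE)
    requirements.write_text(REQUIREMENTS)
    return dockerfile, requirements


# A temporary directory stands in for the project directory in this sketch.
project_dir = tempfile.mkdtemp()
dockerfile_path, requirements_path = write_docker_files(project_dir)

IMAGE_TAG = "heart-disease-model"  # hypothetical image name
print(f"docker build -t {IMAGE_TAG} {project_dir}")
print(f"docker run -p 4000:80 {IMAGE_TAG}")
```

    Copying requirements.txt and installing dependencies before copying the rest of the code lets Docker reuse the cached dependency layer when only the application code changes.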

    The model should now be packaged and running in a Docker container. This is a basic example, and depending on the actual requirements, we might need to customize the Dockerfile or take additional steps for a more complex setup.

    Conclusion

    In conclusion, this blog post walked through various crucial steps in the machine learning lifecycle, using a heart disease prediction task as an example. Here’s a summary of the key points covered:

    1. Step 1: Load & Explore Dataset:
      • Import necessary libraries and load the heart disease dataset.
    2. Step 2: Exploratory Data Analysis (EDA):
      • Display basic information about the dataset.
      • Understand the dataset’s main characteristics, patterns, and trends.
      • Visualize distributions and correlations in the data using Seaborn.
    3. Step 3: Data Preparation:
      • Clean and organize the data.
      • Handle missing values and perform necessary transformations.
    4. Step 4: Feature Engineering:
      • Standardize numerical features so that each has a mean of zero and a variance of one.
      • Enhance the performance and interpretability of machine learning models.
    5. Step 5: Feature Selection:
      • Use RandomForestClassifier for feature selection.
      • Identify and focus on the most important features.
    6. Step 6: Model Training & Prediction:
      • Train a logistic regression model.
      • Evaluate model performance using accuracy, confusion matrix, and classification report.
    7. Step 7: MLflow Logging:
      • Log experiments, parameters, metrics, and artifacts using MLflow.
      • Facilitate reproducibility, collaboration, and model management.
    8. Step 8: K-Fold Cross Validation:
      • Assess model performance using K-fold cross-validation.
      • Obtain accuracy scores for different training and testing subsets.
    9. Step 9: Pre-Deployment Testing:
      • Use unittest to perform unit testing.
      • Check the validity of model predictions.
    10. Step 10: Deployment with Docker:
      • Package the model, dependencies, and runtime environment into a Docker container.
      • Create a Dockerfile and build the Docker image.
      • Run the Docker container for deployment.

    By following these steps, from data exploration to deployment, we have covered essential aspects of an end-to-end machine learning project.