Classification – KNN Model

    Introduction

    In this blog post, I will embark on a journey into the world of machine learning, specifically classification, using a real-world dataset – the Breast Cancer Wisconsin dataset. The goal is to build a classification model that predicts whether a breast tumor is benign (a non-cancerous growth of cells) or malignant (a cancerous, potentially life-threatening growth) based on a set of features derived from cell nuclei characteristics.

    I’ll be using one of the most popular machine-learning libraries in Python, scikit-learn, which provides a comprehensive set of tools for machine learning.

    Throughout this blog post, I will cover the following key points: (1) Data Exploration, (2) Model Building, (3) Model Evaluation, (4) KNN Tuning, and (5) Model Improvement.

    1.0 Data Exploration

    We’ll start by exploring the Breast Cancer Wisconsin dataset, gaining an understanding of its structure, and the meaning of each feature. We’ll use Seaborn to visualize the data, uncover patterns, and get a feel for the dataset.

    To learn more about the dataset, including documentation and definitions of each column, visit Kaggle.

    Observations:

    • There is a single binary target variable with two classes: 0 for malignant and 1 for benign.
    • There are 569 rows and 31 columns in the dataset (30 features plus the target).
    • There are 357 rows where the target is 1 (benign) and 212 rows where the target is 0 (malignant).
    • There are no missing values in the dataset.
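
    The checks behind these observations can be sketched as follows. This is a minimal sketch that assumes the data is loaded from the copy bundled with scikit-learn (its column names, such as “mean radius,” match the ones used in this post); the post itself does not say how the data was loaded.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: use scikit-learn's bundled copy of the dataset.
# as_frame=True returns pandas objects; .frame holds 30 features plus "target".
data = load_breast_cancer(as_frame=True)
df = data.frame

print(df.shape)                      # (rows, columns)
print(df["target"].value_counts())   # number of rows per class
print(list(data.target_names))       # class names in label order (0, then 1)
print(df.isna().sum().sum())         # total count of missing values
```

    Printing target_names alongside the class counts is a quick way to confirm which label stands for which diagnosis.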

    Next, we generate a pair plot using Seaborn for three specific features (“mean radius,” “mean texture,” and “mean area”) from the Breast Cancer Wisconsin dataset. This pair plot is color-coded by the target variable, which indicates whether tumors are malignant or benign (0 for malignant and 1 for benign).

    By examining the scatter plots, we can look for clusters, patterns, and separations between the two classes (benign and malignant tumors).

    Observations:

    • Examining the scatter plot located in the bottom-left corner, we can observe that when the “mean radius” is less than ~17 and the “mean area” is under 1000, there’s a higher likelihood of the tumor being classified as benign (1).
    • Analyzing the chart displayed above the chart mentioned in the prior observation, it becomes apparent that when the “mean radius” is less than 17, the tumor is more likely to be benign (1). However, “mean texture” separates the two classes (0 and 1) less clearly.

    2.0 Model Building

    Following the Data Exploration step, we transition into the model-building phase. Here, we employ the scikit-learn library to create a KNN classifier. The dataset is divided into a training set and a testing set using the train_test_split function.

    The KNN classifier is then instantiated with a specified number of neighbors (k), in this case, set to 5. The model is trained on the training data, and predictions are made on the test data.
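
    The steps above can be sketched as follows. The split ratio (test_size=0.2) and the seed (random_state=42) are assumptions chosen for illustration, since the post does not state them; only n_neighbors=5 comes from the text.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: features and labels come from scikit-learn's bundled copy.
X, y = load_breast_cancer(return_X_y=True)

# Assumed split: 80% training / 20% testing, with a fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# KNN with k = 5 neighbors, as described in the post.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict a class label (0 or 1) for every row of the test set.
y_pred = knn.predict(X_test)
```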

    3.0 Model Evaluation

    After creating the KNN model, we assess its performance. Model evaluation is crucial to determine how well the model generalizes on unseen data and how accurately it classifies breast tumors as benign or malignant. The evaluation metrics help us gain insights into the model’s strengths and limitations.

    Observations:

    • Accuracy: 0.956140350877193
      • Accuracy is a measure of how many predictions the model got correct out of the total predictions made. In this case, the accuracy is approximately 95.61%, indicating that the model correctly classified about 95.61% of the data points.
    • Confusion Matrix:
      • A confusion matrix is a table that describes the performance of a classification model. It provides information about the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In this case, the confusion matrix is as follows:
        • True Positives (TP): 71 cases were correctly classified as Class 1.
        • True Negatives (TN): 38 cases were correctly classified as Class 0.
        • False Positives (FP): 5 cases were incorrectly classified as Class 1 when they were actually Class 0.
        • False Negatives (FN): 0 cases were incorrectly classified as Class 0 when they were actually Class 1.
    • Classification Report:
      • The classification report provides a summary of various metrics for each class (Class 0 and Class 1) and overall performance:
        • Precision: The precision measures the proportion of true positive predictions out of all positive predictions made by the model. For Class 0, the precision is 1.00, indicating that every time the model predicted Class 0 on the test set, it was correct. For Class 1, the precision is 0.93, indicating that when the model predicts Class 1, it is correct in 93% of cases.
        • Recall (Sensitivity): The recall measures the proportion of true positive predictions out of all actual positives in the dataset. For Class 0, the recall is 0.88, meaning that the model correctly identifies 88% of actual Class 0 instances. For Class 1, the recall is 1.00, indicating that the model correctly identifies all actual Class 1 instances.
        • F1-Score: The F1-score is the harmonic mean of precision and recall, providing a balance between these two metrics. For Class 0, the F1-score is 0.94, and for Class 1, it is 0.97. The weighted average F1-score (0.96) accounts for the class distribution in the dataset.
        • Support: The support is the number of actual instances of each class in the evaluated data. Here that is the test set, with 43 instances of Class 0 and 71 instances of Class 1.
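
    As a worked check, the reported metrics can be recomputed by hand from the four confusion-matrix counts above. The sketch uses only the numbers quoted in this post; the helper f1 is an illustrative name.

```python
# Counts reported in the post, treating Class 1 as the positive class.
tp, tn, fp, fn = 71, 38, 5, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Class 1: the usual definitions.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# Class 0: the roles swap, so its correct predictions are the true negatives.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


print(round(accuracy, 4))                              # 0.9561
print(round(precision_0, 2), round(recall_0, 2))       # 1.0 0.88
print(round(precision_1, 2), round(recall_1, 2))       # 0.93 1.0
print(round(f1(precision_0, recall_0), 2))             # 0.94
print(round(f1(precision_1, recall_1), 2))             # 0.97
```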

    Overall, these evaluation metrics suggest that the KNN model is performing well, with high accuracy, good precision, recall, and F1-scores for both classes.
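
    The three outputs discussed in this section map onto functions in sklearn.metrics. The following is a minimal sketch; the data source, split ratio, and seed are assumptions rather than details given in the post, so the printed numbers may differ from those quoted.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed setup: bundled dataset, 80/20 split, fixed seed.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
y_pred = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train).predict(X_test)

# Fraction of test rows predicted correctly.
accuracy = accuracy_score(y_test, y_pred)

# 2x2 table: rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

# Per-class precision, recall, F1-score, and support as formatted text.
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print(cm)
print(report)
```

    With labels 0 and 1, the matrix is laid out as [[TN, FP], [FN, TP]] when Class 1 is treated as the positive class.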

    4.0 KNN Tuning

    We saw above that using 5 neighbors resulted in an accuracy score of 0.9561, but is 5 neighbors the optimal number?

    One of the essential aspects of optimizing a machine learning model is tuning its hyperparameters. The n_neighbors hyperparameter plays a significant role in KNN, as it determines how many nearest data points should be considered when making predictions. While we initially trained the model with 5 neighbors, we recognize that this may not be the optimal number.

    To make the performance evaluation more accessible and visually informative, we employ Matplotlib. We record and compare the accuracy of the model on both the training and testing data for different values of n_neighbors. These accuracy values are then visualized in a plot, providing a clear depiction of how the model’s accuracy behaves as we adjust the number of neighbors.
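
    A loop of this kind might look like the sketch below. The search range (1 to 20 neighbors), the split settings, and the variable names are assumptions made for illustration; the post does not specify them.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed setup: bundled dataset, 80/20 split, fixed seed.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

neighbors = range(1, 21)  # assumed search range for n_neighbors
train_acc, test_acc = [], []

for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # score() returns the mean accuracy on the given data and labels.
    train_acc.append(knn.score(X_train, y_train))
    test_acc.append(knn.score(X_test, y_test))

# Plot both curves against the number of neighbors.
fig, ax = plt.subplots()
ax.plot(neighbors, train_acc, label="Training accuracy")
ax.plot(neighbors, test_acc, label="Testing accuracy")
ax.set_xlabel("n_neighbors")
ax.set_ylabel("Accuracy")
ax.legend()
```

    Calling plt.show() after this displays the figure when running as a script.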

    As seen in the chart above, as we increase n_neighbors, the training accuracy gradually decreases. This is expected: with more neighbors, each prediction is a vote over a larger neighborhood, so the model fits the training data less closely and is less prone to overfitting.

    Test accuracy tends to increase as n_neighbors increases, peaking at a certain point and then possibly decreasing if the model becomes too biased.

    When n_neighbors is very low (e.g., 1 or 2), the model overfits the training data, resulting in high training accuracy but lower test accuracy. It means the model is too sensitive to noise in the training data.

    There is a sweet spot for n_neighbors around 11 or 12 where the test accuracy is maximized. This is the point where the model generalizes well to unseen data without underfitting or overfitting.

    As we increase n_neighbors beyond the optimal value, test accuracy tends to stabilize or slightly decrease, indicating that the model may become overly biased and start to underfit.

    In this case, an n_neighbors value of 11 or 12 appears to be a good choice, as it provides the highest test accuracy and a reasonable training accuracy.

    5.0 Model Improvement

    By modifying the code to set the number of neighbors to 11, we see an improved accuracy score from 0.9561 using 5 neighbors, to 0.9824 using 11 neighbors. This change enhances the model’s ability to make accurate predictions based on the nearest data points in the feature space.
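
    The comparison can be sketched by fitting the same model twice and changing only n_neighbors. The split settings below are assumptions, so the scores printed may not match the figures quoted above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumed setup: bundled dataset, 80/20 split, fixed seed.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Test accuracy for the original setting (5) and the tuned setting (11).
scores = {}
for k in (5, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)

for k, acc in scores.items():
    print(f"n_neighbors={k}: test accuracy = {acc:.4f}")
```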

    6.0 Conclusion

    We began by delving into the Breast Cancer Wisconsin dataset during the Data Exploration step, gaining insights into its structure and uncovering distinctive patterns. Following this, we harnessed the K-Nearest Neighbors (KNN) algorithm, renowned for its simplicity and efficacy in classification tasks. Our model was trained on the training data and applied to make predictions on the test data.

    Next, we evaluated the model’s performance, an essential step to grasp its strengths and limitations. This assessment yielded an accuracy score of 95.61%. In our pursuit of optimization, we examined the model’s performance with different values of n_neighbors. Through this exploration, we determined that configuring n_neighbors to 11 was the most effective choice. Implementing this adjustment boosted the accuracy score from 95.61% to 98.24%.

    It is important to consider the model’s limitations, such as:

    • Model Selection: While KNN is used in this analysis, other machine learning algorithms could be more suitable for this specific task. A broader exploration of different models and techniques may yield better results, such as Logistic Regression, Decision Trees, Support Vector Machines (SVM), and Naive Bayes.
    • Feature Selection: The dataset contains a relatively large number of features (30 features) which were all used in the model. Feature selection or dimensionality reduction techniques may be necessary to identify the most relevant features for classification. High-dimensional data can lead to overfitting.
    • Generalization: While the K-Nearest Neighbors (KNN) model might perform well on this dataset, its generalization to other datasets or real-world clinical scenarios may be limited. It’s essential to evaluate models in various contexts to ensure their broader applicability.
    • Data Quality: Although this dataset is considered clean, real-world healthcare data can be noisy and contain missing values. Preprocessing healthcare data can be a challenging task, and the quality of the data can significantly impact model performance.
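
    As a starting point for the model-selection item above, the alternatives it names could be compared on the same split. This is only a rough sketch: the split settings, the mostly default hyperparameters, and max_iter=10000 for logistic regression are all assumptions, and a single train/test split gives only a coarse comparison.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumed setup: bundled dataset, 80/20 split, fixed seed.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate models named in the post, with mostly default settings.
models = {
    "KNN (11 neighbors)": KNeighborsClassifier(n_neighbors=11),
    "Logistic Regression": LogisticRegression(max_iter=10000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the training set and record its test accuracy.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)

for name, acc in results.items():
    print(f"{name}: {acc:.4f}")
```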