KNN of CKD

Author Information

Author: Zacks Shen
Github: https://github.com/ZacksAmber/KNN-of-CKD
Blog: https://zacks.one
LinkedIn: https://www.linkedin.com/in/zacks-shen/


Introduction

Chronic Kidney Disease

One in Seven American Adults Estimated to Have Chronic Kidney Disease

This is a Machine Learning project for predicting Chronic Kidney Disease (CKD) by using K-Nearest Neighbors (KNN).

The goal is to predict whether a person has CKD from two or three of the following attributes: Hemoglobin, Glucose, and White Blood Cell Count. A person with CKD has Class 1; a person without CKD has Class 0.


Package Introduction

The module I developed is KNN, which is included in the package datascientists. The version of datascientists must be greater than or equal to 0.0.5.
KNN can predict the Class, or any binary category coded as 1 and 0. It accepts at least 2 attributes (columns of your pandas DataFrame) and returns the following results:

  • Pandas DataFrame: The predicted class of test.
    • The validation method for the predictions.
  • Plotly graph object: A static 2-dimensional scatter plot of the train data, the predictions, or the Decision Boundary.
  • Plotly graph object: A static 3-dimensional scatter plot of the train data or the predictions.
  • Plotly graph object: An animated 2-dimensional scatter plot that shows how the KNN algorithm finds the nearest neighbors (k=1 or k>1).
  • Pandas DataFrame: The best k, ranked by the average accuracy of each k over repeated random train-test splits (100 repetitions by default), based on the user-specified attributes.

Dataset

ckd.csv

Original Dataset

Age Blood Pressure Specific Gravity Albumin Sugar Red Blood Cells Pus Cell Pus Cell clumps Bacteria Blood Glucose Random ... Packed Cell Volume White Blood Cell Count Red Blood Cell Count Hypertension Diabetes Mellitus Coronary Artery Disease Appetite Pedal Edema Anemia Class
0 48 70 1.005 4 0 normal abnormal present notpresent 117 ... 32 6700 3.9 yes no no poor yes yes 1
1 53 90 1.020 2 0 abnormal abnormal present notpresent 70 ... 29 12100 3.7 yes yes no poor no yes 1
2 63 70 1.010 3 0 abnormal abnormal present notpresent 380 ... 32 4500 3.8 yes yes no poor yes no 1
3 68 80 1.010 3 2 normal abnormal present present 157 ... 16 11000 2.6 yes yes yes poor yes no 1
4 61 80 1.015 2 0 abnormal abnormal notpresent notpresent 173 ... 24 9200 3.2 yes yes yes poor yes yes 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
153 55 80 1.020 0 0 normal normal notpresent notpresent 140 ... 47 6700 4.9 no no no good no no 0
154 42 70 1.025 0 0 normal normal notpresent notpresent 75 ... 54 7800 6.2 no no no good no no 0
155 12 80 1.020 0 0 normal normal notpresent notpresent 100 ... 49 6600 5.4 no no no good no no 0
156 17 60 1.025 0 0 normal normal notpresent notpresent 114 ... 51 7200 5.9 no no no good no no 0
157 58 80 1.025 0 0 normal normal notpresent notpresent 131 ... 53 6800 6.1 no no no good no no 0

Selected Columns and Standardized Dataset

     Hemoglobin    Glucose  White Blood Cell Count  Class
0     -0.865744  -0.221549               -0.569768      1
1     -1.457446  -0.947597                1.162684      1
2     -1.004968   3.841231               -1.275582      1
3     -2.814879   0.396364                0.809777      1
4     -2.083954   0.643529                0.232293      1
...         ...        ...                     ...    ...
153    0.700526   0.133751               -0.569768      0
154    0.978974  -0.870358               -0.216861      0
155    0.735332  -0.484162               -0.601850      0
156    0.178436  -0.267893               -0.409356      0
157    0.735332  -0.005280               -0.537686      0

The last step of data preprocessing is to randomly split the dataset ckd into two DataFrames of equal size: train and test.

We will use train to build the KNN model. The KNN module then predicts the class of each row in test based on that model. Because we already know the true class of every row in test, we can compare the class with the predicted class to evaluate the accuracy of the model.
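The preprocessing described above can be sketched as follows. This is a minimal illustration, not the code of the datascientists package: the helper names standardize and split_evenly are hypothetical, the use of the population standard deviation (ddof=0) is an assumption, and the small table is made up to stand in for ckd.csv.

```python
import pandas as pd


def standardize(df, columns):
    """Return a copy of df with the given columns converted to z-scores."""
    out = df.copy()
    for col in columns:
        # Assumption: population standard deviation (ddof=0)
        out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    return out


def split_evenly(df, seed=0):
    """Shuffle the rows, then cut the DataFrame into two halves."""
    shuffled = df.sample(frac=1, random_state=seed)
    half = len(shuffled) // 2
    return shuffled.iloc[:half], shuffled.iloc[half:]


# Made-up rows standing in for ckd.csv (values are illustrative only)
ckd_demo = pd.DataFrame({
    "Hemoglobin": [11.2, 9.5, 10.8, 15.0, 16.1, 14.3],
    "Glucose": [117.0, 70.0, 380.0, 140.0, 75.0, 100.0],
    "Class": [1, 1, 1, 0, 0, 0],
})

ckd_std = standardize(ckd_demo, ["Hemoglobin", "Glucose"])
train, test = split_evenly(ckd_std, seed=42)
print(train)
print(test)
```

Standardizing before the split keeps both attributes on the same scale, so neither one dominates the distance calculation that KNN relies on.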


Visualizations

2D Static Scatter

Hemoglobin vs. Glucose predicts the Class of CKD more accurately than White Blood Cell Count vs. Glucose, because the latter plot shows a less clean separation between the two classes.


3D Static Scatter

We already know which pair of attributes gives better accuracy, but what about using all three?

The 3D Scatter: KNN of CKD shows that using three attributes does not appear to be more accurate than Hemoglobin vs. Glucose. Therefore, to save resources, we will focus on Hemoglobin vs. Glucose.


2D Animated Scatter

Let’s see how module KNN works.

For k=1, KNN finds the single train point nearest to the test point and assigns that point's Class to the test point.

For k>1, KNN finds the k train points nearest to the test point and assigns the most frequent Class among them to the test point.
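The two rules above can be sketched in a few lines of plain Python. The function knn_predict and the sample points are hypothetical and only illustrate the idea; they are not the KNN module's own code. Ties (possible when k is even) are resolved here in favor of the class that appears first among the nearest neighbors, which is an arbitrary choice for this sketch.

```python
import math
from collections import Counter


def knn_predict(train_points, train_classes, point, k=1):
    """Classify `point` by majority vote among its k nearest train points."""
    # Pair each train point's Euclidean distance to `point` with its class,
    # then sort so the nearest neighbors come first
    distances = sorted(
        (math.dist(p, point), c) for p, c in zip(train_points, train_classes)
    )
    nearest_classes = [c for _, c in distances[:k]]
    # most_common(1) returns the class with the highest count
    return Counter(nearest_classes).most_common(1)[0][0]


# Made-up standardized (Hemoglobin, Glucose) pairs
train_points = [(-1.5, 0.9), (-2.1, 0.6), (-0.9, -0.2),
                (0.7, -0.4), (1.0, -0.9), (0.2, -0.3)]
train_classes = [1, 1, 1, 0, 0, 0]

print(knn_predict(train_points, train_classes, (0.5, -0.5), k=1))   # 0
print(knn_predict(train_points, train_classes, (-1.2, 0.4), k=3))   # 1
```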

For higher resolution, please run KNN-of-CKD.ipynb on your local machine.


2D Static Scatter with Decision Boundary

What if every point on the plot were a candidate for prediction?
Assume we have unlimited data from patients and the data vary widely. The Decision Boundary simulates this scenario.
The scatter plot shows both transparent dots and opaque dots.
The boundary between them is the Decision Boundary.
Any new point that falls closer to the blue side will be classified as CKD; conversely, any new point that falls closer to the gold side will be classified as healthy.
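One common way to draw such a boundary is to classify every point of a regular grid that covers the plot and color each grid point by its predicted class. The sketch below assumes that approach; knn_predict_grid, the grid range, and the sample points are hypothetical, and the KNN module may build its plot differently.

```python
import numpy as np


def knn_predict_grid(train_xy, train_cls, grid_xy, k=1):
    """Predict a 0/1 class for every grid point by k-nearest-neighbor vote."""
    # Distance from every grid point to every train point: (n_grid, n_train)
    diffs = grid_xy[:, None, :] - train_xy[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k nearest train points for each grid point
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_cls[nearest]
    # With classes coded 0/1, a mean above 0.5 means class 1 wins the vote
    # (an exact tie falls to class 0 in this sketch)
    return (votes.mean(axis=1) > 0.5).astype(int)


# Made-up standardized (Hemoglobin, Glucose) pairs
train_xy = np.array([[-1.5, 0.9], [-2.1, 0.6], [-0.9, -0.2],
                     [0.7, -0.4], [1.0, -0.9], [0.2, -0.3]])
train_cls = np.array([1, 1, 1, 0, 0, 0])

# Regular 25 x 25 grid over the plotting area
xs = np.linspace(-3, 3, 25)
ys = np.linspace(-3, 3, 25)
gx, gy = np.meshgrid(xs, ys)
grid_xy = np.column_stack([gx.ravel(), gy.ravel()])

grid_cls = knn_predict_grid(train_xy, train_cls, grid_xy, k=1)
print(grid_cls.reshape(gx.shape))
```

Plotting grid_xy with one color per value of grid_cls gives the two colored regions; the line where the color changes is the decision boundary.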


Predictions and Best k

Now we have the best attributes, and we know how KNN works.

But which k should we use to predict the Class?

KNN has a method best_k to help the user decide.

Let’s see the current predictions.

     Hemoglobin    Glucose  White Blood Cell Count  Class  Predicted Class
0     -0.030400  -0.376028                0.809777    0.0              0.0
1      0.213242  -0.484162               -0.569768    0.0              0.0
2     -3.685029   0.689873               -0.986840    1.0              1.0
3      0.700526  -0.406923                0.617283    0.0              0.0
4      1.292227  -0.175206               -0.473521    0.0              0.0
...         ...        ...                     ...    ...              ...
74    -1.283416  -0.221549                3.408455    1.0              1.0
75    -2.223178   1.910252                0.424788    1.0              1.0
76     0.282854  -0.298788                0.296458    0.0              0.0
77    -2.083954   0.643529                0.232293    1.0              1.0
78    -1.283416  -0.947597                3.344290    1.0              1.0

We can also calculate the accuracy of the model.

For each row in the DataFrame predictions, we check whether the Predicted Class column matches the Class column. We count the matches and divide that count by the number of rows in the DataFrame test. The result is the accuracy under the current train-test split.

The current Accuracy is 0.9746835443037974.
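As a small worked example of that calculation: the predictions above have 79 rows (index 0 to 78), so an accuracy of 0.9746835443037974 corresponds to 77 matching rows out of 79. The helper name accuracy below is hypothetical, and the two lists are constructed only to reproduce that ratio.

```python
def accuracy(actual, predicted):
    """Share of rows where the predicted class equals the actual class."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# 79 rows, of which 77 are predicted correctly
actual = [0] * 79
predicted = [0] * 77 + [1] * 2

print(accuracy(actual, predicted))  # about 0.9747
```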


So let’s see how KNN makes a prediction for each row.

The 1st row of the DataFrame test is the 1st row in the DataFrame predictions. The Class is 0, and the Predicted Class is 0, too.

     Hemoglobin    Glucose  White Blood Cell Count  Class
119     -0.0304  -0.376028                0.809777      0

The top k (k=5) nearest neighbors of the 1st row in test:

     Hemoglobin    Glucose  White Blood Cell Count  Class  distance
137    0.039212  -0.530506               -0.666015      0  0.169438
156    0.178436  -0.267893               -0.409356      0  0.235171
78    -0.065206  -0.623193                0.039798      0  0.249604
60     0.074018  -0.144310                0.328540      0  0.254158
120   -0.239236  -0.221549               -1.051005      0  0.259761

We can call the function _distance to get the top k nearest neighbors. All five neighbors have Class 0, so the Predicted Class of this point is 0.
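The distance column in the table above can be reproduced, up to rounding, as the Euclidean distance over Hemoglobin and Glucose only, which matches the decision to focus on those two attributes. The internals of _distance are not shown here, so the following is a check of the numbers rather than its implementation.

```python
import math

# (Hemoglobin, Glucose) of the 1st row in test (index 119)
test_point = (-0.0304, -0.376028)

# index: ((Hemoglobin, Glucose), Class, distance reported in the table)
neighbors = {
    137: ((0.039212, -0.530506), 0, 0.169438),
    156: ((0.178436, -0.267893), 0, 0.235171),
    78: ((-0.065206, -0.623193), 0, 0.249604),
    60: ((0.074018, -0.144310), 0, 0.254158),
    120: ((-0.239236, -0.221549), 0, 0.259761),
}

for index, (point, cls, reported) in neighbors.items():
    computed = math.dist(test_point, point)
    print(index, cls, round(computed, 6), reported)
```

Because the inputs are themselves rounded to six decimals, the last printed digit may differ slightly from the table.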


The predict function works well.

It’s time to calculate the best k.

The best_k method concatenates train and test in the KNN instance, so the instance holds the complete DataFrame ckd. It then splits ckd into train and test 100 times. For each repetition, it stores the accuracy obtained with each k. Finally, it returns the average accuracy for each k.

For example, I passed k = 5 to the best_k function, and it returned the average accuracy over 100 repetitions for each k from 1 to 5.

   k  Average Accuracy of 100 Bootstrap
0  1                           0.983671
2  3                           0.982025
1  2                           0.981139
4  5                           0.977975
3  4                           0.977468
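The repeated-split procedure behind this table can be sketched as follows. The function names, the even split, the tie rule, and the synthetic two-cluster data are assumptions made for illustration; this is not the best_k implementation from the datascientists package, and its numbers will not match the table above.

```python
import numpy as np


def knn_accuracy(train_xy, train_cls, test_xy, test_cls, k):
    """Accuracy of a k-nearest-neighbor vote on one train-test split."""
    diffs = test_xy[:, None, :] - train_xy[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Classes are coded 0/1; an exact tie falls to class 0 in this sketch
    predicted = (train_cls[nearest].mean(axis=1) > 0.5).astype(int)
    return float((predicted == test_cls).mean())


def best_k_sketch(xy, cls, max_k=5, repetitions=100, seed=0):
    """Average accuracy of k = 1..max_k over repeated random even splits."""
    rng = np.random.default_rng(seed)
    half = len(xy) // 2
    scores = {k: [] for k in range(1, max_k + 1)}
    for _ in range(repetitions):
        order = rng.permutation(len(xy))
        train_idx, test_idx = order[:half], order[half:]
        for k in scores:
            scores[k].append(knn_accuracy(
                xy[train_idx], cls[train_idx], xy[test_idx], cls[test_idx], k))
    averages = {k: float(np.mean(v)) for k, v in scores.items()}
    # Highest average accuracy first, as in the table above
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Synthetic, well-separated clusters standing in for the two classes
data_rng = np.random.default_rng(1)
xy = np.vstack([data_rng.normal(-1.5, 0.5, (40, 2)),
                data_rng.normal(1.5, 0.5, (40, 2))])
cls = np.array([1] * 40 + [0] * 40)

ranking = best_k_sketch(xy, cls, max_k=5, repetitions=20)
for k, avg in ranking:
    print(k, round(avg, 6))
```

Averaging over many random splits reduces the influence of any single lucky or unlucky split on the choice of k.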

Conclusion

With k=1 and the attributes Hemoglobin and Glucose, the KNN algorithm predicts whether a person has CKD with about 98% accuracy (an average of 0.983671 over the 100 random splits shown above).

The CKD dataset has a clear separation in Hemoglobin and Glucose. Thus we only need these two attributes with k=1 to make an accurate prediction. For more complicated datasets, we can use more than two attributes and let the KNN module find the best k for prediction.