Diabetes Patient Re-Admission Prediction
Problem Statement:
The problem statement here is to identify whether a patient will come back for medication, based on the feature variables described below.
1. Data description and Hypothesis Generation
Below are the variables given to us for predicting if the Diabetic patient will be re-admitted to the hospital.
- We can see that there are about 15+ variables (counting the extra telemetry variables) that can be used for modelling.
- Variable encounter_id is an identifier column. It has a unique value for every sample in the data set and cannot be used for modelling.
- Variable ‘diabetesMed’ is the target (y) column. It has binary values, and we need to predict this variable given the 15+ feature variables.
HYPOTHESIS GENERATION
Simply put, a hypothesis is a possible view or assertion an analyst makes about the problem they are working on. It may or may not be true.
- Are older patients more likely to take medication as compared to younger patients?
- Does the patient's race affect the medication?
- Does weight of the patient affect the medication?
- Does admission type id affect the medication?
- Does discharge disposition id affect the medication?
- Does admission source id affect the medication?
- Does time in hospital affect the medication?
2. EDA
Before going to any kind of modelling, we will always want to have a look at the kind of data that we have.
We have been provided a single file:
Train.csv: We will use this file for training our model. It contains variables or features that we will input to our model, and the target variable that we want to predict.
We will split the data into a train set, a cross-validation set and a test set, and work with those.
Now let's go ahead and check the data we have.
Dataset Shape (Number of Samples and Variables in the dataset)
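A minimal sketch of loading the data and checking its shape, assuming the training file is named Train.csv and is read into a pandas dataframe called df:

```python
import pandas as pd

# Load the training data (file name Train.csv is taken from the description above)
df = pd.read_csv("Train.csv")

print(df.shape)    # (number of samples, number of variables)
print(df.dtypes)   # quick look at the type of each variable
```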
2(a). Target Distribution
This is a binary classification problem. Let's have a look at the number of positive and negative examples we have, or in our problem statement's terms: the ‘number of people who came back for medication, and the number of people who did not’.
Graphs help us communicate things more clearly, so let's visualize the same target distribution in a countplot.
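A short sketch of both the counts and the countplot, assuming the dataframe df and the target column diabetesMed from above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Share of positive vs. negative examples
print(df["diabetesMed"].value_counts(normalize=True))

# Countplot of the target distribution
sns.countplot(x="diabetesMed", data=df)
plt.title("Target distribution: diabetesMed")
plt.show()
```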
Quite obviously, the dataset is very imbalanced: about 89% of the examples are positive, and only 11% are negative.
Checking and Displaying the Unique values in each variable to better understand the data.
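One way to obtain this view is a per-column count of unique values; a minimal sketch assuming the dataframe df from above:

```python
# Number of unique values per variable, sorted so identifiers and
# high-cardinality columns stand out at the top
print(df.nunique().sort_values(ascending=False))
```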
As you can see, there are variables with more than 100 unique values and a few with far fewer.
Analyzing Each Variable & their relationships
There are two types of features that we have:
- Categorical
- Numerical
And also our target is Binary
For each feature type we will perform two types of analysis:
- Univariate: Analyze 1 feature at a time
- Bivariate: Analyze the relationship of that feature with target variable, i.e. ‘diabetesMed’
But before jumping straight into the analysis, let's have a look at the variables we have and try to ask ourselves some questions.
2(b). Analyzing Categorical Variables — Univariate Analysis
From the pandas dataframe, let's pick only the categorical variables and perform the analysis.
Univariate Analysis — Pie Charts.
Pie charts can be useful for seeing the proportion of samples that fall into each category of a categorical variable. We will make a pie chart for each of the categorical variables, as sketched below.
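A sketch of such pie charts, where the list of categorical column names is an assumption and should be adjusted to the actual dataset:

```python
import matplotlib.pyplot as plt

# Illustrative subset of categorical columns (names assumed from the description above)
categorical_cols = ["race", "gender", "age", "weight", "tel_1", "tel_2"]

for col in categorical_cols:
    counts = df[col].value_counts()
    plt.figure(figsize=(5, 5))
    plt.pie(counts.values, labels=counts.index, autopct="%1.1f%%")
    plt.title(f"Proportion of samples per category: {col}")
    plt.show()
```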
And there are a few more pie charts for the remaining categorical variables in the same fashion.
So below are the inferences we get from the pie charts:
- Race column has about 2% of records with ‘?’ values
- Weight column has about 98% of records with ‘?’ values
- tel_1 column has about 17% of records with ‘?’ values
- tel_2 column has about 55% of records with ‘?’ values
Pre-processing and transforming the data for a few columns, since the models need good data to produce good results.
As part of this step we are:
- Replacing ‘?’ with an ‘UNKNOWN’ value
- Dropping the variables that have more than 15% of their data as nulls (the weight and tel_2 variables), as sketched below
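A minimal sketch of this cleaning step, using the column names weight and tel_2 from the pie-chart inferences above:

```python
# Replace the '?' placeholder with an explicit 'UNKNOWN' category
df = df.replace("?", "UNKNOWN")

# Drop the variables with more than 15% missing values
df = df.drop(columns=["weight", "tel_2"])
```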
The data looks much neater now after cleansing, as shown in the figure below.
2(c). Analyzing Categorical Variables — Bivariate Analysis
Here let's perform the bivariate analysis using the target label as well, as sketched below.
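One way to do this is to look at the share of each target class within every category; a sketch assuming a small, illustrative subset of column names:

```python
import pandas as pd
import matplotlib.pyplot as plt

# For each categorical column, the proportion of diabetesMed == 0 / 1 per category
for col in ["race", "gender", "age"]:   # illustrative subset of columns
    ct = pd.crosstab(df[col], df["diabetesMed"], normalize="index")
    ct.plot(kind="bar", stacked=True)
    plt.ylabel("Proportion of patients")
    plt.title(f"diabetesMed split within each {col} category")
    plt.show()
```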
And there are a few more bar charts for the remaining categorical variables in the same fashion.
Now let's try to answer a few of the hypothesis questions:
Q: Are Older patients more likely to take medication as compared to younger patients?
A: From the above bar plots we see that patients above 50 are more prone to medication
Q: Does the patient's race affect the medication?
A: We see that African-American patients are affected the most, followed by the Asian population. Caucasians and Hispanics are not affected as much, per the insights we see (even though the volume of data for them is huge).
2(d). Analyzing Numerical Variables — Univariate Analysis
1. Univariate Analysis — Boxplots
Boxplots can be used to see the spread of the numerical variables and to identify outliers.
So here we will pick only the numerical variables for our analysis.
We build the box plots using the code below.
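A sketch of those boxplots, assuming the numerical columns are selected by dtype and the identifier column encounter_id is excluded:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pick the numerical columns, leaving out the identifier
numerical_cols = df.select_dtypes(include="number").columns.drop("encounter_id", errors="ignore")

for col in numerical_cols:
    plt.figure(figsize=(4, 4))
    sns.boxplot(y=df[col])
    plt.title(f"Boxplot of {col}")
    plt.show()
```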
And there are a few more box plots for the remaining numerical variables in the same fashion.
2(e). Analyzing Numerical Variables — Bivariate Analysis
Bivariate Analysis using Horizontal BarPlots
For each numerical variable, we will plot the median of that variable for:
- diabetesMed == 0
- diabetesMed == 1
We are choosing the median since it is not affected by outliers, and our data has a lot of outliers.
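A sketch of such plots, where the numerical column names used are purely illustrative:

```python
import matplotlib.pyplot as plt

# Median of each numerical variable, split by the target value
for col in ["time_in_hospital", "admission_type_id"]:   # illustrative column names
    medians = df.groupby("diabetesMed")[col].median()
    medians.plot(kind="barh")
    plt.xlabel(f"Median {col}")
    plt.ylabel("diabetesMed")
    plt.title(f"Median {col} for diabetesMed == 0 vs. == 1")
    plt.show()
```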
Let's try to answer a few more of the hypothesis questions:
Q: Does admission Type id affect the medication?
A: No it does not as per the insights from the graph
Q: Does discharge disposition id affect the medication?
A: No it does not as per the insights from the graph
Q: Does admission source id affect the medication?
A: No it does not as per the insights from the graph
Q: Does time in hospital affect the medication?
A: Yes, it does; the graph shows that people who spend more time in the hospital are more likely to come back for medication
3. Splitting the data and making the model ready
Here we will split the data into train, validation and test data sets, as shown in the code snippet below.
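A sketch of the split, where the 60/20/20 ratio and the stratification on the target are assumptions:

```python
from sklearn.model_selection import train_test_split

# Separate the features from the target, dropping the identifier column
X = df.drop(columns=["diabetesMed", "encounter_id"])
y = df["diabetesMed"]

# First carve out the test set, then split the remainder into train and cross-validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)

print(X_train.shape, X_cv.shape, X_test.shape)
```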
3(a). For Categorical Variables we will use One-Hot-Encoding.
One-hot encoding for the race column is shown below.
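A minimal sketch using scikit-learn's OneHotEncoder, fit on the train split only so the validation and test splits get the same encoding:

```python
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on the train data only, then transform all three splits
race_ohe = OneHotEncoder(handle_unknown="ignore")
race_ohe.fit(X_train[["race"]])

race_train = race_ohe.transform(X_train[["race"]])
race_cv    = race_ohe.transform(X_cv[["race"]])
race_test  = race_ohe.transform(X_test[["race"]])

print(race_train.shape, race_ohe.get_feature_names_out(["race"]))
```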
As shown in the above example, we follow the same method on all the remaining categorical variables such as ‘gender’, ‘age’, etc.
3(b). For Numerical Variables we will use Normalization.
Normalization for the admission_type_id column is shown below.
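A sketch assuming min-max normalization via scikit-learn's MinMaxScaler (the exact normalization used is an assumption), again fit on the train split only:

```python
from sklearn.preprocessing import MinMaxScaler

# Scale admission_type_id to the [0, 1] range using statistics from the train split only
adm_scaler = MinMaxScaler()
adm_train = adm_scaler.fit_transform(X_train[["admission_type_id"]])
adm_cv    = adm_scaler.transform(X_cv[["admission_type_id"]])
adm_test  = adm_scaler.transform(X_test[["admission_type_id"]])
```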
As shown in the above example we follow the same method on all the remaining Numerical variables.
Finally, we concatenate all the features using hstack, as sketched below.
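A sketch with only the two feature blocks built above; in practice every encoded categorical and scaled numerical block would be stacked:

```python
from scipy.sparse import hstack

# Stack the encoded and scaled feature blocks side by side into the final matrices
X_train_final = hstack((race_train, adm_train)).tocsr()
X_cv_final    = hstack((race_cv, adm_cv)).tocsr()
X_test_final  = hstack((race_test, adm_test)).tocsr()

print(X_train_final.shape, X_cv_final.shape, X_test_final.shape)
```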
So the next step is choosing the metric and doing hyper-parameter tuning for the chosen model.
4. Choosing the Metric for Classification Model
We will train our model with Decision Trees, Logistic Regression and SVM, and check how each of them performs on this binary classification problem.
We have chosen AUC-ROC score as a metric as it is good for binary classification problems.
What is the AUC-ROC curve?
The Receiver Operating Characteristic (ROC) curve is an evaluation metric for binary classification problems. It is a probability curve that plots the TPR against the FPR at various threshold values and essentially separates the ‘signal’ from the ‘noise’. The Area Under the Curve (AUC) measures the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. For more information, refer to this URL:
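A tiny, self-contained example of how the score and the curve points can be computed with scikit-learn (the labels and probabilities here are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Illustrative true labels and predicted probabilities of the positive class
y_true  = np.array([1, 0, 1, 1, 0, 1])
y_score = np.array([0.9, 0.3, 0.7, 0.6, 0.4, 0.8])

print("AUC:", roc_auc_score(y_true, y_score))
fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve (FPR vs. TPR)
```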
5. Decision Tree Classifier
We will first choose the Decision Tree classifier and train our model. But before training we need to perform hyper-parameter tuning.
This can be done in two ways: Grid Search or Random Search.
We have defined a couple of helper functions that help in searching for the best hyper-parameters. The detailed code is shown at the GitHub link.
5(a). Hyper-parameter Tuning
The code snippet for Grid Search is shown here. We are taking max_depth and min_samples_split as the hyper-parameters for the Decision Tree.
Calculating the ROC-AUC score using Grid Search on the train data, as sketched below.
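Rather than the helper functions mentioned above (see the GitHub link for those), here is a minimal sketch using scikit-learn's GridSearchCV directly; the grid values are an assumption:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Grid of hyper-parameters for the Decision Tree (values are illustrative)
param_grid = {
    "max_depth": [3, 5, 10, 20],
    "min_samples_split": [5, 10, 100, 500],
}

# Cross-validated grid search scored by ROC-AUC
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                    scoring="roc_auc", cv=3, return_train_score=True)
grid.fit(X_train_final, y_train)

print("Best params:", grid.best_params_)
print("Best cross-validated AUC:", grid.best_score_)
```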
Showcasing the AUC score using Heatmap (max_depth, min_samples_split)
5(b). AUC Plots
Selecting the best hyper-parameters, testing the performance of the model on the test data, and plotting the ROC curves.
We see that the best parameters with a good AUC score are:
max_depth_best= 5
min_samples_split_best=5
The AUC plots for the train and test data are shown below.
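A sketch of how those curves can be produced, refitting the tree with the best parameters reported above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# Refit with the best hyper-parameters found above
best_tree = DecisionTreeClassifier(max_depth=5, min_samples_split=5, random_state=42)
best_tree.fit(X_train_final, y_train)

# Plot train and test ROC curves on the same axes
for name, X_, y_ in [("Train", X_train_final, y_train), ("Test", X_test_final, y_test)]:
    probs = best_tree.predict_proba(X_)[:, 1]
    fpr, tpr, _ = roc_curve(y_, probs)
    plt.plot(fpr, tpr, label=f"{name} AUC = {roc_auc_score(y_, probs):.3f}")

plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```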
5(c). Confusion Matrix
Selecting the threshold, predicting the probabilities, and building the confusion matrix for the train and test data.
We will define our own helper functions here to predict based on threshold values.
And then plot the confusion matrix using the below code snippet.
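A sketch of a threshold-based prediction helper and the resulting confusion matrix on the test data (the 0.5 threshold is an assumption):

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def predict_with_threshold(probs, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels at a chosen threshold."""
    return (probs >= threshold).astype(int)

# Confusion matrix for the test data using the tree fitted above
test_probs = best_tree.predict_proba(X_test_final)[:, 1]
cm = confusion_matrix(y_test, predict_with_threshold(test_probs, 0.5))

sns.heatmap(cm, annot=True, fmt="d")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.show()
```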
6. Logistic Regression — Predictions
In the same fashion, we will train our model with a different classification algorithm: Logistic Regression.
But before that, we tune for the best hyper-parameters, which here are the penalty and alpha (the regularization strength).
Choosing the best params that give us a good AUC score, as sketched below.
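The penalty/alpha parameterization suggests an SGD-based logistic regression; a sketch under that assumption, with an illustrative grid of values:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

# Logistic regression via SGDClassifier with log loss (an assumption based on the
# penalty/alpha parameters used in this post); the loss is named "log" in older
# scikit-learn versions. Grid values are illustrative.
param_grid = {"alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1], "penalty": ["l1", "l2"]}

grid = GridSearchCV(SGDClassifier(loss="log_loss", random_state=42),
                    param_grid, scoring="roc_auc", cv=3)
grid.fit(X_train_final, y_train)

print("Best params:", grid.best_params_)
```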
6(a). AUC Plots
Testing the performance of the Logistic Regression model on the test data and plotting the ROC curves using the best hyper-parameters, the penalty and alpha values:
penalty='l2'
alpha=0.01
6(b). Confusion Matrix
In the same fashion we will use the same code snippet and plot the confusion matrix.
7. SVM — Predictions
We will also train our model with an SVM.
But before that, we tune for the best hyper-parameters, which here again are the penalty and alpha.
7(a). AUC Plots
Testing the performance of the SVM model on the test data and plotting the ROC curves using the best hyper-parameters, the penalty and alpha values:
penalty='l2'
alpha=0.0001
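A sketch of this step, assuming a linear SVM trained with SGDClassifier and hinge loss, using the best parameters reported above; since hinge loss gives no probabilities, the model is calibrated before computing AUC:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score

# Linear SVM via SGDClassifier with hinge loss (an assumption) and the best params above
svm = SGDClassifier(loss="hinge", penalty="l2", alpha=0.0001, random_state=42)

# Calibrate to obtain predicted probabilities for the ROC curve
svm_clf = CalibratedClassifierCV(svm, method="sigmoid", cv=3)
svm_clf.fit(X_train_final, y_train)

test_probs = svm_clf.predict_proba(X_test_final)[:, 1]
print("Test AUC:", roc_auc_score(y_test, test_probs))
```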
7(b). Confusion Matrix
In the same fashion we will use the same code snippet and plot the confusion matrix.
8. Final Summary
Finally, we summarize all the approaches using the different models below.
The full code for this post can be found on Github. I look forward to hearing any feedback or comment.
For a few other case studies, please refer to the URL below: