Diabetes Patient Re-Admission Prediction
Problem Statement:
The problem statement here is to identify whether a patient will come back for medication, based on the feature variables described below.
1. Data description and Hypothesis Generation
Below are the variables given to us for predicting if the Diabetic patient will be re-admitted to the hospital.
- We can see that there are about 15+ variables (counting the extra telemetry variables) that can be used for modelling.
- Variable encounter_id is an identifier column. It has a unique value for every sample in the data set and cannot be used for modelling.
- Variable ‘diabetesMed’ is the target (y) column. It has binary values, and we need to predict this variable given the 15+ feature variables.
HYPOTHESIS GENERATION
Simply put, a hypothesis is a possible view or assertion an analyst makes about the problem they are working on. It may or may not be true.
- Are older patients more likely to take medication as compared to younger patients?
- Does the patient's race affect the medication?
- Does weight of the patient affect the medication?
- Does admission type id affect the medication?
- Does discharge disposition id affect the medication?
- Does admission source id affect the medication?
- Does time in hospital affect the medication?
2. EDA
Before going to any kind of modelling, we will always want to have a look at the kind of data that we have.
We have been provided a single file:
Train.csv: We will use this file for training our model. It contains variables or features that we will input to our model, and the target variable that we want to predict.
We will split the data into a train set, a cross-validation set and a test set, and work with those.
Now let's go ahead and check the data we have.
Dataset Shape (Number of Samples and Variables in the dataset)
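A minimal sketch of loading the data and checking its shape, assuming the training file is named Train.csv and is read into a pandas dataframe called df:

```python
import pandas as pd

# Load the training data (file name Train.csv is taken from the description above)
df = pd.read_csv("Train.csv")

print(df.shape)    # (number of samples, number of variables)
print(df.dtypes)   # quick look at the type of each variable
```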
2(a). Target Distribution
This is a binary classification problem. Let's have a look at the number of positive and negative examples we have, or in our problem statement's terms: the ‘number of people who came back for medication, and the number of people who did not’.
Graphs help us communicate things more clearly, so let's visualize the same target distribution in a countplot.
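A short sketch of both the counts and the countplot, assuming the dataframe df and the target column diabetesMed from above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Share of positive vs. negative examples
print(df["diabetesMed"].value_counts(normalize=True))

# Countplot of the target distribution
sns.countplot(x="diabetesMed", data=df)
plt.title("Target distribution: diabetesMed")
plt.show()
```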
Quite obviously, the dataset is very imbalanced: about 89% of the examples are positive, and only 11% are negative.
Checking and Displaying the Unique values in each variable to better understand the data.
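One way to obtain this view is a per-column count of unique values; a minimal sketch assuming the dataframe df from above:

```python
# Number of unique values per variable, sorted so identifiers and
# high-cardinality columns stand out at the top
print(df.nunique().sort_values(ascending=False))
```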
As you can see, there are variables with more than 100 unique values and a few with far fewer.
Analyzing Each Variable & their relationships
There are two types of features that we have:
- Categorical
- Numerical
And also our target is Binary
For each feature type we will perform two types of analysis:
- Univariate: Analyze 1 feature at a time
- Bivariate: Analyze the relationship of that feature with target variable, i.e. ‘diabetesMed’
But before jumping straight into the analysis, let's have a look at the variables we have and try to ask ourselves some questions.
2(b). Analyzing Categorical Variables — Univariate Analysis
From the pandas dataframe, let's pick only the categorical variables and perform the analysis.
Univariate Analysis — Pie Charts.
Pie charts can be useful for seeing the proportion of samples that fall into each category of a categorical variable. We will make a pie chart for each of the categorical variables, as sketched below.
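A sketch of such pie charts, where the list of categorical column names is an assumption and should be adjusted to the actual dataset:

```python
import matplotlib.pyplot as plt

# Illustrative subset of categorical columns (names assumed from the description above)
categorical_cols = ["race", "gender", "age", "weight", "tel_1", "tel_2"]

for col in categorical_cols:
    counts = df[col].value_counts()
    plt.figure(figsize=(5, 5))
    plt.pie(counts.values, labels=counts.index, autopct="%1.1f%%")
    plt.title(f"Proportion of samples per category: {col}")
    plt.show()
```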
And there are a few more pie charts for the remaining categorical variables in the same fashion.
So below are the inferences we get from the pie charts:
- Race column has about 2% of records with ‘?’ values
- Weight column has about 98% of records with ‘?’ values
- tel_1 column has about 17% of records with ‘?’ values
- tel_2 column has about 55% of records with ‘?’ values
Pre-processing and transforming the data for a few columns, since the models need good data to produce good results.
As part of this step we are:
- Replacing ‘?’ with an ‘UNKNOWN’ value
- Dropping the variables that have more than 15% of their data as nulls (the weight and tel_2 variables), as sketched below
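A minimal sketch of this cleaning step, using the column names weight and tel_2 from the pie-chart inferences above:

```python
# Replace the '?' placeholder with an explicit 'UNKNOWN' category
df = df.replace("?", "UNKNOWN")

# Drop the variables with more than 15% missing values
df = df.drop(columns=["weight", "tel_2"])
```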
The data looks much neater now after cleansing, as shown in the figure below.
2(c). Analyzing Categorical Variables — Bivariate Analysis
Here let's perform the bivariate analysis using the target label as well, as sketched below.
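One way to do this is to look at the share of each target class within every category; a sketch assuming a small, illustrative subset of column names:

```python
import pandas as pd
import matplotlib.pyplot as plt

# For each categorical column, the proportion of diabetesMed == 0 / 1 per category
for col in ["race", "gender", "age"]:   # illustrative subset of columns
    ct = pd.crosstab(df[col], df["diabetesMed"], normalize="index")
    ct.plot(kind="bar", stacked=True)
    plt.ylabel("Proportion of patients")
    plt.title(f"diabetesMed split within each {col} category")
    plt.show()
```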
And there are a few more bar charts for the remaining categorical variables in the same fashion.
Now let's try to answer a few of the hypothesis questions:
Q: Are Older patients more likely to take medication as compared to younger patients?
A: From the above bar plots we see that patients above 50 are more prone to medication
Q: Does the patient's race affect the medication?
A: We see that African-American patients are affected the most, followed by the Asian population. Caucasians and Hispanics are not affected as much, per the insights we see (even though the volume of data for them is huge).
2(d). Analyzing Numerical Variables — Univariate Analysis
1. Univariate Analysis — Boxplots
Boxplots can be used to see the spread of the numerical variables and to identify outliers.
So here we will pick only the numerical variables for our analysis.
We build the box plots using the code below.
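A sketch of those boxplots, assuming the numerical columns are selected by dtype and the identifier column encounter_id is excluded:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pick the numerical columns, leaving out the identifier
numerical_cols = df.select_dtypes(include="number").columns.drop("encounter_id", errors="ignore")

for col in numerical_cols:
    plt.figure(figsize=(4, 4))
    sns.boxplot(y=df[col])
    plt.title(f"Boxplot of {col}")
    plt.show()
```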
And there are a few more box plots for the remaining numerical variables in the same fashion.
2(e). Analyzing Numerical Variables — Bivariate Analysis
Bivariate Analysis using Horizontal BarPlots
For each numerical variable, we will plot the median of that variable for:
- diabetesMed == 0
- diabetesMed == 1
We are choosing the median since it is not affected by outliers, and our data has a lot of outliers.
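A sketch of such plots, where the numerical column names used are purely illustrative:

```python
import matplotlib.pyplot as plt

# Median of each numerical variable, split by the target value
for col in ["time_in_hospital", "admission_type_id"]:   # illustrative column names
    medians = df.groupby("diabetesMed")[col].median()
    medians.plot(kind="barh")
    plt.xlabel(f"Median {col}")
    plt.ylabel("diabetesMed")
    plt.title(f"Median {col} for diabetesMed == 0 vs. == 1")
    plt.show()
```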
Let's try to answer a few more of the hypothesis questions:
Q: Does admission Type id affect the medication?
A: No it does not as per the insights from the graph
Q: Does discharge disposition id affect the medication?
A: No it does not as per the insights from the graph
Q: Does admission source id affect the medication?
A: No it does not as per the insights from the graph
Q: Does time in hospital affect the medication?
A: Yes, it does; the graph shows that people who spend more time in the hospital are more likely to come back for medication
3. Splitting the data and making the model ready
Here we will split the data into train, validation and test data sets, as shown in the code snippet below.
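A sketch of the split, where the 60/20/20 ratio and the stratification on the target are assumptions:

```python
from sklearn.model_selection import train_test_split

# Separate the features from the target, dropping the identifier column
X = df.drop(columns=["diabetesMed", "encounter_id"])
y = df["diabetesMed"]

# First carve out the test set, then split the remainder into train and cross-validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)

print(X_train.shape, X_cv.shape, X_test.shape)
```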
3(a). For Categorical Variables we will use One-Hot-Encoding.
One-hot encoding for the race column is shown below.
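A minimal sketch using scikit-learn's OneHotEncoder, fit on the train split only so the validation and test splits get the same encoding:

```python
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on the train data only, then transform all three splits
race_ohe = OneHotEncoder(handle_unknown="ignore")
race_ohe.fit(X_train[["race"]])

race_train = race_ohe.transform(X_train[["race"]])
race_cv    = race_ohe.transform(X_cv[["race"]])
race_test  = race_ohe.transform(X_test[["race"]])

print(race_train.shape, race_ohe.get_feature_names_out(["race"]))
```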
As shown in the above example, we follow the same method on all the remaining categorical variables such as ‘gender’, ‘age’, etc.
3(b). For Numerical Variables we will use Normalization.
Normalization for the admission_type_id column is shown below.
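A sketch assuming min-max normalization via scikit-learn's MinMaxScaler (the exact normalization used is an assumption), again fit on the train split only:

```python
from sklearn.preprocessing import MinMaxScaler

# Scale admission_type_id to the [0, 1] range using statistics from the train split only
adm_scaler = MinMaxScaler()
adm_train = adm_scaler.fit_transform(X_train[["admission_type_id"]])
adm_cv    = adm_scaler.transform(X_cv[["admission_type_id"]])
adm_test  = adm_scaler.transform(X_test[["admission_type_id"]])
```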
As shown in the above example we follow the same method on all the remaining Numerical variables.
Finally, we concatenate all the features using hstack, as sketched below.
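A sketch with only the two feature blocks built above; in practice every encoded categorical and scaled numerical block would be stacked:

```python
from scipy.sparse import hstack

# Stack the encoded and scaled feature blocks side by side into the final matrices
X_train_final = hstack((race_train, adm_train)).tocsr()
X_cv_final    = hstack((race_cv, adm_cv)).tocsr()
X_test_final  = hstack((race_test, adm_test)).tocsr()

print(X_train_final.shape, X_cv_final.shape, X_test_final.shape)
```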
So the next step is choosing the metric and doing hyper-parameter tuning for the chosen model.
4. Choosing the Metric for Classification Model
We will train our model with Decision Trees, Logistic Regression and SVM, and check how each of them performs on this binary classification problem.
We have chosen AUC-ROC score as a metric as it is good for binary classification problems.
What is the AUC-ROC curve?
The Receiver Operating Characteristic (ROC) curve is an evaluation metric for binary classification problems. It is a probability curve that plots the TPR against the FPR at various threshold values and essentially separates the ‘signal’ from the ‘noise’. The Area Under the Curve (AUC) measures the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. For more information, refer to this URL:
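A tiny, self-contained example of how the score and the curve points can be computed with scikit-learn (the labels and probabilities here are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Illustrative true labels and predicted probabilities of the positive class
y_true  = np.array([1, 0, 1, 1, 0, 1])
y_score = np.array([0.9, 0.3, 0.7, 0.6, 0.4, 0.8])

print("AUC:", roc_auc_score(y_true, y_score))
fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve (FPR vs. TPR)
```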
5. Decision Tree Classifier
We will first choose the Decision Tree classifier and train our model. But before training we need to perform hyper-parameter tuning.
This can be done in two ways: Grid Search or Random Search.
We have defined a couple of helper functions that help in searching for the best hyper-parameters. The detailed code is shown at the GitHub link.
5(a). Hyper-parameter Tuning
The code snippet for Grid Search is shown here. We are taking max_depth and min_samples_split as the hyper-parameters for the Decision Tree.
Calculating the ROC-AUC score using Grid Search on the train data, as sketched below.
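Rather than the helper functions mentioned above (see the GitHub link for those), here is a minimal sketch using scikit-learn's GridSearchCV directly; the grid values are an assumption:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Grid of hyper-parameters for the Decision Tree (values are illustrative)
param_grid = {
    "max_depth": [3, 5, 10, 20],
    "min_samples_split": [5, 10, 100, 500],
}

# Cross-validated grid search scored by ROC-AUC
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                    scoring="roc_auc", cv=3, return_train_score=True)
grid.fit(X_train_final, y_train)

print("Best params:", grid.best_params_)
print("Best cross-validated AUC:", grid.best_score_)
```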
Showcasing the AUC score using Heatmap (max_depth, min_samples_split)
5(b). AUC Plots
Selecting the best hyper-parameters, testing the performance of the model on the test data, and plotting the ROC curves.
We see that the best parameters with a good AUC score are:
max_depth_best= 5
min_samples_split_best=5
The AUC plots for the train and test data are shown below.
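A sketch of how those curves can be produced, refitting the tree with the best parameters reported above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# Refit with the best hyper-parameters found above
best_tree = DecisionTreeClassifier(max_depth=5, min_samples_split=5, random_state=42)
best_tree.fit(X_train_final, y_train)

# Plot train and test ROC curves on the same axes
for name, X_, y_ in [("Train", X_train_final, y_train), ("Test", X_test_final, y_test)]:
    probs = best_tree.predict_proba(X_)[:, 1]
    fpr, tpr, _ = roc_curve(y_, probs)
    plt.plot(fpr, tpr, label=f"{name} AUC = {roc_auc_score(y_, probs):.3f}")

plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```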
5(c). Confusion Matrix
Selecting the threshold, predicting the probabilities, and building the confusion matrix for the train and test data.
We will define our own helper functions here to predict based on threshold values.
And then plot the confusion matrix using the below code snippet.
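A sketch of a threshold-based prediction helper and the resulting confusion matrix on the test data (the 0.5 threshold is an assumption):

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def predict_with_threshold(probs, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels at a chosen threshold."""
    return (probs >= threshold).astype(int)

# Confusion matrix for the test data using the tree fitted above
test_probs = best_tree.predict_proba(X_test_final)[:, 1]
cm = confusion_matrix(y_test, predict_with_threshold(test_probs, 0.5))

sns.heatmap(cm, annot=True, fmt="d")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.show()
```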
6. Logistic Regression — Predictions
In the same fashion, we will train our model with a different classification algorithm: Logistic Regression.
But before that, we tune for the best hyper-parameters, which here are the penalty and alpha (the regularization strength).
Choosing the best params that give us a good AUC score, as sketched below.
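The penalty/alpha parameterization suggests an SGD-based logistic regression; a sketch under that assumption, with an illustrative grid of values:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier

# Logistic regression via SGDClassifier with log loss (an assumption based on the
# penalty/alpha parameters used in this post); the loss is named "log" in older
# scikit-learn versions. Grid values are illustrative.
param_grid = {"alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1], "penalty": ["l1", "l2"]}

grid = GridSearchCV(SGDClassifier(loss="log_loss", random_state=42),
                    param_grid, scoring="roc_auc", cv=3)
grid.fit(X_train_final, y_train)

print("Best params:", grid.best_params_)
```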
6(a). AUC Plots
Testing the performance of the Logistic Regression model on the test data and plotting the ROC curves using the best hyper-parameters, the penalty and alpha values:
penalty='l2'
alpha=0.01
6(b). Confusion Matrix
In the same fashion we will use the same code snippet and plot the confusion matrix.
7. SVM — Predictions
We will also train our model with an SVM.
But before that, we tune for the best hyper-parameters, which here again are the penalty and alpha.
7(a). AUC Plots
Testing the performance of the SVM model on the test data and plotting the ROC curves using the best hyper-parameters, the penalty and alpha values:
penalty='l2'
alpha=0.0001
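A sketch of this step, assuming a linear SVM trained with SGDClassifier and hinge loss, using the best parameters reported above; since hinge loss gives no probabilities, the model is calibrated before computing AUC:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score

# Linear SVM via SGDClassifier with hinge loss (an assumption) and the best params above
svm = SGDClassifier(loss="hinge", penalty="l2", alpha=0.0001, random_state=42)

# Calibrate to obtain predicted probabilities for the ROC curve
svm_clf = CalibratedClassifierCV(svm, method="sigmoid", cv=3)
svm_clf.fit(X_train_final, y_train)

test_probs = svm_clf.predict_proba(X_test_final)[:, 1]
print("Test AUC:", roc_auc_score(y_test, test_probs))
```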
7(b). Confusion Matrix
In the same fashion we will use the same code snippet and plot the confusion matrix.
8. Final Summary
Finally, we summarize all the approaches using the different models below.
The full code for this post can be found on Github. I look forward to hearing any feedback or comment.
For a few other case studies, please refer to the URL below: