DRUG MARKETING AND PHYSICIAN TARGETING!
OVERVIEW
Pharmaceutical companies play a major role in discovering, designing, and developing drugs used to prevent, treat, and cure many medical conditions.
With increasing competition among pharmaceutical companies and tight government regulations, it has become very difficult for companies to market their drugs and to get physicians to prescribe them once they are properly tested.
As a result, the role of sales and marketing in the pharmaceutical industry has become more critical in recent times.
BUSINESS PROBLEM:
A pharmaceutical company handles the research and development of a drug, while its medical representatives must identify the right physicians to market the drug to, which in turn increases the company's profits.
There are various factors to consider while targeting a physician and marketing the drug: brand, representative visits, physician affiliation, physician practice, programs attended, the physician's location, and so on.
BREAKING IT DOWN INTO A MACHINE LEARNING PROBLEM:
In this article we will discuss how a machine learning model can help analyze the data and predict the right physicians to target for drug marketing.
TABLE OF CONTENTS
1. Data description and Hypothesis generation
2. EDA
— a. Target Distribution
— b. Analyzing categorical variables — Univariate Analysis
— c. Analyzing categorical variables — Bivariate Analysis
— d. Analyzing numerical variables — Univariate Analysis
— e. Analyzing numerical variables — Bivariate Analysis
3. Choosing the metric for Multi-Class Classification Model
4. Splitting the data and Experimenting on multiple models
— a. Random Model
— b. Logistic Regression
— c. Random Forest
— d. LGBM
5. Feature Engineering
— a. Using PCA components
— b. Using Autoencoder features
6. Custom Ensembling
7. Final Conclusion/ Comparison of the Models
8. Model Deployment and Execution using Streamlit App on Heroku Cloud
9. Future enhancements and References
1. Data description and Hypothesis generation
The description table below lists the variables we will use to build the machine learning model. There are 31 variables in total, and Physician segment is the target label.
Source: This dataset is available here.
Hypothesis Generation
Simply put, a hypothesis is a possible view or assertion an analyst holds about the problem he or she is working on; it may or may not turn out to be true.
- Does the total number of visits by sales representatives help in predicting/identifying the right physician for drug sales?
- Does the number of samples dropped help in predicting/identifying the right physician?
- Does the number of saving cards dropped help in predicting/identifying the right physician?
- Does the number of vouchers dropped help in predicting/identifying the right physician?
- Does the number of seminars attended by a physician help in predicting/identifying the right physician?
- Does the physician's hospital affiliation help in predicting/identifying the right physician?
- We pose a whole lot of such questions in order to gather the right information, based on which we can proceed further.
2. Exploratory Data Analysis
Before doing any kind of modelling, we always want to have a look at the kind of data we have.
Now let's go ahead and load the data into a pandas dataframe.
Let's also check the dataset shape.
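A minimal sketch of loading and inspecting the data; the file name here is a placeholder for the actual dataset path.

```python
import pandas as pd

# Placeholder file name; substitute the actual dataset path.
train_df = pd.read_csv("physician_data.csv")
print(train_df.head())
print(train_df.shape)  # (number of rows, 31 columns)
```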
Target Distribution:
Let's check the target class label distribution in the train dataset.
From the figure below we can see the target distribution and infer that this is a multi-class target variable.
We observe that the ‘High’ and ‘Very High’ categories constitute the major part of the records (approximately 71%), while the ‘Low’ and ‘Medium’ segments constitute the remaining 29%.
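A quick numeric check of this class balance (the target column name follows the description table):

```python
# Proportion of records in each target segment.
print(train_df["Physician_segment"].value_counts(normalize=True))
```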
Analyzing Each Variable & their relationships
There are two types of features that we have:
- Categorical
- Numerical
Our target is also multi-class.
For each feature type we will perform two types of analysis:
- Univariate: Analyze 1 feature at a time
- Bivariate: Analyze the relationship of that feature with target variable, i.e. ‘Physician_segment’
But before jumping straight into the analysis, let's have a look at the variables and try to draw some inferences.
Categorical Features
Univariate Analysis — Pie Charts
Pie charts are useful for seeing the proportion of samples that fall into each category of a categorical variable. For each of the categorical variables we will make a pie chart based on the train data.
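Below is a sketch of how these pie charts can be generated; the column names are assumed from the inferences that follow.

```python
import matplotlib.pyplot as plt

# Categorical columns assumed from the EDA discussion below.
cat_cols = ["physician_gender", "physician_speciality", "year_quarter"]
for col in cat_cols:
    train_df[col].value_counts().plot(kind="pie", autopct="%1.0f%%", title=col)
    plt.ylabel("")  # drop the default y-axis label
    plt.show()
```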
Below are the inferences we get from the pie charts:
- Physician gender is almost evenly distributed.
- Physician_speciality is 78% nephrology, 6% urology, and 16% other.
- year_quarter is evenly distributed across all quarters.
Bivariate Analysis — Relationships with the Target
Here let's perform the bivariate analysis using the target label as well. In the code below I have segregated the categorical and numerical variables.
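One simple way to segregate the two variable types is by dtype; a sketch (the actual notebook may list the columns explicitly):

```python
# Object-typed columns are treated as categorical, the rest as numerical.
cat_features = train_df.select_dtypes(include="object").columns.tolist()
num_features = train_df.select_dtypes(include="number").columns.tolist()
```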
So let's try to answer a few hypothesis questions.
Q. Does gender impact the physician segment?
A. Yes. The ‘Very High’ and ‘High’ categories make up a larger percentage of the male population than of the female population, while ‘Medium’ and ‘Low’ constitute a larger share among female physicians.
Q. Does physician speciality impact the physician segment?
A. Yes. Physicians specializing in nephrology tend to prescribe more than those in the urology and other categories.
Q. Does year_quarter impact the physician segment?
A. No.
Numerical Features Analysis
As there are multiple numerical variables, we can divide them into the sub-sections below:
a. brand_prescribed (binary value column)
b. medical representative and seminar related columns
c. hospital affiliation and prescription for indication related columns
d. patient with insurance related columns
e. brand impressions, web visits related search columns
f. competitor prescriptions related columns
g. Locality related columns
h. physician age and tenure related columns
a. brand_prescribed (binary value column)
Let's take the first numerical variable, which holds binary data, and compare it against the class labels.
b. medical representative and seminar related columns
Let's take each of the medical-representative-related columns and analyze them against the class labels.
In the code snippet below we check the distribution using a distribution plot.
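A sketch of that distribution check; seaborn's distplot is deprecated in recent versions, so kdeplot is used here, with the column name assumed:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of representative visits for each target segment.
for seg in train_df["Physician_segment"].unique():
    mask = train_df["Physician_segment"] == seg
    sns.kdeplot(train_df.loc[mask, "total_representative_visits"], label=seg)
plt.legend()
plt.show()
```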
I have performed a similar analysis for the other medical-rep-related columns.
As the distributions look very peaked and right-skewed for a few columns, let's apply a log transform to check the distributions properly.
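A sketch of the log transform; log1p (log(1 + x)) is used so zero counts are handled gracefully, and the column names are assumed:

```python
import numpy as np

# Re-plot the skewed columns on a log scale.
for col in ["total_representative_visits", "total_sample_dropped"]:
    sns.kdeplot(np.log1p(train_df[col]), label=col)
plt.legend()
plt.show()
```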
The log transformation helped in viewing how the data is distributed in these 6 columns.
Additionally, let's check the percentile distribution of these columns individually and their influence on the target label.
Let's take the data up to the 90th percentile and plot a box plot to see where the median and IQR lie for that variable.
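A sketch of the percentile check and the 90th-percentile box plot (column name assumed):

```python
col = "total_representative_visits"
print(train_df[col].quantile([0.25, 0.5, 0.75, 0.9, 0.95, 0.99]))

# Keep only values up to the 90th percentile before plotting.
cutoff = train_df[col].quantile(0.9)
sns.boxplot(x=train_df.loc[train_df[col] <= cutoff, col])
plt.show()
```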
As you can see, this distribution (90% of the total data) has a median of 6, with the IQR lying between 3.5 and 10.
Let's now perform the bivariate analysis with box plots against the target label.
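Continuing from the previous snippet, the same 90th-percentile subset can be plotted against the target label:

```python
# Box plot of the variable per target segment, on the 90th-percentile subset.
subset = train_df[train_df[col] <= cutoff]
sns.boxplot(x="Physician_segment", y=col, data=subset)
plt.show()
```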
From this we can infer that for the ‘Very High’ and ‘High’ categories the median is > 6 and the IQR lies in a higher range, while for ‘Medium’ and ‘Low’ the median is <= 5 and the IQR range is certainly lower.
A similar analysis was done on the remaining variables.
STATISTICAL ANALYSIS
In addition, I have performed analysis using statistical methods, such as calculating the Pearson correlation coefficient between each variable and the target variable, which helps in figuring out how much a variable can impact the TARGET LABEL.
```python
# Reference: https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/
from numpy.random import seed
from scipy.stats import pearsonr

seed(1)
coef_val, p_val = pearsonr(data_total_representative_visits, tgt_data)
print('pearsons correlation coefficient: %.3f' % coef_val)

# interpret the significance
alpha_val = 0.05
if p_val > alpha_val:
    print('Samples are uncorrelated (fail to reject H0) p=%.3f' % p_val)
else:
    print('Samples are correlated (reject H0) p=%.3f' % p_val)
```
Once the analysis of each variable is done, we check the correlations among multiple variables and their influence on the target.
So let's try to answer a few more hypothesis questions.
Q. Does brand_prescribed impact the physician segment?
A. Yes. If the brand was prescribed in previous quarters, it is more likely that the physician will prescribe it in the next quarter.
Q. Does total_representative_visits impact the physician segment?
- Yes. From the distribution chart we see that if the number of representative visits is high, there is a strong chance that the physician will prescribe the medicine.
- In addition, I checked the percentile distribution, took the data below the 90th percentile, and plotted box plots; the target variable is clearly impacted for the ‘Very High’ and ‘High’ categories.
- There is also good correlation with the target label.
Q. Does total_sample_dropped impact the physician segment?
- Yes, it certainly has an impact, as we see the largest distributions for the ‘Very High’ and ‘High’ segments.
- The same can be inferred here as well: if more samples are dropped, there is a greater chance that the doctor will prescribe.
Q. Does saving_cards_dropped impact the physician segment?
- It does not make much of a difference, as the distribution is concentrated entirely at 0.
- More than 95% of the saving_cards_dropped values are 0, so we cannot draw much inference from it.
Q. Does vouchers_dropped impact the physician segment?
- It does not make much of a difference, as the distribution is concentrated entirely at 0.
- More than 95% of the vouchers_dropped values are 0, so we cannot draw much inference from it.
Q. Does total_seminar_as_attendee impact the physician segment?
- It does not make much of a difference, as the distribution is concentrated entirely at 0.
- More than 95% of the total_seminar_as_attendee values are 0, so we cannot draw much inference from it.
Q. Does total_seminar_as_speaker impact the physician segment?
- It does not make much of a difference, as the distribution is concentrated entirely at 0.
- More than 95% of the total_seminar_as_speaker values are 0, so we cannot draw much inference from it.
c. hospital affiliation and prescription for indication related columns
Let's take the physician affiliation and prescription-indicator columns and analyze them against the class label.
So let's try to answer a few more hypothesis questions.
Q. Does physician_hospital_affiliation impact the physician segment?
A. Yes. Interestingly, many physicians do not have hospital affiliations, and those are more likely to prescribe the medicines.
Q. Does physician_in_group_practice impact the physician segment?
A. Yes. From the distribution chart we see that a physician in a group practice is more likely to prescribe the medicine.
Let's again perform all the steps we saw before (a sketch of the correlation-matrix step follows the list):
- Pearson correlation coefficient calculation,
- correlation matrix,
- percentile check,
- and box plots against the 90th-percentile dataset.
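A sketch of the correlation-matrix step for the columns discussed next (column names taken from the text):

```python
# Pairwise correlations among the indication columns.
ind_cols = ["total_prescriptions_for_indication1",
            "total_prescriptions_for_indication2",
            "total_prescriptions_for_indication3"]
sns.heatmap(train_df[ind_cols].corr(), annot=True, cmap="coolwarm")
plt.show()
```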
So let's try to answer a few more hypothesis questions.
Q. How do total_prescriptions_for_indication1, total_prescriptions_for_indication2, and total_prescriptions_for_indication3 impact the physician segment?
- I plotted distributions for total_prescriptions_for_indication1 and total_prescriptions_for_indication3 and found a greater density for the ‘Very High’ and ‘High’ segments, and a lesser density for ‘Low’ and ‘Medium’.
- For total_prescriptions_for_indication2 we do not see any clear distribution to infer from.
- In addition, I checked the percentile distribution, took the data below the 90th percentile, and plotted box plots; the IQR range for the ‘Very High’ category sits above the other 3 categories. This surely supports using these fields in modeling.
- Moreover, total_prescriptions_for_indication1 is highly collinear with total_prescriptions_for_indication3, so during modeling we will remove total_prescriptions_for_indication3.
d. patient with insurance related columns
Let's take the insurance-related columns and analyze them against the class labels.
Let's again perform all the steps we saw before:
- distribution charts,
- Pearson correlation coefficient calculation,
- correlation matrix,
- percentile check,
- and box plots against the 90th-percentile dataset.
So let's try to answer a few more hypothesis questions.
Q. Does total_patient_with_commercial_insurance_plan impact the physician segment?
- Yes. We clearly see fatter/denser distributions for the ‘High’ and ‘Very High’ categories.
- In addition, I checked the percentile distribution, took the data below the 90th percentile, and plotted box plots; the target variable is clearly impacted for the ‘Very High’ and ‘High’ categories.
- The IQR range for the ‘Very High’ category sits above the other 3 categories, which supports using these fields in modeling.
- There is moderate correlation with the target label.
Q. Does total_patient_with_medicare_insurance_plan impact the physician segment?
- Yes. We clearly see fatter/denser distributions for the ‘Very High’ category.
- The same percentile and box plot checks show the target variable being impacted for the ‘Very High’ and ‘High’ categories, with the IQR range for ‘Very High’ above the other 3 categories. This supports using these fields in modeling.
- There is moderate correlation with the target label.
Q. Does total_patient_with_medicaid_insurance_plan impact the physician segment?
- There is moderate correlation with the target label.
- The percentile and box plot checks show the target variable being impacted only for the ‘Very High’ category; for the remaining categories it is almost the same.
- We see a larger number of outliers in this data.
e. brand impressions, web visits related search columns
Let's take the brand-impression and web-related columns and analyze them against the class labels.
Let's again perform all the steps we saw before:
- distribution charts,
- Pearson correlation coefficient calculation,
- correlation matrix,
- percentile check,
- and box plots against the 90th-percentile dataset.
- From the Pearson correlation coefficient calculations and the correlation matrix, we see that almost all of these fields have little or no correlation with the target variable.
- We also observe that ‘brand_enews_impressions’ and ‘brand_mobile_impressions’ are highly collinear, so we can ignore one of the two while modeling.
In the same way, analysis was performed on all the remaining columns one after another.
FINAL EDA CONCLUSIONS:
1. FROM CATEGORICAL VARIABLES EDA
Below are the categorical variables that show an impact.
Q. Does gender impact the physician segment?
A. Yes. The ‘Very High’ and ‘High’ categories make up a larger percentage of the male population than of the female population, while ‘Medium’ and ‘Low’ constitute a larger share among female physicians.
Q. Does physician speciality impact the physician segment?
A. Yes. Physicians specializing in nephrology tend to prescribe more than those in the urology and other categories.
2. FROM NUMERICAL VARIABLES EDA
Below are the numerical variables that show an impact.
Q. Does brand_prescribed impact the physician segment?
A. Yes. If the brand was prescribed in previous quarters, it is more likely that the physician will prescribe it in the next quarter.
Q. Does total_representative_visits impact the physician segment?
A. Yes. From the distribution chart we see that if the number of representative visits is high, there is a strong chance that the physician will prescribe the medicine.
Q. Does total_sample_dropped impact the physician segment?
A. Yes, it certainly has an impact, as we see the largest distributions for the ‘Very High’ and ‘High’ segments.
Q. Does physician_hospital_affiliation impact the physician segment?
A. Yes. Interestingly, many physicians do not have hospital affiliations, and those are more likely to prescribe the medicines.
Q. Does physician_in_group_practice impact the physician segment?
A. Yes. From the distribution chart we see that a physician in a group practice is more likely to prescribe the medicine.
Q. Do total_prescriptions_for_indication1, total_prescriptions_for_indication2, and total_prescriptions_for_indication3 impact the physician segment?
A. total_prescriptions_for_indication1 and total_prescriptions_for_indication3 definitely show a greater density for the ‘Very High’ and ‘High’ segments and a lesser density for ‘Low’ and ‘Medium’. For total_prescriptions_for_indication2 we do not see any clear distribution to infer from.
Q. Does total_patient_with_commercial_insurance_plan impact the physician segment?
A. Yes, we clearly see fatter distributions for the ‘High’ and ‘Very High’ categories.
Q. Does total_patient_with_medicare_insurance_plan impact the physician segment?
A. Yes, we clearly see fatter distributions for the ‘Very High’ category.
Q. Does total_patient_with_medicaid_insurance_plan impact the physician segment?
A. The percentile and box plot checks on the data below the 90th percentile show the target variable being impacted only for the ‘Very High’ category; for the remaining categories it is almost the same.
Q. Do the brand-search and web-search related columns impact the physician segment?
A. There are about 6 such variables; I checked PDFs, CDFs, box plots, percentiles, etc., but could not find any pattern, so no inference can be made.
Q. Does total_competitor_prescription impact the physician segment?
A. The distribution chart and violin plots clearly state that it is an important variable for the ‘Very High’ segment. The remaining segments are not impacted much.
Q. Does new_prescriptions impact the physician segment?
A. The distribution chart and violin plots clearly state that it is an important variable for the ‘Very High’ segment. The remaining segments are not impacted much. We also see that the IQR range for the ‘Very High’ category sits above the other 3 categories, which supports using these fields in modeling.
Q. Do the locality-related columns impact the physician segment?
A. No clear pattern could be found, so no inference can be made.
Q. Do the physician age and tenure related columns impact the physician segment?
A. No clear pattern could be found, so no inference can be made, except that physicians in the ‘Very High’ segment had slightly longer tenure and higher age.
So, out of the 31 variables in total, 2 categorical variables and 12 numerical variables look more important than the others.
3. Choosing the metric for Multi-Class Classification Model
The evaluation metric is multi-class log loss. We need to predict the class to which each physician belongs. Log loss is a good measure because it heavily penalizes predictions that are confidently (with high probability) wrong.
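For reference, over N observations and C classes, with y_ic = 1 if observation i belongs to class c (0 otherwise) and p_ic the predicted probability for that class, the multi-class log loss is:

$$\text{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{ic}\,\log(p_{ic})$$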
4. Splitting the data and Experimenting on multiple models
Before we start experimenting, I am going to split the data into 3 parts: train, cross-validation, and test datasets. We will use a stratified split so that the class distribution is the same across all 3 datasets.
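A sketch of the stratified 3-way split, assuming X holds the features and y the target label (the exact split ratios are assumptions):

```python
from sklearn.model_selection import train_test_split

# First carve out the test set, then split the remainder into train and CV.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_dev, y_dev, test_size=0.2, stratify=y_dev, random_state=42)
```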
a. Random Model
In this random model, we randomly generate one of 4 values for each observation. Let's say 1 corresponds to LOW, 2 to MEDIUM, 3 to HIGH, and 4 to VERY HIGH. We randomly assign these values as the predicted class labels and check the performance of the model.
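A sketch of such a random baseline, scored with multi-class log loss (the probability columns follow sklearn's sorted class labels, and all 4 classes are assumed present in y_test):

```python
import numpy as np
from sklearn.metrics import log_loss

# Assign each test observation a random probability over the 4 classes.
rng = np.random.default_rng(1)
probs = rng.random((len(y_test), 4))
probs /= probs.sum(axis=1, keepdims=True)  # each row must sum to 1
print("Random model log loss: %.3f" % log_loss(y_test, probs))
```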
Our next trained model should definitely perform better than this, meaning the log loss should be minimized and the misclassification percentage should also come down.
b. Logistic Regression
In this model, we tune the hyperparameter alpha and train the model with log loss as the loss function. We also use a CalibratedClassifierCV here, as it is convenient for predicting the probability of an observation belonging to each possible class.
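A sketch of that tuning loop, assuming an SGD-based logistic regression (older scikit-learn versions spell the loss "log" rather than "log_loss"; the alpha grid is illustrative):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

for alpha in [1e-4, 1e-3, 1e-2, 1e-1]:
    base = SGDClassifier(loss="log_loss", alpha=alpha, random_state=42)
    clf = CalibratedClassifierCV(base, method="sigmoid")  # calibrated probabilities
    clf.fit(X_train, y_train)
    print("alpha=%g, CV log loss: %.3f"
          % (alpha, log_loss(y_cv, clf.predict_proba(X_cv))))
```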
And yes, as shown below, we get a better-performing model: the share of misclassified points came down to 42%, and the log loss also came down.
The code can be found here.
I tried out an SVM as well but did not see much improvement.
c. Random Forest
Now let's try out tree-based models along with hyperparameter tuning. I picked the best hyperparameters via randomized search CV and trained the model.
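A sketch of the randomized search, with an illustrative parameter grid:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import log_loss

param_dist = {"n_estimators": [100, 300, 500],
              "max_depth": [5, 10, 20, None]}
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_dist, n_iter=5, scoring="neg_log_loss", cv=3)
search.fit(X_train, y_train)
print(search.best_params_)
print("CV log loss: %.3f"
      % log_loss(y_cv, search.best_estimator_.predict_proba(X_cv)))
```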
We can see that tree-based models perform better.
d. LGBM
In a similar fashion, I trained an LGBM model with hyperparameter tuning, chose the best hyperparameters, and retrained it.
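A sketch with illustrative hyperparameters (the actual best values came from tuning):

```python
import lightgbm as lgb
from sklearn.metrics import log_loss

lgbm = lgb.LGBMClassifier(objective="multiclass", n_estimators=500,
                          learning_rate=0.05, num_leaves=31, random_state=42)
lgbm.fit(X_train, y_train)
print("CV log loss: %.3f" % log_loss(y_cv, lgbm.predict_proba(X_cv)))
```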
Additionally, the confusion matrix, precision matrix, and recall matrix also look decent enough.
5. Feature Engineering
As part of feature engineering, I picked up:
a. the top 5 PCA components added to the existing dataset, then retrained the model
In the code below we add the 5 PCA components and train the model with these new features included.
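A sketch of appending the top 5 PCA components to the feature matrices (assuming X_train and X_cv are numeric arrays from the split above):

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA on train only, then transform both splits and append the components.
pca = PCA(n_components=5)
X_train_pca = np.hstack([X_train, pca.fit_transform(X_train)])
X_cv_pca = np.hstack([X_cv, pca.transform(X_cv)])
```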
b. the top 10 autoencoder features added to the existing dataset, then retrained the model
In the code below we add the 10 autoencoder features and train the model with these new features included.
The main use of an autoencoder is to learn a compressed representation of the data and then reconstruct the data from it. In the code snippet below we define the input shape and 2 levels of encoders with batch normalization, then a bottleneck of 10 features, and finally 2 levels of decoders to complete the architecture.
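A sketch of that architecture in Keras; the exact layer widths are assumptions, but the structure (2 encoder levels with batch normalization, a 10-unit bottleneck, 2 decoder levels) follows the description above:

```python
from tensorflow.keras.layers import Input, Dense, BatchNormalization
from tensorflow.keras.models import Model

n_inputs = X_train.shape[1]
inp = Input(shape=(n_inputs,))

# Two encoder levels, each followed by batch normalization.
e = Dense(n_inputs * 2, activation="relu")(inp)
e = BatchNormalization()(e)
e = Dense(n_inputs, activation="relu")(e)
e = BatchNormalization()(e)

bottleneck = Dense(10, activation="relu")(e)  # 10 bottleneck features

# Two decoder levels mirroring the encoder.
d = Dense(n_inputs, activation="relu")(bottleneck)
d = Dense(n_inputs * 2, activation="relu")(d)
out = Dense(n_inputs, activation="linear")(d)

model = Model(inputs=inp, outputs=out)
model.compile(optimizer="adam", loss="mse")
```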
Fitting the model:
```python
import matplotlib.pyplot as plt

# fit the autoencoder model to reconstruct its input
history = model.fit(X_train, X_train, epochs=50, batch_size=16, verbose=2,
                    validation_data=(X_cv, X_cv))

# plot the training and validation loss
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='cv')
plt.legend()
plt.show()
```
Then, in the code snippet below, I select only the 10 bottleneck features and use them for model training.
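A sketch of extracting the bottleneck features with an encoder-only model and appending them to the feature matrices:

```python
# Reuse the trained layers up to the bottleneck as a standalone encoder.
encoder = Model(inputs=inp, outputs=bottleneck)
X_train_ae = np.hstack([X_train, encoder.predict(X_train)])
X_cv_ae = np.hstack([X_cv, encoder.predict(X_cv)])
```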
Neither the PCA components nor the autoencoder features brought much performance improvement.
The regular LGBM model with hyperparameter tuning outperformed all the others.
6. Custom Ensembling
Here I have built a custom ensembling model. The figure below illustrates it, and a code sketch follows the steps.
STEP 1: Split into Train (80%) and Test (20%) datasets.
STEP 2: Break the Train set into 2 halves, D1 and D2.
STEP 3: Draw K points from D1 using random sampling with replacement. Here K can be 4000, 8000, etc.
STEP 4: Each sampled dataset is sent to one of multiple decision trees. The ensemble can have 500 trees, 1000 trees, 1500, etc.
STEP 5: Train the models.
STEP 6: Send the D2 dataset through the trained models to predict its labels (green flow).
STEP 7: These predictions act as inputs to a meta-classifier, which can be XGB or any other model (green flow).
STEP 8: Fit the meta-classifier on these predictions and obtain the output class label (green flow).
STEP 9: Send the held-out Test dataset through the models trained on D1 and fetch their predictions (brown flow).
STEP 10: Just as before, these predictions are sent to the meta-classifier as inputs, and an output class label is produced (brown flow).
STEP 11: Compare the log loss across multiple sets of models to determine the optimum number of models.
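A sketch of this pipeline; the sample size K, the number of trees, and the XGBoost meta-classifier mirror the steps above, and integer-encoded labels with numpy-array inputs are assumed:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

def build_custom_ensemble(X_d1, y_d1, X_d2, y_d2, n_models=500, k=4000, seed=42):
    """Train base trees on bootstrap samples of D1, then fit a meta-classifier
    on their predictions for D2 (assumes integer-encoded labels)."""
    rng = np.random.default_rng(seed)
    base_models, meta_inputs = [], np.zeros((len(X_d2), n_models))
    for m in range(n_models):
        idx = rng.integers(0, len(X_d1), size=k)  # sampling with replacement
        tree = DecisionTreeClassifier(random_state=m).fit(X_d1[idx], y_d1[idx])
        base_models.append(tree)
        meta_inputs[:, m] = tree.predict(X_d2)
    meta = XGBClassifier().fit(meta_inputs, y_d2)
    return base_models, meta
```

The held-out test set is scored the same way: each base model predicts, and those predictions feed the meta-classifier.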
From the log loss graphs below we can see that a larger random sample size reduces the log loss, while a small sample size does not give good performance.
For sample sizes of 8000 and 4000 the log loss is near 0.9, while for the others it is near 1.0.
7. Final Conclusion/ Comparison of the Models
We will compare the performance of each of the models I have created and see which model is the winner.
Comparing Custom Ensemble performance on various parameters.
From the above we see that, rather than the custom ensemble, the basic LGBM model without dropping any features gives the best result, with a test log loss of 0.81.
8. Model Deployment and Execution using Streamlit App on Heroku Cloud
We chose the LGBM model as the one to deploy. We performed one-hot encoding on the categorical variables, then applied a MinMaxScaler to the dataset since it contains multiple numerical variables, and finally generated a pickle file for the model.
The final prediction function, shown below, uses the respective pickle files to pre-process the raw data, transforms it, and finally predicts the class label.
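A sketch of what such a function can look like; the pickle file names are placeholders, and cat_features/num_features are the column lists defined during EDA:

```python
import pickle
import numpy as np

def predict_segment(raw_df):
    """Pre-process raw input with the saved encoder/scaler and predict."""
    encoder = pickle.load(open("onehot_encoder.pkl", "rb"))  # placeholder name
    scaler = pickle.load(open("minmax_scaler.pkl", "rb"))    # placeholder name
    model = pickle.load(open("lgbm_model.pkl", "rb"))        # placeholder name
    X_cat = encoder.transform(raw_df[cat_features]).toarray()
    X_num = scaler.transform(raw_df[num_features])
    X = np.hstack([X_num, X_cat])
    return model.predict(X)
```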
Finally, we have built and deployed the model and can predict the class for the data provided, so as to target the right physicians for drug marketing.
The model has been deployed using Streamlit on the Heroku cloud. You can try it out here.
The full execution video on Streamlit is at the link below.
9. Future enhancements and References
- For further improvement, we can collect more data and train better models.
- With more data, we could use advanced deep learning techniques to enhance model performance even further.
- We can improve both the log loss and the response latency.
REFERENCES:
- www.appliedaicourse.com
- https://machinelearningmastery.com/how-to-calculate-nonparametric-rank-correlation-in-python/
- https://dzone.com/articles/correlation-between-categorical-and-continuous-var-1
- https://machinelearningmastery.com/autoencoder-for-classification/
- www.stackoverflow.com
The full code for this post can be found on Github. I look forward to hearing any feedback or comments.
For a few other case studies, please refer to the URL below: