Can you predict the number of likes a YouTube video will get?
As YouTube has become one of the most popular video-sharing platforms, being a YouTuber has emerged as a new type of career in recent years. YouTubers earn money through advertising revenue from YouTube videos, sponsorships from companies, merchandise sales, and donations from their fans. To maintain a stable income, the popularity of their videos becomes the top priority.
Some of our friends are YouTubers or channel owners on other video-sharing platforms, which raised our interest in predicting how a video will perform. If creators have a preliminary prediction and understanding of their videos’ performance, they can adjust their videos to gain the most attention from the public.
For this task we have been provided with a few details about each video, along with some features.
PROBLEM STATEMENT:
Our problem statement is to accurately predict the number of likes for each video using the set of input variables.
1. Data description and Hypothesis Generation
Below are the variables given to us for predicting likes, which is the target variable.
Hypothesis Generation
Simply put, a hypothesis is a possible view or assertion of an analyst about the problem he or she is working on. It may or may not be true.
- Do videos with more views get more likes?
- Do videos with more comments get more likes?
- Do videos with more dislikes get fewer likes?
- Do longer videos get more likes than shorter videos?
- Do descriptive videos get more likes?
- Does the channel affect the number of likes?
- Does the country of origin affect the number of likes?
- Do people post more videos on weekends than on weekdays?
2. EDA
We start with Exploratory Data Analysis. A brief look at the Train and Test data is as follows.
The “train” data includes the number of likes for each YouTube video, but the “test” data does not. Our task is to predict the likes for each test record as accurately as possible from the variables given to us.
This is a regression problem rather than a classification problem, because the target value is continuous for every record.
3. Target Distribution
As noted, this is a regression problem. Let’s look at the ‘likes’ distribution.
The data is highly right-skewed.
Question: What can we do to change this distribution and make it more normal?
Answer: Transform the values to log values.
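The effect of the log transformation can be sketched as follows. This is a minimal illustration, not the original notebook code: the `train` DataFrame and its values are made up, and `np.log1p` (log of 1 + x) is assumed so that videos with zero likes do not break the transform.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed 'likes' values standing in for the real train data.
train = pd.DataFrame({"likes": [12, 40, 95, 310, 1200, 8700, 54000, 910000]})

# log1p computes log(1 + x), so a video with zero likes is handled safely.
train["log_likes"] = np.log1p(train["likes"])

# Skewness before and after the transform; a value closer to 0 is more symmetric.
skew_before = train["likes"].skew()
skew_after = train["log_likes"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The transform is reversible with `np.expm1`, which matters later when predictions made on the log scale have to be converted back to like counts.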
4. Analyzing each variable & their relationships.
There are 4 types of features that we have:
- Numerical
- Categorical
- Textual
- DateTime
Also, our target is continuous.
For each feature type we will perform two types of analysis:
- Univariate: Analyze 1 feature at a time.
- Bivariate: Analyze the relationship of that feature with target variable, i.e. ‘likes’.
4(a). Analyzing Numerical Variables
Assigning numerical variables to num_cols list.
Univariate Analysis
We apply a log transformation to the numerical variables, which makes their distributions easier to read, and plot the density plots.
Bivariate Analysis (for numerical columns)
For bivariate analysis, a correlation heatmap is a good fit.
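The numbers behind such a heatmap come from a correlation matrix. The sketch below computes one on synthetic data; the column names (`views`, `dislikes`, `comment_count`) and the generated values are assumptions for illustration, and the correlations it prints are not the ones reported in this post.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the train data (all values are made up).
rng = np.random.default_rng(0)
views = rng.lognormal(mean=10, sigma=1.5, size=500)
train = pd.DataFrame({
    "views": views,
    "likes": views * rng.uniform(0.01, 0.05, size=500),
    "dislikes": views * rng.uniform(0.001, 0.005, size=500),
    "comment_count": views * rng.uniform(0.002, 0.01, size=500),
})

num_cols = ["views", "dislikes", "comment_count"]  # assumed column names

# Correlate the log-transformed numerical columns with the log-transformed target.
corr = np.log1p(train[num_cols + ["likes"]]).corr()
print(corr["likes"].sort_values(ascending=False))
```

The resulting `corr` DataFrame can be passed directly to a heatmap plotting function.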
Now, let’s try answering the hypotheses.
1. Do videos with more views get more likes?
Yes, they do. We have a high correlation of 0.65, and the plot between the two variables shows this as well.
2. Do videos with more comments get more likes?
Yes, they do. We have a high correlation of 0.73, and the plot between the two variables shows this as well.
3. Do videos with more dislikes get fewer likes?
No. Any form of popularity is good popularity: as the number of dislikes increases, the number of views increases too, and so does the number of likes.
4. Do longer videos get more likes than shorter videos?
We don’t have the data to answer this question.
We should try to collect more data, and also see what other features could be helpful.
4(b). Analyzing Categorical Variables
In the same way, we will do univariate and bivariate analysis of the categorical variables.
Univariate Analysis
Pie Charts
Assigning categorical variables to cat_cols list.
We plot pie charts to see the distribution of the categorical data.
Bivariate Analysis:
We perform bivariate analysis on the categorical columns by finding the country-wise number of videos for each channel.
Fetching Top 10 channels for each country
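One way to get the top channels per country is a grouped count followed by a per-group `head`. This is a sketch under assumptions: `channel_title` is a guessed column name, and the rows are invented.

```python
import pandas as pd

# Invented rows; 'channel_title' is an assumed column name.
train = pd.DataFrame({
    "country_code": ["IN", "IN", "IN", "GB", "GB", "US", "US", "US", "US"],
    "channel_title": ["A", "A", "B", "C", "C", "D", "E", "E", "E"],
})

top_n = 10

# Count videos per (country, channel), then sort so the busiest channels come first.
counts = (
    train.groupby(["country_code", "channel_title"])
    .size()
    .rename("n_videos")
    .reset_index()
    .sort_values(["country_code", "n_videos"], ascending=[True, False])
)

# Keep at most top_n channels for each country.
top_channels = counts.groupby("country_code").head(top_n)
print(top_channels)
```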
Multivariate Analysis:
Here we analyze more than two variables at once. This time the analysis also includes the target variable, likes.
Country Wise Likes for Channel
Mean Likes Per Country
Based on the above information, we also calculated the mean likes per country.
Question
Does the country affect the number of likes?
It looks like videos posted in England have a higher average number of likes than videos posted in India.
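Mean likes per country is a single `groupby`. The data in this sketch is invented purely to show the mechanics and says nothing about the real country averages.

```python
import pandas as pd

# Invented values, only to demonstrate the aggregation.
train = pd.DataFrame({
    "country_code": ["IN", "IN", "GB", "GB", "US", "CA"],
    "likes": [100, 300, 1000, 3000, 500, 700],
})

# Average likes per country, highest first.
mean_likes = (
    train.groupby("country_code")["likes"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_likes)
```

Because the likes distribution is heavily right-skewed, it can be worth comparing the median as well (`.median()` in place of `.mean()`), since a few viral videos can dominate a country's mean.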
4(c). Analyzing Date Variables
In a similar fashion, let’s analyze the date variables.
Minimum and Maximum Date
Value Counts of Videos Year Wise
From the above information we see that the number of videos is higher after 2017. Let’s see the pattern here.
Let’s dissect the data further by comparing across countries.
Mean Number of Likes by Country sorted by Date
Let’s answer another hypothesis question:
Do people post more videos on weekends than on weekdays?
It looks like our hypothesis is incorrect in the context of the current data. Most of the videos were published on Friday, while Saturday and Sunday saw the fewest videos published.
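A weekday breakdown like this can be produced with the pandas datetime accessor. The sketch assumes a `publish_date` column (the name is a guess) and uses a handful of made-up dates.

```python
import pandas as pd

# Made-up publish dates; 'publish_date' is an assumed column name.
train = pd.DataFrame({
    "publish_date": ["2017-12-01", "2017-12-02", "2017-12-08",
                     "2018-01-03", "2018-01-05"],
})
train["publish_date"] = pd.to_datetime(train["publish_date"])

# Name of the weekday each video was published on.
train["day_of_week"] = train["publish_date"].dt.day_name()

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
train["is_weekend"] = train["publish_date"].dt.dayofweek >= 5

videos_per_day = train["day_of_week"].value_counts()
print(videos_per_day)
print("weekend share:", train["is_weekend"].mean())
```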
4(d). Analyzing Text Variables
Just as we defined lists of numerical and categorical variables, we prepare a list of text variables as well.
Do descriptive videos get more likes?
Let’s try to answer this question using the correlation heatmaps.
Answer: In this data, a shorter title and a longer description are associated with a higher chance of getting likes.
5. Defining the metric for Regression Model
RMSLE(Root Mean Squared Log Error)
Simply put, RMSLE is the Root Mean Squared Error computed on the logarithms of the actual and predicted values.
RMSLE has three useful properties: it is robust to outliers, it measures the relative error between the predicted and actual values, and, most distinctively, it penalizes underestimation of the actual value more severely than overestimation.
Read More about RMSLE:
We will use an RMSLE function to check the score of each model we train, and try to bring it down as we move on.
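A plain RMSLE function might look like the sketch below. It is not the author's exact function, so the values it returns may be on a different scale from the scores quoted in this post. The small example at the end illustrates the asymmetry: missing a true value of 1000 by 500 costs more when the prediction is too low than when it is too high.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Log Error; assumes both inputs are non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), so zero counts are handled without errors.
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Same absolute error (500) in both cases, but different penalties.
under = rmsle([1000], [500])   # prediction below the actual value
over = rmsle([1000], [1500])   # prediction above the actual value
print(f"underestimate: {under:.3f}, overestimate: {over:.3f}")
```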
6. Model building — Using Numerical+Categorical Variables
In this section we add only the numerical and categorical variables.
Let’s define all of them in the format below:
We concatenate the Train and Test data in order to apply the transformations to all required variables.
Numerical variables: we log-transform the numerical columns.
Categorical variables: we use get_dummies(), which converts categorical data into dummy (indicator) variables.
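The reason for concatenating first is that `get_dummies()` only creates columns for categories it actually sees, so encoding Train and Test separately could give them different columns. A minimal sketch with invented rows:

```python
import pandas as pd

# Invented rows; the test set has no 'likes' column.
train = pd.DataFrame({
    "category_id": [10, 24, 10],
    "country_code": ["IN", "GB", "US"],
    "likes": [5, 50, 500],
})
test = pd.DataFrame({
    "category_id": [24, 1],
    "country_code": ["CA", "IN"],
})

cat_cols = ["category_id", "country_code"]

# Stack train and test so both end up with the same dummy columns.
combined = pd.concat([train, test], axis=0, ignore_index=True)

# One indicator column per unique value of each categorical column.
combined = pd.get_dummies(combined, columns=cat_cols)
print(combined.columns.tolist())
```

Categories that appear only in the test rows (here `category_id` 1 and `country_code` CA) still get their own columns, and `likes` is simply missing for the test rows.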
Splitting Train data into Train and Validation set and predicting on Test data
In the first step, we split the combined data back into Train and Test.
In the second step, we split Train into Train and Validation sets on an 80:20 basis.
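The two splitting steps can be sketched as follows. The sketch assumes the train rows come first in the combined frame and uses scikit-learn's `train_test_split`; the data and the `random_state` value are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder combined frame: 100 train rows followed by 20 test rows.
n_train, n_test = 100, 20
combined = pd.DataFrame({"views": np.arange(n_train + n_test, dtype=float)})
combined["likes"] = [float(i) for i in range(n_train)] + [np.nan] * n_test

# Step 1: split the combined data back into train and test by position.
train_proc = combined.iloc[:n_train]
test_proc = combined.iloc[n_train:].drop(columns="likes")

# Step 2: hold out 20% of train as a validation set.
X = train_proc.drop(columns="likes")
y = train_proc["likes"]
X_trn, X_val, y_trn, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_trn), len(X_val), len(test_proc))
```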
Next, we check that the categorical variables were properly converted to indicator (dummy) variables.
We see that category_id (in green) has been expanded based on its number of unique values, and country_code (in blue) has likewise been expanded based on the number of unique values within it, as shown in the picture.
In this step we collect the features from X_trn, leaving out the text and date variables.
We build a model using Linear Regression and predict the number of likes using just the numerical and categorical columns.
After fitting on X_trn (train), we predict on X_val (the validation dataset) and check the score with the rmsle function, which is 873.06.
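A baseline of this kind might be sketched as below. Everything here is an assumption for illustration: the features are synthetic, the model is assumed to be fit on the log-transformed target and converted back with `np.expm1`, and the `rmsle` helper is a plain version that will not reproduce the score scale reported in this post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def rmsle(y_true, y_pred):
    # Plain RMSLE; not necessarily scaled like the scores in this post.
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Synthetic stand-in for log-transformed numeric features and a 'likes' target.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))
log_likes = 5 + X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.2, size=400)
y = np.expm1(log_likes)

X_trn, X_val, y_trn, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the log scale, then map predictions back to like counts.
model = LinearRegression()
model.fit(X_trn, np.log1p(y_trn))
val_pred = np.expm1(model.predict(X_val))

score = rmsle(y_val, val_pred)
print(f"validation RMSLE: {score:.3f}")
```

Fitting on `log1p(likes)` means the model minimizes squared error on the same scale that RMSLE measures, which is one reason the log transform of the target pairs naturally with this metric.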
We then use the same model to predict on the test data and download the file.
Checking the result set (base model) shows how many likes were predicted for each video_id: 619 likes for the first, 2248 likes for the second, and so on.
7. Model building — Using Numerical+Categorical+Date Variables
Now, let’s do some feature engineering using the date variables, add them to the model, and predict.
Based on the publish date, we extract the day of the week, month, and year.
We then add all the variables except the text variables and predict.
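The date features described above can be extracted with the pandas `.dt` accessor. The column name `publish_date` and the new feature names are assumptions made for this sketch.

```python
import pandas as pd

# Two made-up publish dates; 'publish_date' is an assumed column name.
df = pd.DataFrame({
    "publish_date": pd.to_datetime(["2017-12-01", "2018-01-03"]),
})

# dayofweek is numeric: Monday = 0 through Sunday = 6.
df["publish_dayofweek"] = df["publish_date"].dt.dayofweek
df["publish_month"] = df["publish_date"].dt.month
df["publish_year"] = df["publish_date"].dt.year
print(df)
```

The raw datetime column itself is usually dropped before modeling, since most regressors expect numeric inputs.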
This time we use the Gradient Boosting Regressor to predict, with the best hyperparameters found after tuning.
The usage is shown here, while the internal code is provided in the GitHub link.
The function used here also reports the top 20 most useful features for predicting the target.
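As a rough sketch of this step, scikit-learn's `GradientBoostingRegressor` exposes `feature_importances_`, which can be sorted to list the most useful features. The hyperparameters below are placeholders rather than the tuned values used in this post, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data in which f0 matters most, then f1; f2..f4 are noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=[f"f{i}" for i in range(5)])
y = 3 * X["f0"] + X["f1"] + rng.normal(scale=0.1, size=300)

# Placeholder hyperparameters, not the tuned ones from the post.
gbr = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=3, random_state=0
)
gbr.fit(X, y)

# Rank features by importance and keep (up to) the top 20.
importances = pd.Series(gbr.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False).head(20)
print(top_features)
```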
As you can see, this time the RMSLE score came down to 665.08, which is a good improvement.
8. Model building — Using Numerical+Categorical+Date+Text Variables
Finally, let’s add the text variables as well and predict.
We will use the most basic method, Bag of Words (BOW), on all textual variables. The example uses the ‘description’ variable for the BOW approach.
In the same pattern, we use BOW on the other text variables, ‘tags’ and ‘title’.
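A Bag of Words representation can be built with scikit-learn's `CountVectorizer`, as in the sketch below. The example descriptions, the `max_features` cap, and the `desc_` column prefix are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up video descriptions.
descriptions = pd.Series([
    "Official music video",
    "Funny cat video compilation",
    "Music live concert video",
])

# Each distinct word becomes a column; each cell counts its occurrences.
vectorizer = CountVectorizer(max_features=1000)
bow = vectorizer.fit_transform(descriptions)

# Turn the sparse matrix into named columns, ordered by vocabulary index.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
desc_features = pd.DataFrame(
    bow.toarray(), columns=[f"desc_{w}" for w in words]
)
print(desc_features)
```

Prefixing the columns keeps words from ‘description’ distinct from the same words in ‘tags’ or ‘title’ when the three sets of features are joined to the rest of the data.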
As the figure shows, the score has taken a big leap: the RMSLE came down to 512.02. The figure also lists the top 20 features.
9. Final Summary
Finally, we summarize all the approaches, using different models and adding each set of variables one by one.
The full code for this post can be found on GitHub. I look forward to hearing any feedback or comments.
For a few other case studies, please refer to the URL below: