Can you predict the number of likes a “YOUTUBE” video will get?

Nov 6, 2020 · 11 min read

As YouTube has become one of the most popular video-sharing platforms, being a YouTuber has emerged as a new type of career in recent years. YouTubers earn money through advertising revenue from YouTube videos, sponsorships from companies, merchandise sales, and donations from their fans. To maintain a stable income, the popularity of their videos becomes the top priority for YouTubers.

Meanwhile, some of our friends are YouTubers or channel owners on other video-sharing platforms. This raised our interest in predicting how a video will perform. If creators can get a preliminary prediction and understanding of their videos’ performance, they can adjust their videos to gain the most attention from the public.

Here we have been provided with a few details about each video, along with some features.

PROBLEM STATEMENT:

Our problem statement is to accurately predict the number of likes for each video using the set of input variables.

1. Data description and Hypothesis Generation

Below are the variables given to us for predicting likes, which is the target variable.

Hypothesis Generation

Simply put, a hypothesis is a possible view or assertion of an analyst about the problem he or she is working on. It may or may not be true.

Do videos with more views get more likes?
Do videos with more comments get more likes?
Do videos with more dislikes get fewer likes?
Do longer videos get more likes than shorter videos?
Do descriptive videos get more likes?
Does the channel affect the number of likes?
Does the country of origin affect the number of likes?
Do people post more videos on weekends than weekdays?

2. EDA

We start with Exploratory Data Analysis. A brief look at the Train and Test data follows.

As you can see, the “train” data includes the number of likes for each YouTube video, but the “test” data does not. Our task is to accurately predict the likes for each record from the variables given to us.

This is a regression problem, not a classification problem, because the target value is continuous for every record.

3. Target Distribution

This is a regression problem. Let’s look at the ‘likes’ distribution.

Question

The data is highly right-skewed.

What can we do to change this distribution and make it closer to normal?

Answer: Apply a log transformation to the values.
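As a rough illustration of that answer, here is a minimal sketch using made-up numbers in place of the real ‘likes’ column. The use of np.log1p (rather than a plain log) is an assumption on our side; it keeps videos with zero likes finite.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the target; the values are invented for illustration only.
likes = pd.Series([3, 10, 25, 40, 120, 900, 15000, 480000], name="likes")

# log1p computes log(1 + x), so a video with zero likes does not produce -inf.
log_likes = np.log1p(likes)

# Skewness closer to zero means a more symmetric distribution.
print("skew before:", round(likes.skew(), 2))
print("skew after: ", round(log_likes.skew(), 2))

# expm1 is the inverse of log1p; it is needed later to turn
# predictions made in log space back into a number of likes.
recovered = np.expm1(log_likes)
```

If the model is trained on the transformed target, its predictions have to be passed through np.expm1 before they are reported as likes.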

4. Analyzing each variable & their relationships.

There are 4 types of features that we have:

Numerical
Categorical
Textual
DateTime

Our target is also continuous.

For each feature type we will perform two types of analysis:

Univariate: Analyze 1 feature at a time.
Bivariate: Analyze the relationship of that feature with target variable, i.e. ‘likes’.

4(a). Analyzing Numerical Variables

Assigning numerical variables to num_cols list.

Univariate Analysis

We apply a log transformation to the numerical variables, which makes their distributions easier to read, and then plot density plots.

Log Transforming the numerical variables

code snippet for Density plots on numerical variables

Bivariate Analysis (for numerical columns)

For Bivariate analysis, Correlation HeatMaps are apt.
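To make the two steps concrete, here is a small sketch that log-transforms a num_cols list and computes the correlation matrix a heatmap would display. The column names (views, dislikes, comment_count) and all values are assumptions for illustration; the real dataset may use different names.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and toy values, not the competition data.
train = pd.DataFrame({
    "views":         [1200, 54000, 310, 980000, 15000, 7600],
    "dislikes":      [4, 210, 0, 5300, 35, 12],
    "comment_count": [15, 830, 2, 41000, 260, 95],
    "likes":         [90, 4100, 12, 120000, 1300, 410],
})
num_cols = ["views", "dislikes", "comment_count"]

# Log-transform the numerical features together with the target.
logged = np.log1p(train[num_cols + ["likes"]])

# Pearson correlation matrix; a heatmap simply colours this table.
corr = logged.corr()
print(corr["likes"].sort_values(ascending=False))
```

A plotting library can then draw the density plots from `logged` and the heatmap from `corr`.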

Now, let’s try answering the hypotheses.

1. Do videos with more views get more likes?

Yes, they do. The correlation is high (0.65), and the plot of the two variables shows the same relationship.

2. Do videos with more comments get more likes?

Yes, they do. The correlation is high (0.73), and the plot of the two variables shows the same relationship.

3. Do videos with more dislikes get fewer likes?

Not in this data. Any form of popularity is good popularity: as the number of dislikes increases, the number of views increases too, and so does the number of likes.

4. Do longer videos get more likes than shorter videos?

We don’t have the data to answer this question.

We should try to collect more data and see what other features could be helpful.

4(b). Analyzing Categorical Variables

In the same way, we will do univariate and bivariate analysis of the categorical variables.

Univariate Analysis

Pie Charts

Assigning categorical variables to cat_cols list.

We use the code snippet below to plot pie charts of the categorical data distribution.

Bivariate Analysis:

We will perform bivariate analysis on the categorical columns by finding the number of videos per channel in each country. Below is the code snippet.

Fetching Top 10 channels for each country
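One possible way to get the top channels per country with pandas is sketched below. The column names country_code and channel_title are assumptions (only country_code is named in this article), and the rows are invented.

```python
import pandas as pd

# Hypothetical columns and toy rows for illustration.
train = pd.DataFrame({
    "country_code":  ["CA", "CA", "CA", "IN", "IN", "IN", "IN"],
    "channel_title": ["A", "A", "B", "C", "C", "C", "D"],
})

top_n = 10

# Count videos per (country, channel). value_counts sorts the counts in
# descending order within each country, so head(top_n) per country keeps
# the channels with the most videos.
top_channels = (
    train.groupby("country_code")["channel_title"]
         .value_counts()
         .groupby(level="country_code")
         .head(top_n)
)
print(top_channels)
```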

Multivariate Analysis:

Here we will analyze more than two variables at once. This time the analysis also includes the target variable, likes.

Country Wise Likes for Channel

Country and Channel wise — Likes (Canada)

Country and Channel wise — Likes (India)

Mean Likes Per Country

Based on the above information we even calculated the mean likes per country.
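The mean likes per country can be computed with a single groupby. This is a sketch on invented numbers, assuming the columns are called country_code and likes.

```python
import pandas as pd

# Toy data; the values are made up and do not reflect the real dataset.
train = pd.DataFrame({
    "country_code": ["CA", "CA", "GB", "GB", "IN", "IN"],
    "likes":        [100, 300, 900, 1100, 50, 150],
})

# Average likes per country, highest first.
mean_likes = (
    train.groupby("country_code")["likes"]
         .mean()
         .sort_values(ascending=False)
)
print(mean_likes)
```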

Question

Does a country affect the number of likes?

It looks like videos posted in Great Britain have a higher average number of likes than videos posted in India.

4(c). Analyzing Date Variables

In a similar fashion, let’s analyze the date variables.

Minimum and Maximum Date

Value Counts of Videos Year Wise

From the above information we see that the number of videos is higher after 2017. Let’s look at the pattern here.

Let’s dissect the data further by comparing countries.

Mean Number of Likes by Country sorted by Date

Mean likes by country for Canada and Great Britain

Let’s answer another hypothesis question:

Do people post more videos on weekends than weekdays?

It looks like our hypothesis is incorrect in the context of the current data. Most of the videos were published on Friday, while Saturday and Sunday saw the fewest videos published.
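A check like this can be done by counting videos per day of the week. The sketch below assumes a publish_date column (a hypothetical name) and uses a handful of invented dates.

```python
import pandas as pd

# Hypothetical column name and toy dates.
train = pd.DataFrame({
    "publish_date": ["2017-12-01", "2017-12-01", "2017-12-02",
                     "2017-12-04", "2017-12-08"],
})
train["publish_date"] = pd.to_datetime(train["publish_date"])

# Number of videos published on each day of the week.
day_counts = train["publish_date"].dt.day_name().value_counts()
print(day_counts)

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
is_weekend = train["publish_date"].dt.dayofweek >= 5
print("weekend share:", is_weekend.mean())
```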

4(d). Analyzing Text Variables

Just as we defined lists of numerical and categorical variables, we prepare a list of text variables as well.

Do descriptive videos get more likes?

Let’s try to answer this question using correlation heatmaps.

Answer: In this data, videos with a short title and a long description tend to get more likes.
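One simple way to turn text into something a correlation heatmap can use is to measure its length. The sketch below assumes text columns named title and description and creates hypothetical title_len and description_len features; the rows are invented purely to show the mechanics.

```python
import pandas as pd

# Toy rows; a missing description is included on purpose.
train = pd.DataFrame({
    "title": ["Short title",
              "A much, much longer video title here",
              "Mid title ok"],
    "description": ["Long description " * 20, None, "Some text " * 5],
    "likes": [5000, 40, 700],
})
text_cols = ["title", "description"]

# Character length of each text field; missing text counts as length 0.
for col in text_cols:
    train[col + "_len"] = train[col].fillna("").str.len()

# Correlation of the length features with the target.
corr = train[["title_len", "description_len", "likes"]].corr()
print(corr["likes"])
```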

5. Defining the metric for Regression Model

RMSLE (Root Mean Squared Log Error)

Simply put, RMSLE is the root mean squared error computed on the logarithms of the actual and predicted values.

RMSLE has three useful properties: it is robust to outliers, it measures the relative error between the predicted and actual values, and, most distinctively, it penalizes underestimation of the actual value more severely than overestimation.

Read more about RMSLE: ASHRAE - Great Energy Predictor III, “How much energy will a building consume?” (www.kaggle.com)

We will use the code snippet/function below to check the RMSLE score of each model we train, and we will try to bring it down as we move on.
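For reference, here is a minimal sketch of an RMSLE function written with NumPy. It follows the textbook definition; the function used for the scores quoted in this article may differ, for example in how it scales the result. Clipping negative predictions to zero is our own assumption.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared error between log1p(actual) and log1p(predicted)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # A regression model can predict below zero; clip so the log stays defined.
    y_pred = np.clip(y_pred, 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Same absolute error of 400 likes, in opposite directions.
under = rmsle([1000], [600])    # prediction too low
over = rmsle([1000], [1400])    # prediction too high
print("under:", under, "over:", over)
```

The underestimate receives the larger score even though both predictions miss by 400 likes, which is the asymmetry described above.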

6. Model building — Using Numerical+Categorical Variables

In this section we build a model using only the numerical and categorical variables.

Let’s define all of them in the format below:

We concatenate the Train and Test data in order to apply the transformations to all required variables.

Numerical Variables: we log-transform the numerical columns.

Categorical Variables: we will use pandas get_dummies(), which converts categorical data into dummy (indicator) variables.
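The preprocessing described above could look roughly like the sketch below. The toy frames and the views column are assumptions; category_id and country_code are the categorical columns mentioned in this article.

```python
import numpy as np
import pandas as pd

# Toy frames; the test set has no 'likes' column.
train = pd.DataFrame({
    "views": [100, 2000, 50],
    "category_id": [10, 24, 10],
    "country_code": ["CA", "IN", "GB"],
    "likes": [5, 150, 2],
})
test = pd.DataFrame({
    "views": [700, 30],
    "category_id": [24, 22],
    "country_code": ["IN", "US"],
})

num_cols = ["views"]
cat_cols = ["category_id", "country_code"]

# Stack train and test so both end up with identical dummy columns,
# even for categories that appear in only one of them.
df = pd.concat([train, test], axis=0, ignore_index=True)

# Log-transform the numerical columns.
df[num_cols] = np.log1p(df[num_cols])

# One indicator column per unique value of each categorical column.
df = pd.get_dummies(df, columns=cat_cols)

# Split the combined frame back into its two parts.
n_train = len(train)
train_proc = df.iloc[:n_train].copy()
test_proc = df.iloc[n_train:].drop(columns="likes")
print(train_proc.columns.tolist())
```

Encoding the two sets separately would give the test set a category_id_22 column that the training set lacks, which is why the frames are concatenated first.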

Splitting Train data into Train and Validation set and predicting on Test data

In the first step, we split the combined data back into Train and Test using the code below.

In the second step, we split Train into Train and Validation sets on an 80:20 basis.

Let’s check that the categorical variables were properly converted to indicator (dummy) variables.

We see that category_id (in green) has been expanded into one column per unique value, and country_code (in blue) has likewise been expanded based on its unique values, as shown in the picture.

In the above step we collect the features from X_trn and drop the text and date variables.

We build a model using Linear Regression and predict the number of likes using just the numerical and categorical columns.

After fitting the model on X_trn (train), we predict on X_val (the validation dataset) and then check the score with the rmsle function, which gives 873.06.
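The split, fit and validate loop can be sketched with scikit-learn as follows. The data here is synthetic, so the score it prints has no relation to the 873.06 quoted above; the names y_trn and y_val and the random_state are our own choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the processed training set.
rng = np.random.default_rng(0)
n = 200
views = rng.integers(100, 1_000_000, size=n)
comments = (views * rng.uniform(0.001, 0.01, size=n)).astype(int)
likes = (views * rng.uniform(0.01, 0.05, size=n)).astype(int)

X = pd.DataFrame({
    "views": np.log1p(views),
    "comment_count": np.log1p(comments),
})
y = np.log1p(likes)  # train on the log-transformed target

# 80:20 split into training and validation sets.
X_trn, X_val, y_trn, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_trn, y_trn)

# Predictions are in log space. Because the target is log1p(likes),
# the RMSE computed here equals the RMSLE on the original scale.
preds_log = model.predict(X_val)
score = float(np.sqrt(np.mean((preds_log - y_val) ** 2)))
print("validation RMSLE:", score)

# Convert back to a number of likes for the submission file.
preds_likes = np.expm1(preds_log)
```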

We then use the same model to predict on the test data, as shown below, and download the file.

Checking the result set (base model) shows how many likes were predicted for each video_id: 619 likes for the first video, 2248 for the second, and so on.

7. Model building — Using Numerical+Categorical+Date Variables

Now, let’s do some feature engineering using the date variables, add them to the model, and predict.

Based on the publish date, we will extract the day of week, month, and year.
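Extracting these parts is straightforward with the pandas .dt accessor. The column name publish_date and the names of the new feature columns are hypothetical.

```python
import pandas as pd

# Toy dates under an assumed column name.
df = pd.DataFrame({"publish_date": ["2017-12-01", "2018-03-17"]})
df["publish_date"] = pd.to_datetime(df["publish_date"])

# Day of week is 0 for Monday through 6 for Sunday.
df["publish_dayofweek"] = df["publish_date"].dt.dayofweek
df["publish_month"] = df["publish_date"].dt.month
df["publish_year"] = df["publish_date"].dt.year
print(df)
```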

We then add all the variables except the text variables and predict.

This time we will use the Gradient Boosting Regressor to predict, with the best hyperparameters found after tuning.

The code snippet below shows how to use it; the internal code is provided in the GitHub link.

The function used here also reports the top 20 most useful features for predicting the target.
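The helper function from the repository is not reproduced here, but a bare-bones version of the same idea with scikit-learn might look like this. The hyperparameters are placeholders rather than the tuned values, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic features; 'views' and 'comment_count' stand in for log-scaled columns.
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "views": rng.uniform(4, 14, size=n),
    "comment_count": rng.uniform(0, 10, size=n),
    "publish_dayofweek": rng.integers(0, 7, size=n),
    "publish_month": rng.integers(1, 13, size=n),
})
# Synthetic target driven mostly by views.
y = 0.9 * X["views"] + 0.1 * X["comment_count"] + rng.normal(0, 0.1, size=n)

# Placeholder hyperparameters, not the tuned ones from the article.
gbr = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
gbr.fit(X, y)

# Rank features by importance and keep at most the top 20.
importances = pd.Series(gbr.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False).head(20)
print(top_features)
```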

As you can see, this time the RMSLE score came down to 665.08, which is a good improvement.

8. Model building — Using Numerical+Categorical+Date+Text Variables

Finally, let’s add the text variables as well and predict.

We will use the most basic method, BOW (Bag of Words), on all textual variables. The example below applies the BOW approach to the textual variable ‘description’.
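A common way to build a Bag of Words representation is scikit-learn’s CountVectorizer. Whether the original code used it is an assumption on our part, and the max_features cap and the three sample descriptions are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy descriptions standing in for the 'description' column.
descriptions = [
    "Official music video",
    "New music every week",
    "Subscribe for more video content",
]

# Each column of the result counts one vocabulary word.
# max_features keeps only the most frequent words (hypothetical cap).
vectorizer = CountVectorizer(max_features=1000)
bow = vectorizer.fit_transform(descriptions)

print("matrix shape:", bow.shape)            # (documents, vocabulary size)
print("vocabulary:", sorted(vectorizer.vocabulary_))

# Total count of the word 'music' across all descriptions.
music_col = vectorizer.vocabulary_["music"]
music_count = int(bow[:, music_col].sum())
```

The result is a sparse matrix, which can be stacked next to the numerical, categorical and date features before training.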