Multi-Class Classification of Research Articles using NLP Methods
Researchers have access to large online archives of scientific articles. As a consequence, finding relevant articles has become increasingly difficult. Tagging or topic modelling provides a way to assign clear identifiers to research articles, which facilitates the recommendation and search process.
The problem statement here is:
Given the abstracts for a set of research articles, predict the tags for each article included in the test set. Note that a research article can possibly have multiple tags. The research article abstracts are sourced from the following 4 topics:
1. Computer Science
2. Mathematics
3. Physics
4. Statistics
The list of possible tags is as follows:
[Analysis of PDEs, Applications, Artificial Intelligence, Astrophysics of Galaxies, Computation and Language, Computer Vision and Pattern Recognition, Cosmology and Nongalactic Astrophysics, … (25 tags in total)]
Table of Contents
- Difference between Multi-Class vs. Multi-Label Classification
- EDA
- Dataset Shape
- Target Distribution
- Model Building — BOW Approach
- Splitting Strategy, One-vs-rest approach, Validating the model
- TFIDF Approach
- Data Cleansing Strategy
- TFIDF Approach on cleansed data
1. Firstly, let's try to understand the difference between Multi-Class Classification and Multi-Label Classification:
Let us take an example where our dataset has 3 classes. In multi-class classification, each record is assigned exactly one class: a day is labelled as a 'Sunny' day, a 'Rainy' day, or a 'Cloudy' day. Here all the classes are mutually exclusive. This represents a multi-class classification problem.
In multi-label classification, a sample may belong to 2 or more categories at once: the same day can be both 'Sunny' and 'Rainy', as shown in the figure. The labels are not mutually exclusive, and more than one can apply to the same record.
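To make the distinction concrete, below is a minimal sketch (the weather labels are hypothetical) of how multi-label targets are typically binarized into one 0/1 column per class, which is exactly the shape our tag columns will take:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Multi-class: exactly one label per record
multi_class_targets = ["Sunny", "Rainy", "Cloudy"]

# Multi-label: possibly several labels per record
multi_label_targets = [{"Sunny", "Rainy"}, {"Cloudy"}]

mlb = MultiLabelBinarizer()
print(mlb.fit_transform(multi_label_targets))  # [[0 1 1], [1 0 0]]
print(mlb.classes_)                            # ['Cloudy' 'Rainy' 'Sunny']
```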
Train Data Dictionary
Test Data Dictionary
Tags to be predicted would be as follows
2. EDA
So we start here with Exploratory Data Analysis. A brief look at the data for Train and Test is as follows.
As you can see, the test data contains only the topics; it does not contain the actual tags to which each record's abstract belongs. Our task is to predict the correct labels from the abstract text of each record, just as we have seen them in the train data.
Unlike binary classification, where there are only 2 classes (0 or 1, for the negative and positive class), here we are dealing with a multi-label classification problem: each abstract can carry one or more of the 25 tags.
3. Dataset Shape
The next step is to check the shape of the data and find the variable and label information.
From the above code we see that there are 6 variables and 25 sub-topics/tags (labels) which need to be predicted.
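The original code appears as a screenshot; a minimal equivalent sketch (the file names train.csv and test.csv are assumptions) would be:

```python
import pandas as pd

train = pd.read_csv("train.csv")   # assumed file name
test = pd.read_csv("test.csv")     # assumed file name

print(train.shape, test.shape)     # (rows, columns) of each split
print(train.columns.tolist())      # variable and tag/label names
```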
4. Target Distribution
We take the train data and calculate the distribution of records against each of the sub-topics. Upon analysis we find that, out of the 25 sub-topics, Machine Learning articles constitute about 27% of the data. Next comes Artificial Intelligence with 9.8%, and so on.
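A sketch of this calculation, assuming the 25 tag columns are 0/1 indicators that follow the first 6 variables:

```python
tag_columns = train.columns[6:]    # assumption: tag columns come after the 6 variables
tag_counts = train[tag_columns].sum().sort_values(ascending=False)
tag_share = 100 * tag_counts / tag_counts.sum()
print(tag_share.head())            # e.g. Machine Learning ~27%, Artificial Intelligence ~9.8%
```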
4(a). Null Value checks
Ideally we want to have a look at the percentage of null values in each variable.
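A one-liner per split is enough for this check:

```python
# Percentage of null values in each variable
print(100 * train.isnull().mean())
print(100 * test.isnull().mean())
```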
ADDITIONAL DESCRIPTIVE EDA
In addition to the existing information, we would like to know how the topics are distributed. Below is a plot of the counts, where we see that Computer Science dominates the other fields.
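A sketch of this plot, assuming the 4 topics are stored as 0/1 indicator columns named as below:

```python
import matplotlib.pyplot as plt

topic_columns = ["Computer Science", "Mathematics", "Physics", "Statistics"]  # assumed names
train[topic_columns].sum().sort_values(ascending=False).plot(kind="bar")
plt.ylabel("Number of articles")
plt.title("Distribution of articles per topic")
plt.show()
```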
VISUAL EDA THROUGH WORDCLOUDS
Using word clouds, let's see what the top words are for a given sub-topic.
The code for the plots below is shown here: Github
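For convenience, a minimal sketch of how such a word cloud can be generated (the 'ABSTRACT' column name and 0/1 tag columns are assumptions):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def plot_wordcloud(df, tag, text_col="ABSTRACT"):
    # Concatenate all abstracts carrying the given tag and render a word cloud
    text = " ".join(df.loc[df[tag] == 1, text_col])
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(tag)
    plt.show()

plot_wordcloud(train, "Machine Learning")
```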
As you can see, words like 'neural', 'networks', and 'deep' all belong to the 'Machine Learning' sub-topic and occur frequently in the dataset. Similarly, 'robot', 'learning', etc. occur frequently in the 'Artificial Intelligence' sub-topic.
Model Building
5. Bag of Words Approach
A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modelling. Here we have a column/variable 'ABSTRACT' which contains sentences (text), and these sentences contain numerous words.
Text: Collection of Words
Word: Adds some kind of meaning to the sentence.
We make a bag containing all the words in our dataset.
Features: how many times a particular word from the bag is present in a sentence.
6(a). Splitting the data between Train, Validation and Test
We will split the data in an 80/20 ratio (train/validation), and then create BOW features using CountVectorizer.
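A sketch of this step (the random_state value and column names are assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# 80/20 split on the abstracts; y holds the 25 binary tag columns
X_train, X_val, y_train, y_val = train_test_split(
    train["ABSTRACT"], train[tag_columns], test_size=0.2, random_state=42
)

vectorizer = CountVectorizer()                  # BOW features: raw word counts
X_train_bow = vectorizer.fit_transform(X_train)
X_val_bow = vectorizer.transform(X_val)         # reuse the vocabulary fitted on train
```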
6(b). One-vs-the-rest (OvR) multiclass/multilabel strategy
Also known as one-vs-all, this strategy consists in fitting one classifier per class. For each classifier, the class is fitted against all the other classes. One advantage of this approach is its interpretability: since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier.
This is the most commonly used strategy for multiclass classification and is a fair default choice.
As our metric we use 'f1_score', obtaining a score of 0.623 on the validation data set. To read more about f1_score, please refer to: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
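A sketch of the OvR model and its evaluation; the LogisticRegression base estimator and the micro-averaged f1 are assumptions, not confirmed by the post:

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# One binary classifier is fitted per tag
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train_bow, y_train)

y_val_pred = clf.predict(X_val_bow)
print(f1_score(y_val, y_val_pred, average="micro"))   # the post reports ~0.623
```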
6(c). Testing the results by taking a few sample records
Once we have the score, we predict on the test data.
Then we write these predictions into another dataframe [ss_pred] to check the results. In the figure below we have taken samples 4 and 5.
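A sketch of producing and inspecting those predictions:

```python
import pandas as pd

# Predict the 0/1 tag matrix for the test abstracts
X_test_bow = vectorizer.transform(test["ABSTRACT"])
ss_pred = pd.DataFrame(clf.predict(X_test_bow), columns=tag_columns)

print(ss_pred.iloc[[4, 5]])        # rows 4 and 5 discussed below
print(test["ABSTRACT"].iloc[4])    # full text of sentence 4
print(test["ABSTRACT"].iloc[5])    # full text of sentence 5
```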
As per the BOW model, sentence 4 belongs to 'Astrophysics of Galaxies' and sentence 5 belongs to 'Artificial Intelligence'; as you can see, they are marked with 1 and all others with 0.
Let's look at the complete sentences 4 and 5 to convince ourselves further.
Looking at the sentences, we are convinced that the model was able to tag the proper sub-topics.
7. TF-IDF Approach
TF-IDF is another NLP feature-extraction approach, apart from BOW:
TF-IDF = Term Frequency × Inverse Document Frequency
Term Frequency = (frequency of the word in the sentence) / (total number of words in the sentence)
Inverse Document Frequency = log((total number of sentences) / (number of sentences containing the word))
We will use scikit-learn's TfidfVectorizer for the TF-IDF vectorization, as sketched below.
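A minimal sketch; note that scikit-learn's TfidfVectorizer uses a smoothed logarithmic IDF internally rather than the plain ratio:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same pipeline as BOW, swapping TF-IDF weights in for raw counts
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_val_tfidf = tfidf.transform(X_val)
```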
The score we get here is 0.630, which is slightly better than the BOW approach.
8. Text Cleaning of Abstract Data
For text cleaning, we use the following 3 steps to cleanse our data (a combined sketch follows the steps):
Step 1: De-contracting Phrases
Here we define a function that searches for contractions like "won't" and replaces them with "will not", "can't" with "cannot", and so on.
Step 2: We take a list of stopwords ['a', 'an', 'the', etc.] to be removed.
Step 3: We define another function which takes the abstract sentence as input, calls the de-contraction function to expand the respective words, and removes any characters that are not alphanumeric [^A-Za-z0-9].
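A combined sketch of the three steps; the contraction map and stopword list below are small illustrative subsets, not the full lists used in the post:

```python
import re

CONTRACTIONS = {"won't": "will not", "can't": "cannot", "n't": " not",
                "'re": " are", "'ll": " will", "'ve": " have"}     # illustrative subset
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in"}      # illustrative subset

def decontract(text):
    # Step 1: expand contractions
    for pattern, expanded in CONTRACTIONS.items():
        text = text.replace(pattern, expanded)
    return text

def clean_abstract(text):
    # Steps 1-3: decontract, strip non-alphanumerics, drop stopwords
    text = decontract(text.lower())
    text = re.sub(r"[^A-Za-z0-9]", " ", text)
    return " ".join(w for w in text.split() if w not in STOPWORDS)

train["clean_abstract"] = train["ABSTRACT"].apply(clean_abstract)
```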
Sample 2 records before Cleansing:
Sample 2 records after cleansing:
As you can see, most of the unnecessary words have been removed and a few have been normalized.
9. Applying TFIDF Vectorization on Cleansed Data
Here we use the cleansed abstract data, and along with it we also add the 4 topic columns to predict the labels.
After fitting the vectorizer and transforming the data, we predict based on the cleansed features, as sketched below.
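A sketch of this step; merging the text features with the 4 topic indicator columns via a sparse hstack is an assumption about how the feature sets are combined:

```python
from scipy.sparse import hstack
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

Xc_train, Xc_val, yc_train, yc_val = train_test_split(
    train[["clean_abstract"] + list(topic_columns)], train[tag_columns],
    test_size=0.2, random_state=42
)

tfidf = TfidfVectorizer()
features_train = hstack([tfidf.fit_transform(Xc_train["clean_abstract"]),
                         Xc_train[topic_columns].values])       # assumes 0/1 topic columns
features_val = hstack([tfidf.transform(Xc_val["clean_abstract"]),
                       Xc_val[topic_columns].values])

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(features_train, yc_train)
print(f1_score(yc_val, clf.predict(features_val), average="micro"))  # the post reports ~0.74
```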
We see that the score has increased to 0.74, which is a big leap.
Finally, we summarize all the approaches:
- BOW approach: f1-score 0.623
- TF-IDF on raw text: f1-score 0.630
- TF-IDF on cleansed text + topics: f1-score 0.74
The full code for this post can be found on Github. I look forward to hearing any feedback or comments.
For a few other case studies, please refer to the URLs below: