Multi-Class Classification of Research Articles using NLP Methods

Oct 24, 2020

Image taken from unsplash.com and modified

Researchers have access to large online archives of scientific articles, and as these archives grow, finding relevant articles becomes increasingly difficult. Tagging, or topic modelling, gives each research article a clear identifying label, which makes recommendation and search easier.

The problem statement is:

Given the abstracts for a set of research articles, predict the tags for each article in the test set. Note that a research article can have multiple tags. The abstracts are drawn from the following 4 topics:

1. Computer Science

2. Mathematics

3. Physics

4. Statistics

The list of possible tags is as follows:

[Analysis of PDEs, Applications, Artificial Intelligence, Astrophysics of Galaxies, Computation and Language, Computer Vision and Pattern Recognition, Cosmology and Nongalactic Astrophysics, ... etc. (25 tags in total)]

Difference between Multi-Class v/s Multi-Label Classification
EDA
Dataset Shape
Target Distribution
Model Building — BOW Approach
Splitting Strategy, One-vs-rest approach, Validating the model
TFIDF Approach
Data Cleansing Strategy
TFIDF Approach on cleansed data

1. First, let's understand the difference between Multi-Class Classification and Multi-Label Classification:

Suppose our dataset has 3 classes. In multi-class classification, each record is labelled as exactly one of a ‘Sunny’ day, a ‘Rainy’ day or a ‘Cloudy’ day. The classes are mutually exclusive, so this is a multi-class classification problem.

In multi-label classification, a single record may belong to 2 or more categories. If a day is ‘Sunny’ and later has ‘Rain’, it carries both labels, as shown in the figure. The labels are not mutually exclusive, so more than one can apply to the same record.
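To make the distinction concrete, here is a minimal sketch using the weather example. The sample days and the helper `to_indicator` are made up for illustration and are not part of the original dataset or code.

```python
# Multi-class: each sample has exactly one label out of the set of classes.
classes = ["Sunny", "Rainy", "Cloudy"]
multiclass_days = ["Sunny", "Cloudy", "Rainy"]

# Multi-label: each sample has a set of labels, possibly more than one.
multilabel_days = [{"Sunny"}, {"Sunny", "Rainy"}, {"Cloudy"}]


def to_indicator(labels, classes):
    """Turn a set of labels into a 0/1 vector with one position per class."""
    return [1 if c in labels else 0 for c in classes]


multiclass_matrix = [to_indicator({day}, classes) for day in multiclass_days]
multilabel_matrix = [to_indicator(day, classes) for day in multilabel_days]

print(multiclass_matrix)  # every row contains exactly one 1
print(multilabel_matrix)  # a row may contain more than one 1
```

The 0/1 matrix in the multi-label case has the same layout as the tag columns we need to predict for each abstract.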

Train Data Dictionary

Test Data Dictionary

Tags to be predicted would be as follows

2. EDA

We start with Exploratory Data Analysis. A brief look at the Train and Test data follows.

Test data sample, with the labels to be predicted boxed in red

As you can see, the test data shows only the Topics, not the tags that each abstract belongs to. Our task is to predict the correct tags for each record from its Abstract text, just as they appear in the Train data.

Unlike binary classification, where there are only 2 classes (0 or 1, negative or positive), we are dealing with a multi-class problem with 25 possible tags. Since an article can carry more than one tag, it is, strictly speaking, also a multi-label problem.

3. Dataset Shape

The next step is to check the shape of the data and find the variable and label information.

Python Code to check the shape of the data

From the above code we see that there are 6 variables and 25 sub-topics (tags, or labels) that need to be predicted.
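Since the original code screenshot is not reproduced here, the following is a rough sketch of what such a shape check could look like with pandas. The toy dataframe and all column names except ‘ABSTRACT’ are assumptions; the real data would be read from a file instead.

```python
import pandas as pd

# Toy stand-in for the Train data (hypothetical columns and values).
train = pd.DataFrame({
    "id": [1, 2, 3],
    "ABSTRACT": ["deep neural networks", "galaxy formation", "robot learning"],
    "Computer Science": [1, 0, 1],          # topic column
    "Physics": [0, 1, 0],                   # topic column
    "Machine Learning": [1, 0, 0],          # tag column
    "Astrophysics of Galaxies": [0, 1, 0],  # tag column
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = train.shape
print(f"rows: {n_rows}, columns: {n_cols}")

# dtypes shows which columns hold text and which hold 0/1 flags.
print(train.dtypes)
```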

4. Target Distribution

We take the Train data and calculate how it is distributed across the Sub-Topics. Of the 25 Sub-Topics, Machine Learning articles make up about 27% of the data, followed by Artificial Intelligence with 9.8%, and so on.

Target Distribution for Multi-Class Classification (Train Data)
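A distribution like this can be computed directly from the 0/1 tag columns. The sketch below uses a tiny made-up table and assumes the percentage means "share of articles carrying the tag"; the original calculation may have been defined differently.

```python
import pandas as pd

# Hypothetical 0/1 tag columns for four articles.
tags = pd.DataFrame({
    "Machine Learning": [1, 0, 1, 1],
    "Artificial Intelligence": [0, 1, 0, 1],
    "Astrophysics of Galaxies": [0, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of rows that carry that tag.
distribution = (tags.mean() * 100).sort_values(ascending=False)
print(distribution)
```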

4(a). Null Value checks

We want to look at the percentage of null values in each variable.
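One common way to get this with pandas is sketched below; the small dataframe is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing values (hypothetical).
df = pd.DataFrame({
    "ABSTRACT": ["some text", None, "more text", "even more"],
    "Physics": [1, 0, np.nan, np.nan],
})

# isnull() gives True/False per cell; the column mean is the null fraction.
null_pct = df.isnull().mean() * 100
print(null_pct)
```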

ADDITIONAL DESCRIPTIVE EDA

In addition, we would like to know how the topics are distributed. The picture below shows the counts, where Computer Science dominates the other fields.

VISUAL EDA THROUGH WORDCLOUDS

Using word clouds, let's see the top words for a given Sub-Topic.

The code for the word clouds below is available here: GitHub

As you can see, words like ‘neural’, ‘networks’ and ‘deep’ occur frequently in the ‘Machine Learning’ Sub-topic. Similarly, ‘robot’, ‘learning’, etc. occur frequently in the ‘Artificial Intelligence’ Sub-topic.
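A word cloud is essentially a picture of word counts. The sketch below is not the linked code; it only computes the counts that such a plot would be based on, using the standard library. The sample abstracts, the small stopword set and the function `top_words` are assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical abstracts tagged 'Machine Learning'.
ml_abstracts = [
    "Deep neural networks for image data",
    "Training deep neural networks",
    "Neural networks and kernels",
]

STOPWORDS = frozenset({"the", "of", "and", "a", "for", "in", "to"})


def top_words(texts, n=3):
    """Count lower-cased alphabetic tokens and return the n most frequent."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


top = top_words(ml_abstracts)
print(top)
```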

Model Building

5. Bag of Words Approach

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modelling. Here we have a column (variable) ‘ABSTRACT’ that contains sentences of text, and each sentence contains many words.

Text: Collection of Words

Word: Adds some kind of meaning to the sentence.

We make a bag containing all the words in our dataset.

Features: how many times a particular word from the bag occurs in our sentence
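The idea above can be sketched in a few lines of plain Python. The two sentences are made up, and splitting on whitespace is a simplification of what a real tokenizer does.

```python
from collections import Counter

sentences = [
    "deep neural networks learn deep features",
    "galaxies form in dark matter halos",
]

# Tokenize each sentence (naive whitespace split).
tokenized = [s.lower().split() for s in sentences]

# The "bag": every distinct word in the dataset, in a fixed order.
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def bow_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one sentence."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


features = [bow_vector(tokens, vocabulary) for tokens in tokenized]

deep_idx = vocabulary.index("deep")
print(vocabulary)
print(features[0][deep_idx])  # count of 'deep' in the first sentence
```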

6(a). Splitting the data between Train, Validation and Test

We split the labelled data in an 80/20 ratio, holding out the 20% for validation, and then create the BOW features using CountVectorizer.
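A minimal sketch of this step with scikit-learn follows. The ten toy abstracts, the labels and `random_state=42` are assumptions; the point is that the vectorizer is fitted on the training part only and then reused on the held-out part.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical abstracts with a single 0/1 label each.
abstracts = [
    "deep neural networks for image recognition",
    "training recurrent neural networks",
    "convolutional networks for vision",
    "gradient descent for deep models",
    "kernel methods and neural networks",
    "galaxy formation in dark matter halos",
    "stellar mass of distant galaxies",
    "spiral galaxies and star formation",
    "observations of galaxy clusters",
    "dust in nearby galaxies",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# 80% for training, 20% held out.
X_train, X_val, y_train, y_val = train_test_split(
    abstracts, labels, test_size=0.2, random_state=42
)

# Learn the vocabulary on the training part only, then reuse it.
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_val_bow = vectorizer.transform(X_val)

print(X_train_bow.shape, X_val_bow.shape)
```

Words that appear only in the held-out part are simply ignored by `transform`, which mirrors what happens with unseen test abstracts.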

6(b). One-vs-the-rest (OvR) multiclass/multilabel strategy

Also known as one-vs-all, this strategy consists in fitting one classifier per class. For each classifier, the class is fitted against all the other classes. One advantage of this approach is its interpretability. Since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier.

This is the most commonly used strategy for multiclass classification and is a fair default choice.

Using OneVsRestClassifier with Logistic Regression for Multi-Class Classification

As the metric we use ‘f1_score’, which gives a score of 0.623 on the validation data set. To read more on f1_score, please refer to: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

6(c). Testing the Results on a few sample records

Once we have the score, we predict on the test data.
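For readers who want to see the moving parts, here is a small hedged sketch of a one-vs-rest logistic regression on BOW features. The six toy abstracts, the two tag columns, `max_iter=1000` and `average="micro"` are all assumptions; the averaging mode used for the 0.623 score is not stated above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical abstracts; columns of Y are
# ['Machine Learning', 'Artificial Intelligence'].
abstracts = [
    "deep neural networks for image classification",
    "gradient descent training of neural networks",
    "robot planning with symbolic reasoning",
    "knowledge representation for autonomous agents",
    "reinforcement learning for robot control",
    "neural networks for planning agents",
]
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]])

X = CountVectorizer().fit_transform(abstracts)

# One binary logistic regression is fitted per tag column.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

# Scored on the same toy data only to show the calls; in practice
# the score should come from the held-out validation split.
pred = clf.predict(X)
score = f1_score(Y, pred, average="micro")
print(pred)
print(score)
```

Because each tag gets its own classifier, `clf.estimators_` holds one fitted model per tag, which is what makes the per-class inspection mentioned above possible.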

Then we write the predictions into another dataframe [ss_pred] to check the results. In the figure below we look at samples 4 and 5.

According to the BOW model, sentence 4 belongs to ‘Astrophysics of Galaxies’ and sentence 5 belongs to ‘Artificial Intelligence’: those columns are marked 1 and all others are 0.

Let's look at the complete sentences 4 and 5 to convince ourselves.

Looking at the sentences, we are convinced that the model assigned the proper sub-topics.
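Reading tags off a 0/1 prediction matrix can be sketched as follows. The prediction values and the three tag names are invented; only the dataframe name `ss_pred` comes from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions: one row per test abstract, one column per tag.
tag_names = ["Artificial Intelligence", "Astrophysics of Galaxies", "Machine Learning"]
predictions = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
])

ss_pred = pd.DataFrame(predictions, columns=tag_names)

# For each row, keep the names of the columns marked with 1.
tags_per_row = [
    [tag for tag, flag in zip(tag_names, row) if flag == 1]
    for row in ss_pred.to_numpy()
]
print(tags_per_row)
```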

7. TF-IDF Approach

TF-IDF = Term-Frequency * Inverse Document Frequency

This is another NLP method, an alternative to the BOW approach.

Term-Frequency = (Frequency of the word in the sentence) / (Total number of words in the sentence)

Inverse Document Frequency = log((Total number of sentences) / (Number of sentences containing the word))
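As a small worked example, the definitions can be coded by hand. This sketch applies a logarithm to the document-frequency ratio, which is the common textbook form; the three tokenized sentences are made up, and the functions assume the word occurs in at least one sentence.

```python
import math

# Hypothetical, already tokenized sentences.
sentences = [
    ["deep", "neural", "networks"],
    ["neural", "networks", "learn"],
    ["galaxies", "form", "slowly"],
]


def term_frequency(word, sentence):
    """Occurrences of the word divided by the sentence length."""
    return sentence.count(word) / len(sentence)


def inverse_document_frequency(word, sentences):
    """Log of (number of sentences / sentences containing the word)."""
    containing = sum(1 for s in sentences if word in s)
    return math.log(len(sentences) / containing)


def tf_idf(word, sentence, sentences):
    return term_frequency(word, sentence) * inverse_document_frequency(word, sentences)


# 'deep' occurs in one sentence, 'neural' in two, so 'deep' scores higher.
deep_score = tf_idf("deep", sentences[0], sentences)
neural_score = tf_idf("neural", sentences[0], sentences)
print(deep_score, neural_score)
```

A word that is rare across the dataset gets a larger weight than a word that appears almost everywhere, which is the main difference from plain counts.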

We will use the sklearn model below for TF-IDF vectorization.

The score we get here is 0.630, which is slightly better than the BOW approach.
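In scikit-learn, the vectorization step could look roughly like the sketch below, with toy abstracts standing in for the real data. Note that `TfidfVectorizer` by default uses a smoothed IDF and L2-normalises each row, so its numbers differ from the plain formulas above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and validation abstracts.
train_abstracts = [
    "deep neural networks for images",
    "neural networks learn features",
    "galaxies form in dark matter halos",
]
val_abstracts = ["deep networks and galaxies"]

# Fit the vocabulary and IDF weights on the training part only.
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_abstracts)
X_val = tfidf.transform(val_abstracts)

# With the default settings each non-empty row has unit L2 norm.
row_norms = np.linalg.norm(X_train.toarray(), axis=1)
print(X_train.shape, X_val.shape, row_norms)
```

The resulting matrices can be passed to the same one-vs-rest classifier in place of the BOW features.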

8. Text Cleaning of Abstract Data

For text cleaning, we use the 3 steps below to cleanse our data.

Step 1: De-contracting Phrases

Here we define a function that searches for contractions such as “won’t” and replaces them with “will not”, “can’t” with “cannot”, etc.

Step 2: We have taken a list of stopwords [‘a’, ‘an’, ‘the’, etc.]

Step 3: We define another function that takes the Abstract sentence as input, calls the de-contracting function to expand the contractions, and removes any characters that are not alphanumeric [^A-Za-z0-9].

Two sample records before cleansing:
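The three steps can be sketched as below. The function names `decontract` and `clean_abstract`, the generic "n't" rule and the decision to drop stopwords inside the cleaning function are assumptions; the original functions are not shown in the text.

```python
import re

# Step 2: a (shortened, hypothetical) stopword list.
STOPWORDS = {"a", "an", "the"}


def decontract(phrase):
    """Step 1: expand common contractions (straight or curly apostrophe)."""
    phrase = re.sub(r"won['’]t", "will not", phrase, flags=re.IGNORECASE)
    phrase = re.sub(r"can['’]t", "cannot", phrase, flags=re.IGNORECASE)
    # Generic fallback, e.g. "doesn't" -> "does not" (an assumption).
    phrase = re.sub(r"n['’]t", " not", phrase)
    return phrase


def clean_abstract(text):
    """Step 3: de-contract, strip non-alphanumerics, drop stopwords."""
    text = decontract(text)
    # Replace every run of characters outside A-Z, a-z, 0-9 with a space.
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    words = [w for w in text.split() if w.lower() not in STOPWORDS]
    return " ".join(words)


cleaned = clean_abstract("We can't say the model won't work!")
print(cleaned)
```

The cleaned text is then vectorized exactly as before, so the cleaning can be swapped in without changing the rest of the pipeline.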