Retail Sales Forecast: Time Series, Basic to Advanced

Vishal K Singh
5 min read · Dec 2, 2021

Retail Sales Forecast

PROBLEM STATEMENT :

A leading nutrition and supplement retail chain offers a comprehensive range of wellness and fitness products. It follows a multi-channel distribution strategy with 350+ retail stores spread across 100+ cities.

Effective forecasting of store sales gives essential insight into upcoming cash flow, so the retail company can plan cash flow more accurately at store level.

Sales data for 18 months from 365 stores is available, along with the Store Type, Location Type and Region Code of each store, whether a discount was offered by the store on each day, and the number of orders per day.

OBJECTIVE :

1) Predict the sales of each store for the next two months.

2) Build time series forecasting models based on past sales and several other categorical features.

TABLE OF CONTENTS:

  1. Data Description
  2. EDA
  3. Feature Engineering
  4. Building models and experimentation
  5. Final comparison and conclusion of the models

1. Data Description

Dataset Information
  1. There are about 8 variables that need exploring; several features can be generated from them and later used for modelling.
  2. The variable ‘Sales’ is the target (y) column that we need to predict.

2. Exploratory Data Analysis:

Before any kind of modelling, we always want to take a look at the data we have.

We have been provided with the following file:

Train.csv: We will use this file to train our model. It contains the variables (features) that we feed to the model and the target variable that we want to predict.

Now let's go ahead and check the data we have.

Train Data
Test Data

Let's also check the shape of the dataset.
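As a minimal sketch of this first look, the snippet below loads a CSV with pandas and prints its head, shape and column types. The inline rows are made-up toy values standing in for Train.csv, and the column names (`Store_id`, `Store_Type`, `Orders` and so on) are assumptions based on the description above, not the actual file header.

```python
import io

import pandas as pd

# Toy stand-in for Train.csv: column names and values are assumptions
# made up for illustration, not rows from the real dataset.
csv_text = """ID,Store_id,Store_Type,Location_Type,Region_Code,Date,Holiday,Discount,Orders,Sales
T1,1,S1,L3,R1,2018-01-01,1,Yes,10,1000.0
T2,2,S4,L2,R1,2018-01-01,1,Yes,60,6000.0
T3,3,S3,L2,R2,2018-01-02,0,No,40,4000.0
"""

# With the real file this would be something like:
# train = pd.read_csv("Train.csv", parse_dates=["Date"])
train = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(train.head())   # first few records
print(train.shape)    # (number of rows, number of columns)
print(train.dtypes)   # which columns are numeric, text or dates
```

Parsing `Date` as a datetime at load time makes the later date-based features easier to derive.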

Plotting Numerical and Categorical Features:

  • Univariate Analysis

Graphs let us see and communicate things more clearly than raw numbers. Let's visualize the data distribution with histplots for the numerical features and countplots for the categorical ones.

  1. Numerical Features Histplot
Histplot for Numerical Features

  2. Categorical Features Countplot

Countplots for categorical variables — Store Type, Location Type, Region Code
Countplots for categorical variables — Holiday , Discount

HYPOTHESIS GENERATION

Simply put, a hypothesis is an analyst's possible view or assertion about the problem they are working on. It may or may not turn out to be true.

  1. Will Store_id play a major role in predicting sales for the next 2 months?
  2. Will Holiday play a major role in the prediction?
  3. Does the number of orders help in forecasting sales for the next 2 months?
  4. Do Store Type, Location Type and Region Code contribute to the target sales prediction?
  5. Does the discount affect the sales prediction?

  • Bi-variate Analysis: Sales vs. other features

Let's perform bi-variate analysis of the target, Sales, against these variables one by one.

From the plots we can infer that:

  1. Region codes R1 and R3 have slightly more sales than R2 and R4.
  2. Location Types L2 and L1 have more sales than the other location types.
  3. Store Types S4 and S3 have more sales than S1 and S2.
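A comparison like the one above can also be checked numerically by grouping on each categorical column and averaging the target. This is only a sketch: the data frame is a toy example with invented values, and `mean_sales_by` is a hypothetical helper, not a function from the original notebook.

```python
import pandas as pd

# Toy data with invented values; column names are assumptions.
df = pd.DataFrame({
    "Region_Code": ["R1", "R2", "R1", "R3", "R2", "R3"],
    "Store_Type":  ["S1", "S1", "S4", "S4", "S2", "S3"],
    "Sales":       [100.0, 50.0, 300.0, 120.0, 150.0, 180.0],
})


def mean_sales_by(frame, column):
    """Average Sales per category of `column`, highest first."""
    return frame.groupby(column)["Sales"].mean().sort_values(ascending=False)


region_means = mean_sales_by(df, "Region_Code")
store_type_means = mean_sales_by(df, "Store_Type")

print(region_means)
print(store_type_means)
```

The same helper can be applied to Location Type, Holiday and Discount to back up what the plots suggest.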

Now let's add more features from the date variable and see how sales vary on certain dates.

From the date variable we can derive many more features, such as day of week, month and a weekend/weekday flag, which may help in forecasting sales.

Once these new features are added, let's analyse them as well to see how strongly they relate to Sales.
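The date-derived features described above could be built roughly as follows. This is a sketch under assumptions: the column is assumed to be called `Date`, and the feature names (`day_of_week`, `day_of_month`, `month`, `is_weekend`) and the helper `add_date_features` are illustrative choices.

```python
import pandas as pd


def add_date_features(frame, date_col="Date"):
    """Return a copy of `frame` with calendar features derived from `date_col`."""
    out = frame.copy()
    dates = pd.to_datetime(out[date_col])
    out["day_of_week"] = dates.dt.dayofweek            # Monday=0 ... Sunday=6
    out["day_of_month"] = dates.dt.day                 # 1..31
    out["month"] = dates.dt.month                      # 1..12
    out["is_weekend"] = (dates.dt.dayofweek >= 5).astype(int)  # Sat/Sun -> 1
    return out


# Friday, Saturday, and a Monday in December
sample = pd.DataFrame({"Date": ["2018-01-05", "2018-01-06", "2018-12-24"]})
features = add_date_features(sample)
print(features)
```

Grouping Sales by any of these new columns then gives the weekday, weekend, monthly and day-of-month comparisons.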

Sales plots by weekday, weekend, month and day of month

From these plots we can infer that:

  1. More sales happen during weekends.
  2. Across months, sales are higher in the holiday season, such as December.
  3. Within a month, the first 5 days have more sales than the other days.

TARGET DISTRIBUTION

Let's plot the target, Sales, and look at its distribution.

The distribution is right-skewed, so we look at the percentiles to learn more.

From the percentiles above, the 90th percentile of sales is 66282 and the 99th percentile is 102159, while the 100th percentile (the maximum) is about 247K, far above the rest of the distribution.

So we could treat anything above the 99th percentile as an outlier and keep only the values between the 1st and 99th percentiles, which should help the model.
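The percentile check and the 1st to 99th percentile filter can be sketched as below. The sales values here are synthetic (drawn from a log-normal distribution purely to get a right-skewed shape), so the printed percentiles will not match the figures quoted above.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "sales" for illustration only.
rng = np.random.default_rng(0)
sales = pd.Series(rng.lognormal(mean=10.5, sigma=0.5, size=1000), name="Sales")

# Upper percentiles: a large gap between the 99th and 100th hints at outliers.
print(sales.quantile([0.90, 0.99, 1.00]))

# Keep only values between the 1st and 99th percentiles (inclusive).
low, high = sales.quantile([0.01, 0.99])
trimmed = sales[sales.between(low, high)]

print(len(sales), "rows before,", len(trimmed), "rows after trimming")
```

On a full data frame the same boolean mask would be applied to the rows, for example `df[df["Sales"].between(low, high)]`.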

Using this filtered data, let's plot the sales distribution again and see whether it has improved.

Let's take the logarithm of the sales and look at the distribution again.

On the log scale the distribution forms a bell curve, close to a normal (Gaussian) distribution.
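The effect of the log transform can be checked with a skewness figure as well as a plot. This sketch uses synthetic right-skewed values, and `np.log1p` (log of 1 + x) is an assumption chosen here because it tolerates zero sales; the post does not say which logarithm was used.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "sales" for illustration only.
rng = np.random.default_rng(0)
sales = pd.Series(rng.lognormal(mean=10.5, sigma=0.5, size=5000), name="Sales")

# log1p(x) = log(1 + x); defined at x = 0, unlike a plain log.
log_sales = np.log1p(sales)

print("skew before:", sales.skew())
print("skew after: ", log_sales.skew())

# If a model is trained on the log target, predictions must be mapped back.
recovered = np.expm1(log_sales)
```

A skewness near zero after the transform is consistent with the bell-shaped curve seen in the plot.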

3. Model Build and Model Evaluation

Now that we have added a good number of features, let's proceed with:

  1. Encoding the categorical variables using pandas.
  2. Splitting the data into train and test sets.
  3. Training the model with a Random Forest algorithm.

Then, using the trained model, let's predict and evaluate on the test split.
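The three steps above can be sketched end to end as follows. Everything specific here is an assumption for illustration: the data is synthetic, the column names are guesses, one-hot encoding via `pd.get_dummies` is one possible reading of "encoding using pandas", and mean absolute error is just one reasonable evaluation metric.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: sales depend on discount and weekend, plus noise.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "Store_Type": rng.choice(["S1", "S2", "S3", "S4"], size=n),
    "Location_Type": rng.choice(["L1", "L2", "L3"], size=n),
    "Discount": rng.choice(["Yes", "No"], size=n),
    "is_weekend": rng.integers(0, 2, size=n),
})
df["Sales"] = (
    30000
    + 8000 * (df["Discount"] == "Yes")
    + 5000 * df["is_weekend"]
    + rng.normal(0, 2000, size=n)
)

# 1. One-hot encode the categorical columns with pandas.
X = pd.get_dummies(
    df.drop(columns="Sales"),
    columns=["Store_Type", "Location_Type", "Discount"],
)
y = df["Sales"]

# 2. Hold out 20% of the rows for evaluation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a Random Forest regressor and evaluate on the held-out rows.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_valid)
mae = mean_absolute_error(y_valid, preds)
print("validation MAE:", round(mae, 2))
```

Because the real data is ordered in time, holding out the most recent weeks instead of a random sample is often a closer match to the actual forecasting task.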

We need to apply the same steps to the test data, for which we have to predict 2 months of sales:

  1. Create the weekday/weekend feature from the test data.
  2. Encode the categorical variables using pandas so that they fit the model.
  3. Drop columns such as ID and Date, as the model does not need them.
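One way to keep the train and test preparation identical is to put the steps above in a single function and align the test columns to the training columns afterwards. This is a sketch with toy rows; `prepare`, `CATEGORICAL` and the column names are hypothetical. The `reindex` step is an addition beyond the three listed steps: it guards against a category that appears in train but not in test.

```python
import pandas as pd

CATEGORICAL = ["Store_Type", "Discount"]  # assumed categorical columns


def prepare(frame, train_columns=None):
    """Apply the same feature steps to train or test data."""
    out = frame.copy()
    # 1. Weekend flag from the date.
    dates = pd.to_datetime(out["Date"])
    out["is_weekend"] = (dates.dt.dayofweek >= 5).astype(int)
    # 3. Drop identifier and raw date columns.
    out = out.drop(columns=["ID", "Date"])
    # 2. One-hot encode the categorical columns.
    out = pd.get_dummies(out, columns=CATEGORICAL)
    # Align to the training layout: add missing dummies as 0, fix the order.
    if train_columns is not None:
        out = out.reindex(columns=train_columns, fill_value=0)
    return out


train = pd.DataFrame({
    "ID": ["T1", "T2", "T3"],
    "Date": ["2019-05-01", "2019-05-02", "2019-05-04"],
    "Store_Type": ["S1", "S2", "S3"],
    "Discount": ["Yes", "No", "Yes"],
})
test = pd.DataFrame({
    "ID": ["T4", "T5"],
    "Date": ["2019-06-01", "2019-06-03"],  # a Saturday and a Monday
    "Store_Type": ["S1", "S1"],
    "Discount": ["No", "No"],
})

X_train = prepare(train)
X_test = prepare(test, train_columns=X_train.columns)
print(X_test)
```

With matching columns, `model.predict(X_test)` can be called on a model trained on `X_train` without a feature mismatch.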

As the screenshot below shows, we are able to predict the sales for each store record.

Sales Predicted

The full code for this post can be found on GitHub. I look forward to hearing any feedback or comments.

For a few other case studies, please see the URL below:

https://vishal-aiml164.github.io/vishal_aiml_portfolio/
