The aim of this work is to analyze a dataset of purchases from an anonymous online store. The analysis consists of data cleaning, exploratory data analysis (EDA), a simple linear regression case, a more complete study of multiple linear regression and, finally, a binary classification problem.
# Installs
#install.packages("caret")
#install.packages("car")
#install.packages("ggplot2")
#install.packages("devtools")
#install.packages("ggcorrplot")
#devtools::install_github("laresbernardo/lares")
#devtools::install_github("ProcessMiner/nlcor")
#install.packages("ISLR2")
#devtools::install_github("kassambara/ggpubr")
#install.packages("zoo")
#install.packages("ROSE")
#install.packages("e1071")
#install.packages("klaR")
#install.packages("treemapify")
#install.packages("glmnet")
# Libraries
library(knitr)
library(tidyverse)
library(ggplot2)
library(forcats)
library(dplyr)
library(ggcorrplot)
library(lares)
library(treemapify)
library(MASS)
library(devtools)
library(ISLR2)
library(ggpubr)
library(zoo)
library(reshape2)
library(ROSE) # for undersampling/oversampling
library(e1071) # for naive Bayes
library(caret)
library(klaR)
library(car) # for calculation of VIF in multiple linear regression
library(glmnet)
library(choroplethr)
library(choroplethrMaps)
library(kableExtra)
library(magrittr)
The dataset has been obtained from Kaggle, the well-known data science platform. It is publicly available for educational purposes and has been downloaded as a CSV file for ease of use. It consists of 9994 purchases made from an anonymous online store and contains the following features:
Row ID => Unique ID for each row.
Order ID => Unique Order ID for the purchase.
Order Date => Order Date of the product.
Ship Date => Shipping Date of the product.
Ship Mode => Shipping Mode specified by the customer.
Customer ID => Unique ID to identify the customer.
Customer Name => Name of the customer.
Segment => The segment to which the customer belongs.
Country => Country where the purchase was made.
City => City where the purchase was made.
State => State where the purchase was made.
Postal Code => Postal Code of the customer.
Region => Region where the customer belongs.
Product ID => Unique ID of the product.
Category => Category of the product ordered.
Sub-Category => Sub-Category of the product ordered.
Product Name => Name of the product.
Sales => Sales of the product.
Quantity => Quantity of the product.
Discount => Discount provided.
Profit => Profit/Loss incurred.
# Read the dataset
df <- read.csv("~/Luis/Universidad/Master_DS/Second_semester/Statistical_Learning/Project/Sample-Superstore.csv",header=TRUE)
#kable(df[1:5,])
df[1:15,] %>%
kable(format = "html", col.names = colnames(df)) %>%
column_spec(c(2:21), width_min = "0.8in") %>%
kable_styling() %>%
kableExtra::scroll_box(width = "100%", height = "300px")
Row.ID | Order.ID | Order.Date | Ship.Date | Ship.Mode | Customer.ID | Customer.Name | Segment | Country | City | State | Postal.Code | Region | Product.ID | Category | Sub.Category | Product.Name | Sales | Quantity | Discount | Profit |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | CA-2016-152156 | 11/8/2016 | 11/11/2016 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | Kentucky | 42420 | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.9600 | 2 | 0.00 | 41.9136 |
2 | CA-2016-152156 | 11/8/2016 | 11/11/2016 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | Kentucky | 42420 | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back | 731.9400 | 3 | 0.00 | 219.5820 |
3 | CA-2016-138688 | 6/12/2016 | 6/16/2016 | Second Class | DV-13045 | Darrin Van Huff | Corporate | United States | Los Angeles | California | 90036 | West | OFF-LA-10000240 | Office Supplies | Labels | Self-Adhesive Address Labels for Typewriters by Universal | 14.6200 | 2 | 0.00 | 6.8714 |
4 | US-2015-108966 | 10/11/2015 | 10/18/2015 | Standard Class | SO-20335 | Sean O’Donnell | Consumer | United States | Fort Lauderdale | Florida | 33311 | South | FUR-TA-10000577 | Furniture | Tables | Bretford CR4500 Series Slim Rectangular Table | 957.5775 | 5 | 0.45 | -383.0310 |
5 | US-2015-108966 | 10/11/2015 | 10/18/2015 | Standard Class | SO-20335 | Sean O’Donnell | Consumer | United States | Fort Lauderdale | Florida | 33311 | South | OFF-ST-10000760 | Office Supplies | Storage | Eldon Fold ’N Roll Cart System | 22.3680 | 2 | 0.20 | 2.5164 |
6 | CA-2014-115812 | 6/9/2014 | 6/14/2014 | Standard Class | BH-11710 | Brosina Hoffman | Consumer | United States | Los Angeles | California | 90032 | West | FUR-FU-10001487 | Furniture | Furnishings | Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood | 48.8600 | 7 | 0.00 | 14.1694 |
7 | CA-2014-115812 | 6/9/2014 | 6/14/2014 | Standard Class | BH-11710 | Brosina Hoffman | Consumer | United States | Los Angeles | California | 90032 | West | OFF-AR-10002833 | Office Supplies | Art | Newell 322 | 7.2800 | 4 | 0.00 | 1.9656 |
8 | CA-2014-115812 | 6/9/2014 | 6/14/2014 | Standard Class | BH-11710 | Brosina Hoffman | Consumer | United States | Los Angeles | California | 90032 | West | TEC-PH-10002275 | Technology | Phones | Mitel 5320 IP Phone VoIP phone | 907.1520 | 6 | 0.20 | 90.7152 |
9 | CA-2014-115812 | 6/9/2014 | 6/14/2014 | Standard Class | BH-11710 | Brosina Hoffman | Consumer | United States | Los Angeles | California | 90032 | West | OFF-BI-10003910 | Office Supplies | Binders | DXL Angle-View Binders with Locking Rings by Samsill | 18.5040 | 3 | 0.20 | 5.7825 |
10 | CA-2014-115812 | 6/9/2014 | 6/14/2014 | Standard Class | BH-11710 | Brosina Hoffman | Consumer | United States | Los Angeles | California | 90032 | West | OFF-AP-10002892 | Office Supplies | Appliances | Belkin F5C206VTEL 6 Outlet Surge | 114.9000 | 5 | 0.00 | 34.4700 |
11 | CA-2014-115812 | 6/9/2014 | 6/14/2014 | Standard Class | BH-11710 | Brosina Hoffman | Consumer | United States | Los Angeles | California | 90032 | West | FUR-TA-10001539 | Furniture | Tables | Chromcraft Rectangular Conference Tables | 1706.1840 | 9 | 0.20 | 85.3092 |
12 | CA-2014-115812 | 6/9/2014 | 6/14/2014 | Standard Class | BH-11710 | Brosina Hoffman | Consumer | United States | Los Angeles | California | 90032 | West | TEC-PH-10002033 | Technology | Phones | Konftel 250 Conference phone - Charcoal black | 911.4240 | 4 | 0.20 | 68.3568 |
13 | CA-2017-114412 | 4/15/2017 | 4/20/2017 | Standard Class | AA-10480 | Andrew Allen | Consumer | United States | Concord | North Carolina | 28027 | South | OFF-PA-10002365 | Office Supplies | Paper | Xerox 1967 | 15.5520 | 3 | 0.20 | 5.4432 |
14 | CA-2016-161389 | 12/5/2016 | 12/10/2016 | Standard Class | IM-15070 | Irene Maddox | Consumer | United States | Seattle | Washington | 98103 | West | OFF-BI-10003656 | Office Supplies | Binders | Fellowes PB200 Plastic Comb Binding Machine | 407.9760 | 3 | 0.20 | 132.5922 |
15 | US-2015-118983 | 11/22/2015 | 11/26/2015 | Standard Class | HP-14815 | Harold Pawlan | Home Office | United States | Fort Worth | Texas | 76106 | Central | OFF-AP-10002311 | Office Supplies | Appliances | Holmes Replacement Filter for HEPA Air Cleaner, Very Large Room, HEPA Filter | 68.8100 | 5 | 0.80 | -123.8580 |
In this section we carry out the cleaning of the data. First we check basic information: missing values (in the form of NaN, NA and NULL), duplicates and the number of unique values, and we determine which columns can be removed and which can be modified or created. Afterwards, we focus on identifying potential outliers.
A summary of the data provides important information about each feature, such as its class, mean, quartiles, maximum and minimum.
summary(df)
## Row.ID Order.ID Order.Date Ship.Date
## Min. : 1 Length:9994 Length:9994 Length:9994
## 1st Qu.:2499 Class :character Class :character Class :character
## Median :4998 Mode :character Mode :character Mode :character
## Mean :4998
## 3rd Qu.:7496
## Max. :9994
## Ship.Mode Customer.ID Customer.Name Segment
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Country City State Postal.Code
## Length:9994 Length:9994 Length:9994 Min. : 1040
## Class :character Class :character Class :character 1st Qu.:23223
## Mode :character Mode :character Mode :character Median :56431
## Mean :55190
## 3rd Qu.:90008
## Max. :99301
## Region Product.ID Category Sub.Category
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Product.Name Sales Quantity Discount
## Length:9994 Min. : 0.444 Min. : 1.00 Min. :0.0000
## Class :character 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000
## Mode :character Median : 54.490 Median : 3.00 Median :0.2000
## Mean : 229.858 Mean : 3.79 Mean :0.1562
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000
## Max. :22638.480 Max. :14.00 Max. :0.8000
## Profit
## Min. :-6599.978
## 1st Qu.: 1.729
## Median : 8.666
## Mean : 28.657
## 3rd Qu.: 29.364
## Max. : 8399.976
Check whether there are NA and NaN values:
is.null(df)
## [1] FALSE
sum(is.na(df))
## [1] 0
sum(is.nan(as.matrix(df)))
## [1] 0
There are no missing values of any type. This is not very surprising, since most datasets available on Kaggle for educational use have usually gone through some cleaning beforehand. It also means that, for now, we still have all 9994 examples available.
Number of duplicates:
sum(duplicated(df))
## [1] 0
No duplicates are present. The same reasoning as for missing values applies.
Let’s have a look at the unique values in each column.
cat(paste("Sub.Category: ", length(unique(df[["Sub.Category"]])) ), sep="\n")
## Sub.Category: 17
cat(paste("Category: ", length(unique(df[["Category"]]))), sep="\n")
## Category: 3
cat(paste("Country: ", length(unique(df[["Country"]]))), sep="\n")
## Country: 1
cat(paste("Region: ", length(unique(df[["Region"]]))), sep="\n")
## Region: 4
cat(paste("State: ", length(unique(df[["State"]]))), sep="\n")
## State: 49
cat(paste("City: ", length(unique(df[["City"]]))), sep="\n")
## City: 531
cat(paste("Ship.Mode: ", length(unique(df[["Ship.Mode"]]))), sep="\n")
## Ship.Mode: 4
cat(paste("Customer.ID: ", length(unique(df[["Customer.ID"]]))), sep="\n")
## Customer.ID: 793
cat(paste("Order.ID: ", length(unique(df[["Order.ID"]]))), sep="\n")
## Order.ID: 5009
There are 17 different sub-categories grouped into 3 categories, 531 cities belonging to 49 states and 4 regions, 4 ship modes and 793 different customers. With 9994 purchases, 5009 unique orders and 793 customers, most customers (if not all) have bought more than one subcategory or placed more than one order. Products from different subcategories bought together appear in different rows of the dataset even though they share the same order number; this follows from the number of unique orders being slightly more than half the number of purchases, together with the absence of duplicates. This is all useful information that we can use later in the Exploratory Data Analysis section.
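As a quick sanity check (a sketch, not part of the original pipeline), we can count the distinct orders and purchase rows per customer:
# Hypothetical check: distinct orders and purchase rows per customer,
# confirming that customers place several orders and buy several subcategories.
orders_per_customer <- df %>%
  group_by(Customer.ID) %>%
  summarise(n_orders = n_distinct(Order.ID), n_rows = n())
summary(orders_per_customer[c("n_orders", "n_rows")])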
We can get rid of some columns. The name of the customer is not very useful when we already have a unique customer ID for each client, and dropping it is also good practice for preserving customer anonymity. The Country column is not relevant either, since there is only data for the USA. Finally, the Postal.Code column won't be needed: we will not carry out the analysis at that level of detail, and the City and State columns will suffice. The Row.ID column is dropped as well, since it is just a running row index.
# Drop Row.ID, Customer.Name, Country and Postal.Code (columns 1, 7, 9 and 12)
df <- df[-c(1,7,9,12)]
Convert date columns to date format:
df$Order.Date <- as.Date(df$Order.Date, format = "%m/%d/%Y")
df$Ship.Date <- as.Date(df$Ship.Date, format = "%m/%d/%Y")
Next, we can create some useful columns. For example, subtracting Order.Date from Ship.Date gives Processing.Time, that is, how long the store took to ship the product after it was ordered. The gross margin is also useful to have; it is calculated as the ratio of Profit to Sales multiplied by 100 (in %). Finally, we can derive three new columns from Order.Date: the year, the month, and a combined year-month value.
# Extract the order year and month as numeric columns
df$Order.Year <- format(df$Order.Date,"%Y")
df$Order.Year <- as.numeric(df$Order.Year)
df$Order.Month <- format(df$Order.Date,"%m")
df$Order.Month <- as.numeric(df$Order.Month)
# Year-month label, stored as a Date fixed on the 15th of each month
df$Order.Year_month <- format(df$Order.Date,"%Y-%m")
df$Order.Year_month <- as.Date(paste(df$Order.Year_month,"-15",sep=""))
# Convert the order date to its numeric representation (days since 1970-01-01)
df$Order.Date <- as.numeric(df$Order.Date)
# Days between order and shipment
df$Processing.Time <- df$Ship.Date - df$Order.Date
df$Processing.Time <- as.numeric(df$Processing.Time)
# Gross margin in % (Profit over Sales)
df$Gross.Margin <- (df$Profit/df$Sales)*100
Let’s have a look at the final structure of the data after doing the aforementioned modifications.
summary(df)
## Order.ID Order.Date Ship.Date Ship.Mode
## Length:9994 Min. :16073 Min. :2014-01-07 Length:9994
## Class :character 1st Qu.:16578 1st Qu.:2015-05-27 Class :character
## Mode :character Median :16978 Median :2016-06-29 Mode :character
## Mean :16921 Mean :2016-05-03
## 3rd Qu.:17300 3rd Qu.:2017-05-18
## Max. :17530 Max. :2018-01-05
## Customer.ID Segment City State
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Region Product.ID Category Sub.Category
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Product.Name Sales Quantity Discount
## Length:9994 Min. : 0.444 Min. : 1.00 Min. :0.0000
## Class :character 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000
## Mode :character Median : 54.490 Median : 3.00 Median :0.2000
## Mean : 229.858 Mean : 3.79 Mean :0.1562
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000
## Max. :22638.480 Max. :14.00 Max. :0.8000
## Profit Order.Year Order.Month Order.Year_month
## Min. :-6599.978 Min. :2014 Min. : 1.00 Min. :2014-01-15
## 1st Qu.: 1.729 1st Qu.:2015 1st Qu.: 5.00 1st Qu.:2015-05-15
## Median : 8.666 Median :2016 Median : 9.00 Median :2016-06-15
## Mean : 28.657 Mean :2016 Mean : 7.81 Mean :2016-04-29
## 3rd Qu.: 29.364 3rd Qu.:2017 3rd Qu.:11.00 3rd Qu.:2017-05-15
## Max. : 8399.976 Max. :2017 Max. :12.00 Max. :2017-12-15
## Processing.Time Gross.Margin
## Min. :0.000 Min. :-275.00
## 1st Qu.:3.000 1st Qu.: 7.50
## Median :4.000 Median : 27.00
## Mean :3.958 Mean : 12.03
## 3rd Qu.:5.000 3rd Qu.: 36.25
## Max. :7.000 Max. : 50.00
Let’s have a look again at the summary of columns that could potentially have some outliers.
summary(df$Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.444 17.280 54.490 229.858 209.940 22638.480
summary(df$Quantity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 3.00 3.79 5.00 14.00
summary(df$Discount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.2000 0.1562 0.2000 0.8000
summary(df$Profit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6599.978 1.729 8.666 28.657 29.364 8399.976
summary(df$Processing.Time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 4.000 3.958 5.000 7.000
summary(df$Order.Year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2014 2015 2016 2016 2017 2017
summary(df$Gross.Margin)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -275.00 7.50 27.00 12.03 36.25 50.00
summary(df$Order.Date)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16073 16578 16978 16921 17300 17530
From the maximum and minimum values alone it is hard to tell whether there are outliers: even though in some cases the maxima are quite far from the 3rd quartile, they are all plausible values for the quantities they describe.
Let’s plot the numerical variables to have a look at their distribution.
hst_Sales <- ggplot(df) + aes(x = Sales) +
geom_histogram(color="white", fill="#80b1d3", bins = 100) +
labs(title="", x="Sales", y="Frequency") +
theme_minimal()+
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28), title = element_text(size = 16))
hst_Quantity <- ggplot(df) + aes(x = Quantity) +
geom_histogram(color="white", fill="#80b1d3", bins = 14) +
labs(title="", x="Quantity", y="Frequency") +
theme_minimal()+
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
title = element_text(size = 16))
hst_Discount <- ggplot(df) + aes(x = Discount) +
geom_histogram(color="white", fill="#80b1d3", bins = 9) +
labs(title="", x="Discount", y="Frequency") +
theme_minimal()+
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
title = element_text(size = 16))
hst_Profit <- ggplot(df) + aes(x = Profit) +
geom_histogram(color="white", fill="#80b1d3", bins = 120) +
labs(title="", x="Profit", y="Frequency") +
theme_minimal()+
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
title = element_text(size = 16))
hst_Processing <- ggplot(df) + aes(x = Processing.Time) +
geom_histogram(color="white", fill="#80b1d3", bins = 8) +
labs(title="", x="Processing time", y="Frequency") +
theme_minimal()+
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
title = element_text(size = 16))
hst_Year <- ggplot(df) + aes(x = Order.Year) +
geom_histogram(color="white", fill="#80b1d3", bins = 4) +
labs(title="", x="Years", y="Frequency") +
theme_minimal()+
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
title = element_text(size = 16))
################################################################################
require(gridExtra)
grid.arrange(hst_Sales, hst_Quantity, hst_Year, hst_Discount, hst_Processing, hst_Profit, ncol = 2, nrow = 3)
Processing time, Years, Discount and Quantity look fine. The latter shows some right skewness, but by our criteria this is not a case of possible outliers. The Sales and Profit plots require some transformation before we can draw a conclusion; a log transformation should do the job.
df_positive_profit <- df %>% filter(Profit >= 0)
df_negative_profit <- df %>% filter(Profit < 0)
plot1 <- ggplot(df_positive_profit) + aes(x = log(Profit)) +
geom_histogram(color="white", fill="#80b1d3", bins = 12) +
labs(title="", x="Log(Pos. Profit)", y="Frequency") +
theme_minimal()+
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28), title = element_text(size = 16))
plot2 <- ggplot(df_negative_profit) + aes(x = log(-Profit)) +
geom_histogram(color="white", fill="#80b1d3", bins = 12) +
labs(title="", x="Log(Neg. Profit)", y="Frequency") +
theme_minimal()+
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28), title = element_text(size = 16))
plot3 <- ggplot(df) + aes(x = log(Sales)) +
geom_histogram(color="white", fill="#80b1d3", bins = 30) +
labs(title="", x="log(Sales)", y="Frequency") +
theme_minimal()+
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28), title = element_text(size = 16))
df_profit <- df[!(df$Profit < 600 & df$Profit >= -600 ),]
plot4 <- ggplot(df_profit) + aes(x = Profit) +
geom_histogram(color="white", fill="#80b1d3") +
labs(title="", x="Profit (omitting central values)", y="Frequency") +
theme_minimal()+
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28), title = element_text(size = 16))
################################################################################
require(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, ncol = 2, nrow = 2)
The log of Sales looks fairly Gaussian to the eye and the skewness is gone. The variable Profit takes both negative and positive values, so we split it into two histograms, taking the absolute value of the negative part. We also show the Profit distribution without any transformation but hiding the central, most frequent values so that the tails are easier to see. There are no strong indications of outliers for any of these quantities.
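To back this visual impression with a number, one option is the sample skewness; a minimal sketch using the skewness() function from the already-loaded e1071 package:
# Sketch: sample skewness of Sales before and after the log transformation
skewness(df$Sales)
skewness(log(df$Sales))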
Finally, we can check the box plots of all numeric variables. Box plots are one of the most useful tools in outliers identification.
plot1 <- ggplot(df, aes(y = Quantity)) +
geom_boxplot(color="black", fill="#80b1d3") + theme_minimal() +
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28), title = element_text(size = 16))
plot2 <- ggplot(df, aes(y = Discount)) +
geom_boxplot(color="black", fill="#80b1d3") + theme_minimal() +
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28), title = element_text(size = 16))
plot3 <- ggplot(df, aes(y = Sales)) +
geom_boxplot(color="black", fill="#80b1d3") + theme_minimal() +
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28), title = element_text(size = 16))
plot4 <- ggplot(df, aes(y = Profit)) +
geom_boxplot(color="black", fill="#80b1d3") + theme_minimal() +
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28), title = element_text(size = 16))
plot5 <- ggplot(df, aes(y = Processing.Time)) +
geom_boxplot(color="black", fill="#80b1d3") + theme_minimal() +
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28), title = element_text(size = 16))
# Alternatively we can also separate the quantitative variable per the qualitative one, for example:
plot6 <- ggplot(df, aes(x = Category, y = Discount)) +
geom_boxplot(color="black", fill="#80b1d3") + theme_minimal() +
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28), title = element_text(size = 16))
plot7 <- ggplot(df, aes(x = Region, y = Sales)) +
geom_boxplot(color="black", fill="#80b1d3") + theme_minimal() +
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28), title = element_text(size = 16))
plot8 <- ggplot(df, aes(x = Category, y = Profit)) +
geom_boxplot(color="black", fill="#80b1d3") + theme_minimal() +
theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28), title = element_text(size = 16))
################################################################################
require(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, plot7, plot8, ncol = 4, nrow = 2)
Except for Processing Time, almost all variables show potential outliers; Sales and Profit stand out, while Quantity and Discount present only a few. As seen at the beginning, the maximum discount is 0.8, which, although rare, is a value that can occur on specific occasions. The maximum Quantity is 14, which is also perfectly possible, especially for items usually purchased in bulk and given that we have several types of customers: those buying for their business or company would be expected to buy in larger quantities.
For these reasons and those stated above, we believe we do not have enough information about the store and how it operates to determine whether some of the purchases can truly be classified as outliers. In the absence of a safe and definitive criterion, we choose not to remove any of the potential outlier points.
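For reference only, a minimal sketch of how the conventional 1.5 × IQR rule could be used to count (not remove) the points that would be flagged in Sales and Profit:
# Sketch: count observations outside the 1.5 * IQR whiskers; no rows are dropped
iqr_flags <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}
sapply(df[c("Sales", "Profit")], iqr_flags)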
As a starting point to develop the following sections of the work, we will try to answer the following questions.
Which subcategories are more popular? Are those the ones that generate more profit?
Do different segments prefer different ship modes?
In which states are there more orders? And in which cities? Do they correspond to the ones that generate more profit?
Who are the top-10 clients according to purchases and to which segments do they belong? Are they the ones that generate the most profit?
Do different segments buy some categories more than others?
Is there a difference between the profit generated by each segment in the different categories? And between the profit generated in each region in each category?
Does the quantity of products bought change in each category when there is a discount?
How does the profit change for different discount ranges? What would be a threshold discount for positive/negative profits?
How do orders behave throughout the year? Can we identify some patterns?
Do sales and profits behave similarly for every month?
Which segment is generating more sales? In which region?
Do we observe a different behavior for holiday periods (November and December in the USA)?
What is the relation between profit and sales? Is it linear?
How accurately can we estimate this relation?
How accurately can we predict future profits based on sales?
What features are more relevant when determining the profit?
Is there any interaction among the different features?
Is it possible to predict whether a purchase will result in a positive or negative profit? Which features are most relevant for this task? How accurate is the prediction?
summary(df)
## Order.ID Order.Date Ship.Date Ship.Mode
## Length:9994 Min. :16073 Min. :2014-01-07 Length:9994
## Class :character 1st Qu.:16578 1st Qu.:2015-05-27 Class :character
## Mode :character Median :16978 Median :2016-06-29 Mode :character
## Mean :16921 Mean :2016-05-03
## 3rd Qu.:17300 3rd Qu.:2017-05-18
## Max. :17530 Max. :2018-01-05
## Customer.ID Segment City State
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Region Product.ID Category Sub.Category
## Length:9994 Length:9994 Length:9994 Length:9994
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Product.Name Sales Quantity Discount
## Length:9994 Min. : 0.444 Min. : 1.00 Min. :0.0000
## Class :character 1st Qu.: 17.280 1st Qu.: 2.00 1st Qu.:0.0000
## Mode :character Median : 54.490 Median : 3.00 Median :0.2000
## Mean : 229.858 Mean : 3.79 Mean :0.1562
## 3rd Qu.: 209.940 3rd Qu.: 5.00 3rd Qu.:0.2000
## Max. :22638.480 Max. :14.00 Max. :0.8000
## Profit Order.Year Order.Month Order.Year_month
## Min. :-6599.978 Min. :2014 Min. : 1.00 Min. :2014-01-15
## 1st Qu.: 1.729 1st Qu.:2015 1st Qu.: 5.00 1st Qu.:2015-05-15
## Median : 8.666 Median :2016 Median : 9.00 Median :2016-06-15
## Mean : 28.657 Mean :2016 Mean : 7.81 Mean :2016-04-29
## 3rd Qu.: 29.364 3rd Qu.:2017 3rd Qu.:11.00 3rd Qu.:2017-05-15
## Max. : 8399.976 Max. :2017 Max. :12.00 Max. :2017-12-15
## Processing.Time Gross.Margin
## Min. :0.000 Min. :-275.00
## 1st Qu.:3.000 1st Qu.: 7.50
## Median :4.000 Median : 27.00
## Mean :3.958 Mean : 12.03
## 3rd Qu.:5.000 3rd Qu.: 36.25
## Max. :7.000 Max. : 50.00
First of all, let's see the proportion that each subcategory represents overall and within each category. Binders, Paper, Furnishings and Phones (in that order) are the most popular subcategories. The least popular are Copiers and Machines, followed by Bookcases, Envelopes, Fasteners and Supplies. Among categories, Office Supplies is the most represented one, while Technology and Furniture have a similar number of items.
plotdata <- df %>% count(Sub.Category) %>% arrange(desc(Sub.Category)) %>%
mutate(prop = round(n*100/sum(n), 1), lab.ypos = cumsum(prop) - 0.5*prop)
plotdata$label <- paste0(plotdata$Sub.Category, "\n", round(plotdata$prop), "%")
plot1 <- ggplot(plotdata, aes(x = "", y = prop, fill = Sub.Category)) +
geom_bar(width = 1, stat = "identity", color = "white") +
coord_polar(theta = "y", start = 1.4, direction = -1) +
theme(legend.position = "none", axis.text=element_text(size=32))+
scale_y_continuous(breaks=cumsum(plotdata$prop) - plotdata$prop / 2,
labels= plotdata$label) + labs(y = "", x = "", title = "") + theme_minimal()+ theme(axis.text=element_text(size=24), axis.title.x = element_text(size =28), axis.title.y = element_text(size = 28), legend.text = element_text(size =24),
legend.title = element_text(size =28))
################################################################################
df_cat <- df %>% group_by(Category, Sub.Category) %>% summarize(count=n())
plot2 <- ggplot(df_cat, aes(area=count, label=Sub.Category, fill=Category, subgroup=Category))+
geom_treemap()+
geom_treemap_subgroup_border(colour = "white", size = 5)+
geom_treemap_subgroup_text(color='white', alpha=0.3, size=48,
place='center', fontface='bold')+
geom_treemap_text(color='white', place='center', size=34) +
theme(legend.text = element_text(size = 24),
legend.title = element_text(size = 28))
################################################################################
require(gridExtra)
grid.arrange(plot1, plot2, ncol = 2, nrow = 1)
Now we construct a set of bar plots to better understand some relations between the features in the dataset. First, we plot the top-15 states and cities by number of orders. California, New York and Texas are the states with the most orders, and most of these states belong to the top-15 most populated states in the USA for the years of interest. The top-15 cities are mostly, as expected, cities of the states with the most orders; Los Angeles and New York City are the top two, as they are also the most populated cities in the USA. The most purchased category for all states and cities is Office Supplies.
The most popular shipping mode is Standard Class for all segments, followed by Second Class and First Class. This is also true within each category. The number of orders in the Office Supplies category exceeds the combined orders of the other two categories.
The South region has the fewest orders and the West region the most. Consumer is the segment that places the most orders in every region, and Home Office the one that places the fewest.
Regarding profit, East and West are the regions that generate the most. Consumer clients account for the largest profits but also the largest losses.
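These readings can also be checked numerically with a simple cross-tabulation of order counts by region and segment (a sketch complementing the plots below):
# Sketch: number of orders per region and segment
table(df$Region, df$Segment)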
top_states <- tail(names(sort(table(df$State))),15)
df_top_states <- df %>% filter(df$State %in% top_states)
plot1 <- ggplot(df_top_states, aes(x = State, fill = Category)) +
geom_bar(position = "stack", bins = 15) +
labs(title="", x="State", y="Orders") +
theme_minimal()+
theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45),
axis.title.x = element_text(size =28), axis.title.y=element_text(size=28),
plot.title = element_text(size = 24, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size =28)) +
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))
################################################################################
top_cities <- tail(names(sort(table(df$City))),15)
df_top_cities <- df %>% filter(df$City %in% top_cities)
plot2 <- ggplot(df_top_cities, aes(x = City, fill = Category)) +
geom_bar(position = "stack", bins = 15) +
labs(title="", x = "City", y="Orders") +
theme_minimal() +
theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size=28),
plot.title = element_text(size = 24, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28))+
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))
################################################################################
df2 <- df %>% group_by(Category)
plot3 <- ggplot(data = df2,aes(Category, fill=`Ship.Mode`))+
geom_bar(position = 'stack')+
labs(title='', x='Category', y='Orders') +
theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 24, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size =28)) +
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))
################################################################################
df2 <- df %>% group_by(Segment)
plot4 <- ggplot(data = df2, aes(Segment, fill=`Ship.Mode`))+
geom_bar(position = 'stack')+
labs(title='', x='Segment', y='Orders') +
theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 24, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28))+
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))
################################################################################
df2 <-df %>% group_by(Region)
plot5 <- ggplot(data = df2, aes(Region, fill=`Segment`))+
geom_bar(position = 'stack')+
theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 24, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28))+
labs(title='', x='Region', y='Orders') +
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))
################################################################################
plot6 <- ggplot(df, aes(x=Region, y=Profit, fill = Segment)) +
geom_bar(stat = "identity") + labs(x="Region", y="Profit") + theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 16, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))+
scale_y_continuous(label = scales::dollar)+
geom_hline(yintercept=0, linetype="dashed")
################################################################################
require(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, ncol = 2, nrow = 3)
In the next plots we represent the total number of orders, sales and profit generated in each month. During the last months of the year both sales and orders are higher than at the beginning: up to three times more orders and sales are made in November and December compared to January and February. Both positive and negative profits also increase towards the last months of the year. The number of orders increases over the years, while the proportions of the shipping modes used remain roughly the same.
# Purchases, sales and benefits for years and months. In what months do people usually buy more?
# Are the sales generally increasing over time? Are there more sales on Christmas, Black Friday...?
plot1 <- ggplot(df, aes(x=Order.Month, fill = Category))+
geom_bar(position = "stack", bins = 12)+
labs(title="", x = "Month", y="Orders")+
theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 24, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28))+
scale_x_discrete(limits= seq(1,12))+
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))
################################################################################
plot2 <- ggplot(df, aes(x=Order.Month, y=Sales, fill = Category))+
geom_bar(stat = "identity", bins = 12) + labs(title="", x="Month", y="Sales") +
theme_minimal() + theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
plot.title = element_text(size = 16, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
scale_x_discrete(limits= seq(1,12)) + scale_y_continuous(label = scales::dollar) +
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))
################################################################################
plot3 <- ggplot(df, aes(x=Order.Month, y= Profit, fill = Category))+
geom_bar(stat = "identity") + labs(title="", x="Month", y="Profit") +
theme_minimal() + theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
plot.title = element_text(size = 16, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
scale_x_discrete(limits= seq(1,12)) + scale_y_continuous(label = scales::dollar) +
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))+
geom_hline(yintercept=0, linetype="dashed")
################################################################################
# Is there any preference on shipment methods over the years?
plot4 <- ggplot(df, aes(x=Order.Year, fill = Ship.Mode))+
geom_bar(position = "stack", bins = 4)+
labs(title="", x = "Year", y="Orders")+
theme_minimal()+
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 24, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))
require(gridExtra)
grid.arrange(plot2, plot1, plot3, plot4, ncol = 2, nrow = 2)
Next, we study which subcategories yield the highest net profits and which the highest net losses (summing positive and negative profits in both cases). Copiers is among the top-5 subcategories by net profit, even though it is one of the least popular subcategories, as we saw at the beginning of this section. Tables and Machines are subcategories that need to be watched closely, given that their losses are very high.
df_subcat <- df %>% group_by(Sub.Category) %>% summarise(prof = sum(Profit))
df_subcat <- df_subcat[order(df_subcat$prof),]
top_subcat_neg <- head(df_subcat$Sub.Category, 5)
top_subcat_pos <- tail(df_subcat$Sub.Category, 5)
df_top_pos <- df %>% filter(Sub.Category %in% top_subcat_pos)
plot_top_pos <- ggplot(df_top_pos, aes(x = Sub.Category, y = Profit, fill = Segment)) +
geom_bar(stat="identity", position = "stack", bins = 5) +
labs(x = "Subcategory", y= "Profit") +
theme_minimal() + scale_y_continuous(label = scales::dollar) +
theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
plot.title = element_text(size = 38, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
ggtitle("Top subcategories with positive profit") +
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))+
geom_hline(yintercept=0, linetype="dashed")
################################################################################
df_top_neg <- df %>% filter(Sub.Category %in% top_subcat_neg)
plot_top_neg <- ggplot(df_top_neg, aes(x = Sub.Category, y = Profit, fill = Segment)) +
geom_bar(stat = "identity", position = "stack", bins = 5) +
labs(x = "Subcategory", y= "Profit") + scale_y_continuous(label = scales::dollar) +
theme_minimal() +
theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
plot.title = element_text(size = 38, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
ggtitle("Top subcategories with negative profit") +
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))+
geom_hline(yintercept=0, linetype="dashed")
################################################################################
require(gridExtra)
grid.arrange(plot_top_pos, plot_top_neg, ncol = 2, nrow = 1)
The map below shows which states are associated with the highest number of orders, complementing the earlier information about the top-15 states by number of orders. As expected, the most populated states are those where the most orders are made: California, Texas, Florida, New York, Pennsylvania… Likewise, the most sparsely populated states are associated with fewer purchases.
df_mapusa <- df %>% group_by(State)
df_mapusa <- df_mapusa %>% summarize(orders = n())
df_mapusa$region <- tolower(df_mapusa$State)
df_mapusa$value <- df_mapusa$orders
continental_us_states <- names(table(df_mapusa$region))
state_choropleth(df_mapusa, num_colors=9, zoom = continental_us_states) +
scale_fill_brewer(palette="YlOrBr") +
labs(fill = "Total orders")
We can now check which states generate the most and the least profit and whether they correspond to the ones with the most purchases. Indeed, some states such as California, New York and Washington generate both a large profit and a large number of orders. Nevertheless, states such as Florida, Texas, Pennsylvania and Colorado, despite a high number of purchases, generate very low or even negative profit. It is also remarkable that states like Nevada and Montana generate a decent amount of profit even though relatively few purchases are made there.
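A small sketch (not in the original analysis) that puts both quantities side by side, listing the states with the most orders together with the total profit they generate:
# Sketch: orders and total profit per state, sorted by number of orders
state_orders_profit <- df %>%
  group_by(State) %>%
  summarise(orders = n(), total_profit = sum(Profit)) %>%
  arrange(desc(orders))
head(state_orders_profit, 10)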
df_mapusa <- df %>% group_by(State) %>% summarize(profitss = sum(Profit))
df_mapusa$region <- tolower(df_mapusa$State)
df_mapusa$value <- df_mapusa$profitss
continental_us_states <- names(table(df_mapusa$region))
state_choropleth(df_mapusa, num_colors=9, zoom = continental_us_states) +
scale_fill_brewer(palette="YlOrBr") +
labs(fill = "Total profit")
Next, we look at how the store has performed over the four years our data covers by studying how sales and profits have evolved (left bar plot below). During the first two years sales remained roughly constant while profit increased slightly, showing that the store was able to improve its gross margin. In the following two years both sales and profits grew to a greater extent.
We are also interested in the impact that the holiday season in the USA (the last two months of the year) has on the number of orders (right bar plot below). First, the number of orders increases over the years in both periods. Furthermore, the number of orders during the holiday season is almost half of the number for the rest of the year, and this ratio remains roughly constant over the years. Notice that we are comparing two months against ten, so the holiday season has a big impact on the number of orders placed in the store.
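To compare the two periods on the same per-month footing, a small sketch (not part of the original code) that divides the order counts by the length of each period:
# Sketch: average orders per month, holiday period (2 months) vs rest of year (10 months)
df %>%
  mutate(Period = ifelse(Order.Month %in% 11:12, "Holiday", "Rest of year")) %>%
  group_by(Order.Year, Period) %>%
  summarise(orders_per_month = n() / ifelse(first(Period) == "Holiday", 2, 10))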
Sales_per_year <- df %>% group_by(Order.Year) %>% summarise(s = sum(Sales))
Profit_per_year <- df %>% group_by(Order.Year) %>% summarise(p = sum(Profit))
Sales <- Sales_per_year$s
Profit <- Profit_per_year$p
Years <- c("2014", "2015", "2016", "2017")
df_growthsales_profit <- data.frame(Sales, Profit, Years)
df_growthsales_profit2 <- melt(df_growthsales_profit, id.vars="Years")
plot_year <- ggplot(df_growthsales_profit2, aes(x=Years, y=value, fill=variable))+
geom_bar(stat="identity", position="dodge") +
labs(title="Sales and Profit per year", x = "Years", y="USD") +
theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 38, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
scale_fill_manual(values = c("#da6474", "#80b1d3", "#fdb462", "#7fc97f"))
###############################################################################
Order_preholiday <- df %>% filter(Order.Month %in% 1:10) %>%
group_by(Order.Year) %>% summarise(ord = n())
Order_holiday <- df %>% filter(Order.Month %in% 11:12) %>%
group_by(Order.Year) %>% summarise(ord1= n())
Pre_holiday <- Order_preholiday$ord
Holiday <- Order_holiday$ord1
Years <- c("2014", "2015", "2016", "2017")
df_growthorder <- data.frame(Pre_holiday, Holiday, Years)
df_growthorder2 <- melt(df_growthorder, id.vars='Years')
plot_hol <- ggplot(df_growthorder2, aes(x=Years, y=value, fill=variable)) +
geom_bar(stat='identity', position='dodge') +
labs(title="Orders Pre-Holiday vs Holiday", x = "Years", y="Orders") +
theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 38, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
scale_fill_manual(values = c("#da6474", "#80b1d3", "#fdb462", "#7fc97f" ))
################################################################################
require(gridExtra)
grid.arrange(plot_year, plot_hol, ncol = 2, nrow = 1)
Similar to the analysis carried out before, we can check how the sales and profits generated by each category behave over the years. The first thing that catches our attention is that profit has increased over the years for the Office Supplies and Technology categories, but not for Furniture, for which it has fluctuated around the same value. Sales, however, generally show a tendency to increase over the years. Office Supplies and Technology behave very similarly.
Technology_sales <- df %>% filter(str_detect(Category, 'Technology')) %>%
group_by(Order.Year) %>% summarise(Sales_technology = sum(Sales))
Technology_profit <- df %>% filter(str_detect(Category, "Technology")) %>%
group_by(Order.Year) %>% summarise(Profit_technology = sum(Profit))
Years <- c("2014", "2015", "2016", "2017")
Sales <- Technology_sales$Sales_technology
Profit <- Technology_profit$Profit_technology
df_technology_sp <- data.frame(Sales, Profit, Years)
df_technology_sp2 <- melt(df_technology_sp, id.vars="Years")
plot_technology <- ggplot(df_technology_sp2, aes(x=Years, y=value, fill=variable)) +
geom_bar(stat="identity", position="dodge") +
labs(title="Sales and Profit for Technology", x = "Years", y="USD") +
theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 38, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
scale_fill_manual(values = c("#da6474", "#80b1d3", "#fdb462", "#7fc97f" ))
############################################################################
Furniture_sales <- df %>% filter(str_detect(Category, 'Furniture')) %>%
group_by(Order.Year) %>% summarise(Sales_furniture = sum(Sales))
Furniture_profit <- df %>% filter(str_detect(Category, "Furniture")) %>%
group_by(Order.Year) %>% summarise(Profit_furniture = sum(Profit))
Profit <- Furniture_profit$Profit_furniture
Sales <- Furniture_sales$Sales_furniture
df_furniture_sp <- data.frame(Sales, Profit, Years)
df_furniture_sp2 <- melt(df_furniture_sp, id.vars="Years")
plot_furniture <- ggplot(df_furniture_sp2, aes(x=Years, y=value, fill=variable)) +
geom_bar(stat="identity", position="dodge") +
labs(title="Sales and Profit for Furniture", x = "Years", y="USD") +
theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 38, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
scale_fill_manual(values = c("#da6474", "#80b1d3", "#fdb462", "#7fc97f" ))
#############################################################################
Office_supplies_sales <- df %>% filter(str_detect(Category, 'Office Supplies')) %>%
group_by(Order.Year) %>% summarise(Sales_office_supplies = sum(Sales))
Office_supplies_profit <- df %>% filter(str_detect(Category, "Office Supplies")) %>%
group_by(Order.Year) %>% summarise(Profit_office_supplies = sum(Profit))
Profit <- Office_supplies_profit$Profit_office_supplies
Sales <- Office_supplies_sales$Sales_office_supplies
df_office_supplies_sp <- data.frame(Sales, Profit, Years)
df_office_supplies_sp2 <- melt(df_office_supplies_sp, id.vars="Years")
plot_office <- ggplot(df_office_supplies_sp2, aes(x=Years, y=value, fill=variable)) +
geom_bar(stat="identity", position="dodge") +
labs(title="Sales and Profit for Office Supplies", x = "Years", y="USD") +
theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 38, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
scale_fill_manual(values = c("#da6474", "#80b1d3", "#fdb462", "#7fc97f" ))
################################################################################
require(gridExtra)
grid.arrange(plot_technology, plot_furniture, plot_office, ncol = 3, nrow = 1)
Next, we analyze four plots that help us understand how profit changes across categories, regions and segments, as well as the relation between discount and quantity bought.
df1 <- df %>% group_by(Segment,Category) %>% summarise(n=median(Profit))
plot1 <- ggplot(df1, aes(Segment, Category, fill=n))+
scale_fill_distiller( direction = 1)+ geom_tile(color='white')+
geom_text(aes(label=paste(round(n,0),'$')), color = 'black', size=10)+
labs(title='', fill='Median Profit') + theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 16, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28))
################################################################################
df2 <- df %>% group_by(Region,Category) %>% summarise(n=median(Profit))
plot2 <- ggplot(df2, aes(Region, Category, fill=n))+
scale_fill_distiller( direction = 1)+ geom_tile(color='white')+
geom_text(aes(label=paste(round(n,0),'$')), color = 'black', size=10)+
labs(title='', fill='Median Profit') + theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 16, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28))
################################################################################
# df3 <- df %>% mutate(Discount = cut_width(Discount,0.2,boundary=0))
df3 <- df %>% mutate(Discount = cut(Discount, c(0, 0.1, 0.2, 0.8), include.lowest = TRUE))
df3 <- df3 %>% group_by(Discount, Sub.Category) %>% summarise(n = mean(Quantity))
plot3 <- ggplot(df3, aes(Discount, Sub.Category, fill= n))+
scale_fill_distiller( direction = 1) + geom_tile(color='white')+
geom_text(aes(label=paste(round(n,2))), color = 'black', size=10)+
labs(title='', x = "Discount", fill='Avg. Quantity') + theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 16, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28))
################################################################################
df4 <- df %>% mutate(Discount = cut_width(Discount,0.15,boundary=0))
df4 <- df4 %>% group_by(Discount)
plot4 <- ggplot(df4, aes(x=Discount, y= Profit, fill = Category))+
geom_bar(stat = "identity", position = "stack") + labs(title="", x="Discount", y="Profit") +
theme_minimal()+
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 16, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
scale_y_continuous(label = scales::dollar) +
scale_fill_manual(values = c("#da6474", "#80b1d3", "#fdb462", "#7fc97f" )) +
geom_hline(yintercept=0, linetype="dashed")
################################################################################
require(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, ncol = 2, nrow = 2)
Top left
The median profit generated by each segment oscillates around the same value within each category, and is much larger for Technology than for Office Supplies and Furniture.
Top right
West is the region with the highest median profit, although in the South both Technology and Furniture show the same median values. For the Central region, Furniture has a negative median profit and Office Supplies shows the smallest positive median profit across all regions and categories.
Bottom left
For some subcategories the quantity bought does not change whether there is a discount or not; clients buy the same quantity on average. For others, such as Binders, Bookcases and Fasteners, the quantity bought tends to increase as the discount increases.
Bottom right
Profit changes when a discount is applied: from roughly a 30% discount onwards, the aggregate profit turns negative.
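The threshold can also be read directly from a simple aggregation of profit by discount level; a minimal sketch:
# Sketch: total profit at each discount level, to locate where it turns negative
df %>%
  group_by(Discount) %>%
  summarise(total_profit = sum(Profit)) %>%
  arrange(Discount)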
A study of the best and worst clients is important to understand how to maximize profits and look after the most valuable clients, but also to know how to minimize losses. The first bar plot shows the top-10 clients by number of purchases over the four years. Each of them placed around 30 orders, with customer WB-21850 placing the most. These clients make most of their purchases in the Office Supplies category, and 6 out of 10 of them belong to the Consumer segment.
The next step is to analyze the top-10 clients that generate the most profit and the most losses (middle left and right bar plots, respectively). The most profitable clients generate around 5000 USD each, with TC-20980 (Corporate) being the best client in terms of profit. Most of their purchases fall in the Technology category for some clients and in Office Supplies for others. Six out of ten of these clients belong to the Consumer segment.
Clients that generate the largest negative profit cause losses ranging from roughly 2000 to 6500 USD. By far, the client causing the most losses is CS-12505 (Consumer). As with the most profitable clients, the losses are mostly related to the Technology and Office Supplies categories. Only two of these clients are from the Home Office segment, four from the Consumer segment and four from the Corporate segment.
We can also observe that the top-10 clients by number of orders generally generate much less profit than the most profitable clients; in fact, some of them generate negative profit. This is important to check, because a client who makes many purchases but ultimately causes losses is not profitable.
Finally, we plot the top-10 clients by money spent. Remarkably, none of the clients that generate the most profit appears in this list. There is, however, one client in this list who is also among the top-10 clients by number of purchases; this client is the one who has spent the most money over the four years and belongs to the Corporate segment.
top_buyers <- tail(names(sort(table(df$Customer.ID))), 10)
df_top_buyers <- df %>% filter(df$Customer.ID %in% top_buyers)
segment_purch <- df_top_buyers %>% group_by(Customer.ID, Segment) %>%
summarize(Sales = sum(Sales), Profit = sum(Profit), Number.Orders = n())
#kable(segment_purch[1:10,])
segment_purch[1:10,] %>%
kable() %>%
kable_styling(full_width = F)
Customer.ID | Segment | Sales | Profit | Number.Orders |
---|---|---|---|---|
CK-12205 | Consumer | 3154.855 | 141.2831 | 32 |
EH-13765 | Corporate | 10310.880 | 1393.5154 | 32 |
EP-13915 | Consumer | 5478.061 | 144.9578 | 31 |
JD-15895 | Corporate | 7610.864 | 1050.2668 | 32 |
JL-15835 | Consumer | 9799.923 | 228.9070 | 34 |
MA-17560 | Home Office | 4299.161 | 1240.2266 | 34 |
PP-18955 | Home Office | 7252.608 | 1495.0854 | 34 |
SV-20365 | Consumer | 11470.950 | 1199.4242 | 32 |
WB-21850 | Consumer | 6160.102 | 714.3311 | 37 |
ZC-21910 | Consumer | 8025.707 | -1032.1490 | 31 |
plot_top <- ggplot(df_top_buyers, aes(x = Customer.ID, fill = Category)) +
geom_bar(position = "stack",bins = 10) +
labs(x = "Clients", y="Orders") +
theme_minimal() +
theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28), plot.title = element_text(size = 38, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
ggtitle("Top 10 clients according to purchases") +
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))
################################################################################
plot_top_profit <- ggplot(df_top_buyers, aes(x = Customer.ID, y = Profit, fill = Category)) +
geom_bar(stat="identity", position = "stack", bins = 10) +
labs(x = "Clients", y="Profit") +
theme_minimal() +
theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
plot.title = element_text(size = 38, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
ggtitle("Top 10 clients according to purchases") +
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))+
scale_y_continuous(label = scales::dollar)+
geom_hline(yintercept=0, linetype="dashed")
################################################################################
df_profit <- df %>% group_by(Customer.ID) %>% summarise(prof = sum(Profit))
df_profit <- df_profit[order(df_profit$prof),]
top_profit_neg <- head(df_profit$Customer.ID, 10)
top_profit_pos <- tail(df_profit$Customer.ID, 10)
df_top_pos <- df %>% filter(Customer.ID %in% top_profit_pos)
segment_prof_pos <- df_top_pos %>% group_by(Customer.ID, Segment) %>%
summarize(Sales = sum(Sales), Profit = sum(Profit), Number.Orders = n())
segment_prof_pos[1:10,] %>%
kable() %>%
kable_styling(full_width = F)
Customer.ID | Segment | Sales | Profit | Number.Orders |
---|---|---|---|---|
AB-10105 | Consumer | 14473.571 | 5444.806 | 20 |
AR-10540 | Consumer | 6608.448 | 2884.621 | 9 |
CM-12385 | Consumer | 8954.020 | 3899.890 | 10 |
DR-12940 | Home Office | 8350.868 | 2869.076 | 13 |
HL-15040 | Consumer | 12873.298 | 5622.429 | 11 |
KD-16495 | Corporate | 8181.256 | 3038.625 | 28 |
RB-19360 | Consumer | 15117.339 | 6976.096 | 18 |
SC-20095 | Consumer | 14142.334 | 5757.412 | 22 |
TA-21385 | Home Office | 14595.620 | 4703.788 | 10 |
TC-20980 | Corporate | 19052.218 | 8981.324 | 12 |
plot_top_pos <- ggplot(df_top_pos, aes(x = Customer.ID, y = Profit, fill = Category)) +
geom_bar(stat="identity", position = "stack", bins = 10) +
labs(x = "Clients", y="Profit") +
theme_minimal() +
theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
plot.title = element_text(size = 38, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
ggtitle("Top 10 clients with positive profit") +
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))+
scale_y_continuous(label = scales::dollar)+
geom_hline(yintercept=0, linetype="dashed")
################################################################################
df_top_neg <- df %>% filter(Customer.ID %in% top_profit_neg)
segment_prof_neg <- df_top_neg %>% group_by(Customer.ID, Segment) %>%
summarize(Sales = sum(Sales), Profit = sum(Profit), Number.Orders = n())
segment_prof_neg[1:10,] %>%
kable() %>%
kable_styling(full_width = F)
Customer.ID | Segment | Sales | Profit | Number.Orders |
---|---|---|---|---|
CP-12340 | Corporate | 5888.275 | -1850.303 | 15 |
CS-12505 | Consumer | 5690.055 | -6626.390 | 9 |
GT-14635 | Corporate | 9351.212 | -4108.659 | 6 |
HG-14965 | Corporate | 3247.642 | -2797.963 | 17 |
LF-17185 | Consumer | 3930.509 | -3583.977 | 16 |
NC-18415 | Consumer | 2218.990 | -2204.807 | 14 |
NF-18385 | Consumer | 8322.826 | -1695.971 | 14 |
SB-20290 | Corporate | 8057.891 | -2082.745 | 17 |
SM-20320 | Home Office | 25043.050 | -1980.739 | 15 |
SR-20425 | Home Office | 3233.481 | -3333.914 | 9 |
plot_top_neg <- ggplot(df_top_neg, aes(x = Customer.ID, y = Profit, fill = Category)) +
geom_bar(stat = "identity", position = "stack", bins = 10) +
labs(x = "Clients", y="Profit") +
theme_minimal() +
theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
plot.title = element_text(size = 38, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
ggtitle("Top 10 clients with negative profit") +
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))+
scale_y_continuous(label = scales::dollar)+
geom_hline(yintercept=0, linetype="dashed")
################################################################################
df_sales <- df %>% group_by(Customer.ID) %>% summarise(sales = sum(Sales))
df_sales <- df_sales[order(df_sales$sales),]
top_sales <- tail(df_sales$Customer.ID, 10)
df_top_sales <- df %>% filter(Customer.ID %in% top_sales)
segment_sales <- df_top_sales %>% group_by(Customer.ID, Segment) %>%
summarize(Sales = sum(Sales), Profit = sum(Profit), Number.Orders = n())
segment_sales[1:10,] %>%
kable() %>%
kable_styling(full_width = F)
Customer.ID | Segment | Sales | Profit | Number.Orders |
---|---|---|---|---|
EH-13765 | Corporate | 10310.880 | 1393.5154 | 32 |
HG-15025 | Consumer | 3690.284 | 804.6113 | 11 |
JF-15565 | Consumer | 4198.332 | 1073.2088 | 16 |
KM-16225 | Corporate | 2260.958 | 635.1057 | 19 |
MR-17545 | Home Office | 639.178 | 162.9363 | 6 |
RH-19510 | Home Office | 6979.180 | 1289.4535 | 12 |
SB-20290 | Corporate | 8057.891 | -2082.7451 | 17 |
SM-20005 | Consumer | 244.494 | -26.5927 | 6 |
TB-21355 | Corporate | 834.328 | 268.9670 | 13 |
TS-21205 | Corporate | 5371.090 | 862.8877 | 19 |
plot_sales <- ggplot(df_top_sales, aes(x = Customer.ID, y = Sales, fill = Category)) +
geom_bar(stat="identity", position = "stack", bins = 10) +
labs(x = "Clients", y="Sales") +
theme_minimal() +
theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
plot.title = element_text(size = 38, hjust = 0.5),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
ggtitle("Top 10 clients according to sales") +
scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))+
scale_y_continuous(label = scales::dollar)
################################################################################
require(gridExtra)
grid.arrange(plot_top, plot_top_profit, plot_top_pos, plot_top_neg, plot_sales, ncol = 2, nrow = 3)
In this section we intend to analyze how the different continuous variables behave with respect to Profit. We will look at the distribution of Profit with respect to Sales, Time, Month and Quantity, and also plot Quantity vs Discount. This section is therefore heavy on scatter plots, but their analysis is fundamental to finding important relations between the continuous variables.
But first, let's start by looking at how profit and sales change over time. The two plots on the left of the next figure show how profit and sales tend to increase over time. The two plots on the right show a moving average over the last three months, which makes the cycles in sales and profit much easier to see: they start low at the beginning of each year and keep increasing until November and December, after which they drop again.
by_yearmonth <- df %>% group_by(Order.Year_month)
sales_year_tot <- by_yearmonth %>% summarize(Sales_year_tot=sum(Sales))
plot1 <- ggplot(sales_year_tot, aes(x = Order.Year_month, y = Sales_year_tot)) +
geom_line(size = 1.5, color = "lightgrey") + geom_smooth() +
geom_point(size = 6, color = "#80b1d3") + theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 16, hjust = 0.5)) +
geom_point(size = 5, color = "#80b1d3") + labs(y = "Total sales", x = "Time")+
scale_y_continuous(label = scales::dollar)
################################################################################
df_fullyear <- df %>% group_by(Order.Year_month) %>%
summarise(m2 = mean(Sales), mx2 = max(Sales), mn2 = min(Sales) ,su2 = sum(Sales))
df_fullyear$Moving_average <- rollapply(df_fullyear$su2, 3, mean, align = "right", fill = NA)
df_fullyear[1,"Moving_average"] = df_fullyear[1,"su2"]
df_fullyear[2,"Moving_average"] = sum(df_fullyear[1,"su2"], df_fullyear[2,"su2"])/2
#since the moving average uses the past 3 months, the first two rows would be NA;
# the two lines above fill them with the average of the months available so far
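# A more concise alternative to the two manual fixes above, assuming the
# installed zoo version supports the 'partial' argument of rollapply
# (it averages over however many months are available at the edges):
# df_fullyear$Moving_average <- rollapply(df_fullyear$su2, 3, mean,
#                                         align = "right", partial = TRUE)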
plot2 <- ggplot(df_fullyear, aes(x = Order.Year_month, y = Moving_average)) +
geom_line(size = 1.5, color = "lightgrey") + geom_smooth() +
geom_point(size = 6, color = "#80b1d3") + theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 16, hjust = 0.5)) +
labs(y = "Avg. sales of last 3 months", x = "Time") +
scale_y_continuous(label = scales::dollar)
################################################################################
by_yearmonth <- df %>% group_by(Order.Year_month)
profit_year_tot <- by_yearmonth %>% summarize(Profit_year_tot=sum(Profit))
plot3 <- ggplot(profit_year_tot, aes(x = Order.Year_month, y = Profit_year_tot)) +
geom_line(size = 1.5, color = "lightgrey") + geom_smooth() +
geom_point(size = 6, color = "#80b1d3") + theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 16, hjust = 0.5)) +
labs(y = "Total profit", x = "Time")+
scale_y_continuous(label = scales::dollar)
################################################################################
df_fullyear <- df %>% group_by(Order.Year_month) %>%
summarise(m2 = mean(Profit), mx2 = max(Profit), mn2 = min(Profit) ,su2 = sum(Profit))
df_fullyear$Moving_average <- rollapply(df_fullyear$su2, 3, mean, align = "right", fill = NA)
df_fullyear[1,"Moving_average"] = df_fullyear[1,"su2"]
df_fullyear[2,"Moving_average"] = sum(df_fullyear[1,"su2"], df_fullyear[2,"su2"])/2
#since the moving average uses the past 3 months, the first two rows would be NA;
# the two lines above fill them with the average of the months available so far
plot4 <- ggplot(df_fullyear, aes(x = Order.Year_month, y = Moving_average)) +
geom_line(size = 1.5, color = "lightgrey") + geom_smooth() +
geom_point(size = 6, color = "#80b1d3") + theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28),
plot.title = element_text(size = 16, hjust = 0.5)) +
labs(y = "Avg. profit of last 3 months", x = "Time")+
scale_y_continuous(label = scales::dollar)
################################################################################
require(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, ncol = 2, nrow = 2)
The Profit vs Sales distribution has a fan shape, indicating that more sales do not necessarily imply more profit; sometimes the profit even tends to decrease. This behavior is observed for all three categories.
ggplot(df, aes(x = Sales, y = Profit, color=Category)) +
geom_point(size = 2, alpha=.8) +
scale_x_continuous(label = scales::dollar) +
scale_y_continuous(label = scales::dollar) +
labs(x = "Sales", y = "Profit")+
theme_minimal() + theme(axis.text=element_text(size=6),
axis.title.x = element_text(size = 10), axis.title.y = element_text(size = 10),
legend.text = element_text(size = 6), legend.title = element_text(size=10))
If we now analyze the same distribution independently for each sub-category, we see that some still preserve the fan shape, while others such as Art, Copiers, Envelopes, Paper and Labels show a profit that always increases with sales. Furthermore, when coloring by discount it is possible to discern for which discount ranges the profit decreases as sales increase. Based on this, a number of suggestions can be made to the store.
# Cut width of Discount
df_disc <- df %>% mutate(Discount=cut_width(Discount,0.10,boundary=0))
df1 <- df_disc
plot1 <- ggplot(df1, aes(x = Sales, y = Profit, color=Discount)) +
geom_point(size = 5, alpha=.8) +
facet_wrap(~Sub.Category, scales = "free") +
scale_x_continuous(label = scales::dollar) +
scale_y_continuous(label = scales::dollar) +
labs(x = "Sales", y = "Profit") + theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.text.x = element_text(angle=45),
axis.title.y = element_text(size = 28), strip.text = element_text(size = 28),
legend.text = element_text(size = 24), legend.title = element_text(size=28))
require(gridExtra)
grid.arrange(plot1, ncol = 1, nrow = 1)
The Profit vs Time analysis shows there is, in general, a positive relation for both orders and profit. For orders we can also see a cycle: during the first two months of the year orders are at their lowest level, then they rise to a level that is kept for the following seven months, and finally they increase again during the last three months of the year, with November being the peak month every year. Profit, however, does not show this cycle as clearly.
# Profit over time
df_plot <- df %>% group_by(Order.Year_month)
df_plot <- df_plot %>% summarize(profitt = sum(Profit))
plot1 <- ggplot(df_plot, aes(x = Order.Year_month, y = profitt)) +
geom_point(color="#80b1d3", size = 6, alpha=.8) +
scale_y_continuous(label = scales::dollar) +
labs(x = "Time", y = "Total profit")+
geom_smooth(method = "lm") + theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28), plot.title = element_text(size = 38,
hjust = 0.5), legend.text = element_text(size = 24),
legend.title = element_text(size=28)) + stat_cor(method = "pearson", size = 10)
######################################################################
# Orders over time
df_plot <- df %>% group_by(Order.Year_month)
df_plot <- df_plot %>% summarize(orders = n())
plot2 <- ggplot(df_plot, aes(x = Order.Year_month, y = orders)) +
geom_point(color="#80b1d3", size = 6, alpha=.8) +
scale_y_continuous(label = scales::comma) +
labs(x = "Time", y = "Total orders")+
geom_smooth(method = "lm") + theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28), plot.title = element_text(size = 38,
hjust = 0.5), legend.text = element_text(size = 24),
legend.title = element_text(size=28)) + stat_cor(method = "pearson", size = 10)
#######################################################################
require(gridExtra)
grid.arrange(plot1, plot2, ncol = 2, nrow = 1)
If we take this analysis to each sub-category, it can be seen that Supplies, Machines and Tables show, in general, a decrease in profit. In some sub-categories it is possible to sense the cycle discussed before, as is the case for Appliances, Fasteners and Phones.
#BY SUBCATEGORIES
df_plot_SC <- df %>% group_by(Sub.Category, Order.Year_month)
df_plot_SC <- df_plot_SC %>% summarize(profit = sum(Profit))
ggplot(df_plot_SC, aes(x = Order.Year_month, y = profit)) +
geom_point(color="#80b1d3", size = 5, alpha=.8) +
facet_wrap(~Sub.Category, scales = "free") +
scale_y_continuous(label = scales::dollar) +
#geom_hline(yintercept=0, color="#da6474", size=0.5)+
labs(x = "Time", y = "Profit")+
geom_smooth(method = "lm")+
theme_minimal()+ theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
strip.text = element_text(size = 28),
legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
stat_cor(method = "pearson", size = 7)
#BY STATE
df_plot_SC <- df %>% group_by(State, Order.Year_month)
df_plot_SC <- df_plot_SC %>% summarize(profit = sum(Profit))
ggplot(df_plot_SC, aes(x = Order.Year_month, y = profit)) +
geom_point(color="#80b1d3", size = 5, alpha=.8) +
facet_wrap(~State, scales = "free") +
scale_y_continuous(label = scales::dollar) +
#geom_hline(yintercept=0, color="#da6474", size=0.5)+
labs(x = "Time", y = "Profit")+
geom_smooth(method = "lm")+
theme_minimal() + theme(axis.text=element_text(size=24),
axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
strip.text = element_text(size = 28), axis.text.x = element_text(angle=45),
legend.text = element_text(size = 24), legend.title = element_text(size=28))
Grouping by state we can observe that some states do not have enough data to draw conclusions from (DC, North Dakota, West Virginia), others show a decreasing behavior that should be paid attention to (Arkansas, Iowa, Minnesota) and others show a healthy increase of profit over time (California, New York).
This analysis helps confirm the cycle observed before: the first months of the year come with lower profits, which increase towards the end of the year. This is indeed what is observed for all sub-categories except for Machines and possibly Supplies and Tables. Most states show this behavior as well, except for Oregon, Tennessee and Georgia. Knowing this could help the store address the clients of each state better, for instance by deciding when to launch offer and discount campaigns.
#BY SUBCATEGORIES
df_plot_SC <- df %>% group_by(Sub.Category, Order.Month)
df_plot_SC <- df_plot_SC %>% summarize(profit = sum(Profit))
ggplot(df_plot_SC, aes(x = Order.Month, y = profit)) +
geom_point(color="#80b1d3", size = 5, alpha=.8) +
facet_wrap(~Sub.Category, scale = "free") +
scale_y_continuous(label = scales::dollar) +
labs(x = "Month", y = "Profit") +
geom_smooth(method = "lm") + theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28), strip.text = element_text(size = 28))+
stat_cor(method = "pearson", size = 7)+
scale_x_discrete(limits= seq(1,12), breaks = function(x){x[c(TRUE, FALSE)]})
#BY STATE
df_plot_SC <- df %>% group_by(State, Order.Month)
df_plot_SC <- df_plot_SC %>% summarize(profit = sum(Profit))
ggplot(df_plot_SC, aes(x = Order.Month, y = profit)) +
geom_point(color="#80b1d3", size = 5, alpha=.8) +
facet_wrap(~State, scale = "free") +
scale_y_continuous(label = scales::dollar) +
labs(x = "Month", y = "Profit") +
geom_smooth(method = "lm") + theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28), strip.text = element_text(size = 28))+
scale_x_discrete(limits= seq(1,12), breaks = function(x){x[c(TRUE, FALSE)]})
This is also an interesting relation to look at. For most sub-categories, an increase in the number of items bought (of the same product) leads to a decrease in profit. One reason could be that buyers take advantage of discounts to buy in bigger quantities, so the profit would be lower than for single-item orders without a discount; or simply the store offers bigger discounts when the quantity bought is larger. The next section will try to answer this.
#BY SUBCATEGORIES
df_plot_SC <- df %>% group_by(Sub.Category, Quantity)
df_plot_SC <- df_plot_SC %>% summarize(profit = sum(Profit))
ggplot(df_plot_SC, aes(x = Quantity, y = profit)) +
geom_point(color="#80b1d3", size = 5, alpha=.8) +
facet_wrap(~Sub.Category, scale = "free") +
scale_y_continuous(label = scales::dollar) +
#geom_hline(yintercept=0, color="#da6474", size=0.5)+
labs(x = "Quantity", y = "Profit") +
geom_smooth(method = "lm") + theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28), strip.text = element_text(size = 28)) + stat_cor(method = "pearson", size = 7)
There is not enough data to draw firm conclusions, since the number of distinct discount values is rather small. However, for Binders and Bookcases bigger discounts come with bigger quantities, while for Tables the opposite behavior seems to be the case.
#BY SUBCATEGORIES
df_plot_SC <- df %>% group_by(Sub.Category, Discount)
df_plot_SC <- df_plot_SC %>% summarize(quantity = mean(Quantity))
ggplot(df_plot_SC, aes(x = Discount, y = quantity)) +
geom_point(color="#80b1d3", size = 5, alpha=.8) +
facet_wrap(~Sub.Category, scale = "free") +
labs(x = "Discount", y = "Quantity") +
geom_smooth(method = "lm") + theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28), strip.text = element_text(size = 28)) +
stat_cor(method = "pearson", size = 7)
Finally, we can see that, broadly speaking, two customer segments show a positive relation between discount and quantity: Corporate and Home Office. Either these customers obtain discounts when they buy more items, or they buy bigger quantities because the items are discounted.
#BY SEGMENT
df_plot_SC <- df %>% group_by(Segment, Discount)
df_plot_SC <- df_plot_SC %>% summarize(quantity = mean(Quantity))
ggplot(df_plot_SC, aes(x = Discount, y = quantity)) +
geom_point(color="#80b1d3", size = 6, alpha=.8) +
facet_wrap(~Segment, scale = "free") +
labs(x = "Discount", y = "Quantity") +
geom_smooth(method = "lm") + theme_minimal() +
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
axis.title.y = element_text(size = 28), strip.text = element_text(size = 28)) + stat_cor(method = "pearson", size = 7)
The correlation matrix can be seen below. The strongest correlation is between Gross Margin and Discount, and it is negative. This makes sense: gross margin is proportional to profit and inversely proportional to sales, and the bigger the discount, the lower the profit. The same reasoning explains the negative correlation between Discount and Profit. Sales and Profit have a positive correlation of 0.48, indicating that profit tends to increase with sales in general.
Also noticeable are the positive correlations between Profit and Gross Margin and between Quantity and Sales: the more items of a certain product are bought, the more sales it generates.
dat <- df[,c("Sales","Profit","Gross.Margin" , "Discount", "Quantity", "Processing.Time", "Order.Year")]
colnames(dat) <- c("Sales", "Profit","Gross.Margin", "Discount", "Quantity", "Processing.Time", "Year")
df_corr <- dplyr::select_if(dat, is.numeric)
corr_coef <- cor(df_corr, use="complete.obs")
round(corr_coef,2)
## Sales Profit Gross.Margin Discount Quantity Processing.Time
## Sales 1.00 0.48 0.00 -0.03 0.20 -0.01
## Profit 0.48 1.00 0.22 -0.22 0.07 0.00
## Gross.Margin 0.00 0.22 1.00 -0.86 -0.01 -0.01
## Discount -0.03 -0.22 -0.86 1.00 0.01 0.00
## Quantity 0.20 0.07 -0.01 0.01 1.00 0.02
## Processing.Time -0.01 0.00 -0.01 0.00 0.02 1.00
## Year -0.01 0.00 0.00 0.00 -0.01 -0.02
## Year
## Sales -0.01
## Profit 0.00
## Gross.Margin 0.00
## Discount 0.00
## Quantity -0.01
## Processing.Time -0.02
## Year 1.00
ggcorrplot(corr_coef, hc.order = TRUE, type = "lower", lab = TRUE)
We also use the lares library to output the 8 strongest correlations with a p-value < 0.05. The results agree with what was explained before.
dat <- df[, c("Sales","Profit","Gross.Margin" , "Discount", "Quantity", "Processing.Time", "Order.Year")]
colnames(dat) <- c("Sales", "Profit","Gross.Margin", "Disc.", "Quantity", "Processing.Time", "Year")
corr_cross(dat, max_pvalue = 0.05, top = 8, color=c("#da6474", "#80b1d3"))
In this section we will address a simple linear regression problem before we get into a deeper analysis with multiple linear regression.
The original idea was to perform linear regression of Profit on Sales in general. However, these two features are not very correlated and the shape of their joint distribution clearly indicates there is no linear relation. Based on the EDA, there does seem to be a good linear relation between them within sub-categories such as Envelopes, Copiers, Paper, Art and Labels. We have chosen the Labels sub-category to model with simple linear regression.
First, we do a train-test split with a train set size of 80%.
df_labels <- df %>% filter(Sub.Category == 'Labels')
row.number <- sample(1:nrow(df_labels), 0.8*nrow(df_labels))
train = df_labels[row.number,]
test = df_labels[-row.number,]
dim(train)
## [1] 291 22
dim(test)
## [1] 73 22
Proceed to fit the model and print summary information.
lm.fit <- lm(Profit ~ Sales, data = train)
summary(lm.fit)
##
## Call:
## lm(formula = Profit ~ Sales, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.483 -0.424 0.483 0.870 29.581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.365057 0.295019 -1.237 0.217
## Sales 0.452852 0.003638 124.465 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.604 on 289 degrees of freedom
## Multiple R-squared: 0.9817, Adjusted R-squared: 0.9816
## F-statistic: 1.549e+04 on 1 and 289 DF, p-value: < 2.2e-16
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) -0.9457144 0.2156012
## Sales 0.4456910 0.4600133
The \(R^2\) equals 0.98, which means the independent variable is able to explain most of the variance in the dependent variable. Also, the F-statistic is much bigger than one and its p-value is very close to zero, meaning there is indeed evidence of a relation between both variables. The slope estimate of about 0.45 gives us the expected change in Profit for a unit change in Sales.
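In equation form, the fitted line reported above is approximately:
\[\widehat{Profit} \approx -0.37 + 0.45 \cdot Sales\]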
A plot of the fit can be seen next:
ggplot(train, aes(x = Sales, y = Profit)) +
geom_point(color="#80b1d3", size = 2, alpha=.9) +
scale_y_continuous(label = scales::dollar) +
scale_x_continuous(label = scales::dollar) +
labs(x = "Sales", y = "Profit") + theme_minimal() +
theme(axis.text=element_text(size=8), axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12)) + geom_abline(slope = coef(lm.fit)[["Sales"]],
intercept = coef(lm.fit)[["(Intercept)"]])
Let’s now have a look at the diagnostic plots.
par(mfrow = c(2, 2))
plot(lm.fit)
Fitted vs Residual
Residual plots are useful to tell whether linearity and homoskedasticity hold. For linearity we expect the red line to stay close to zero, while for homoskedasticity the spread of the residuals should be approximately the same across the x axis. In the plot above, the red trend starts very close to zero but then diverges from it, and there is a visible pattern in the residuals: as the fitted values increase, some residuals grow while others shrink. So neither linearity nor homoskedasticity is really met. There are also three potential outliers.
Normal Q-Q
In this plot we expect to see all points close to the dotted line. If this is not the case, the residuals, and therefore the errors, are not Gaussian. For small sample sizes it then cannot be assumed that the estimator \(\hat{\beta}\) is Gaussian either, meaning the standard confidence intervals and significance tests are invalid. Our plot shows most points lying on the dotted line, but the rightmost and leftmost points deviate from it, suggesting skewness in both tails.
Scale-Location
This kind of plot is ideal for checking homoskedasticity. We expect the red line to be approximately horizontal; since this is not the case for our model, the average magnitude of the standardized residuals changes considerably as a function of the fitted values. The spread around the red line should also not vary with the fitted values, but again this is happening in our plot. Therefore, we have further evidence of heteroskedasticity.
Residuals vs Leverage
Leverage measures how sensitive a fitted value \(\hat{y_{i}}\) is to a change in the true response \(y_{i}\). In the Residuals vs Leverage plot we expect the spread of standardized residuals not to change as a function of leverage; in our plot, however, it increases, which is another sign of heteroskedasticity. Finally, points beyond the Cook's distance have a large leverage, and deleting them would have a big influence on the fit.
From the previous analysis we conclude that our data does not behave as linearly as we expected and is not homoskedastic. We could tackle these problems with transformations of the response, such as taking its logarithm or square root, but in this case the response takes both positive and negative values, which would make the transformation awkward and could actually introduce more non-linearity and heteroskedasticity. We therefore cannot assume that the standard errors of the intercept and of Sales are valid, given that the aforementioned assumptions are not met.
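These visual impressions can be complemented with formal tests. The following is a minimal sketch (not part of the original analysis) using the already loaded car package and base R; small p-values would point to non-constant variance and non-normal residuals, respectively.
# Score test for non-constant error variance (heteroskedasticity)
ncvTest(lm.fit)
# Shapiro-Wilk test for normality of the residuals
shapiro.test(residuals(lm.fit))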
To assess how the model generalizes, we predict the values of the test set and calculate the RMSE, which measures how much, on average, the predicted values differ from the actual values. For a more meaningful comparison we also calculate the normalized RMSE.
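For reference, the two quantities computed below are
\[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_{i}-y_{i}\right)^{2}}, \qquad NRMSE = 100\cdot\frac{RMSE}{\max(y_{train})-\min(y_{train})}\ \%\]
where the normalization uses the range of the response in the training set, as in the code.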
pred <- predict(lm.fit, newdata = test)
rmse <- sqrt(sum((pred - test$Profit)^2)/length(test$Profit))
normalized_rmse <- 100* rmse/(max(train$Profit)-min(train$Profit))
cat(paste("RMSE =", rmse))
## RMSE = 3.95568306022249
cat(paste("\nNormalized RMSE =", normalized_rmse, "%"))
##
## Normalized RMSE = 1.02826046817739 %
The normalized RMSE is low enough to consider the fit good.
In this section we will train a regression model using multiple features. As part of the process we will select the most important features for the model using backward selection and also check how well it performs on a test set.
First, we separate our data in train and test sets, with a ratio of 80%-20% respectively.
set.seed(1)
row.number <- sample(1:nrow(df), 0.8*nrow(df))
train = df[row.number,]
test = df[-row.number,]
dim(train)
## [1] 7995 22
dim(test)
## [1] 1999 22
Next we build a first model, taking into account 9 features that could potentially have a good impact on the response, and using 5-fold cross-validation to obtain a more reliable estimate of its performance.
lm_mult = train(
form = Profit ~ Sales + Quantity + Segment + Discount + Processing.Time + Ship.Mode + Region + Sub.Category + Gross.Margin,
data = train,
trControl = trainControl(method = "cv", number = 5),
method = "lm"
)
lm_mult
## Linear Regression
##
## 7995 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 6397, 6395, 6395, 6396, 6397
## Resampling results:
##
## RMSE Rsquared MAE
## 203.5605 0.3422689 58.7417
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(lm_mult)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7006.0 -24.3 -1.0 28.7 4567.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.882e+01 1.285e+01 3.799 0.000146 ***
## Sales 1.897e-01 3.838e-03 49.437 < 2e-16 ***
## Quantity -2.643e+00 1.002e+00 -2.639 0.008338 **
## SegmentCorporate 3.606e+00 4.980e+00 0.724 0.469041
## `SegmentHome Office` -4.025e-01 6.002e+00 -0.067 0.946532
## Discount -1.608e+02 2.552e+01 -6.301 3.12e-10 ***
## Processing.Time -8.892e-01 2.208e+00 -0.403 0.687218
## `Ship.ModeSame Day` -8.838e+00 1.189e+01 -0.743 0.457207
## `Ship.ModeSecond Class` -1.196e+01 7.796e+00 -1.534 0.124972
## `Ship.ModeStandard Class` -4.015e+00 8.820e+00 -0.455 0.648992
## RegionEast -2.073e+01 6.272e+00 -3.305 0.000954 ***
## RegionSouth -2.883e+01 7.237e+00 -3.984 6.83e-05 ***
## RegionWest -2.595e+01 6.215e+00 -4.176 3.00e-05 ***
## Sub.CategoryAppliances 1.415e+01 1.298e+01 1.091 0.275423
## Sub.CategoryArt -1.201e+01 1.102e+01 -1.090 0.275938
## Sub.CategoryBinders 5.806e+01 1.045e+01 5.558 2.82e-08 ***
## Sub.CategoryBookcases -7.888e+01 1.659e+01 -4.756 2.01e-06 ***
## Sub.CategoryChairs -4.639e+01 1.198e+01 -3.874 0.000108 ***
## Sub.CategoryCopiers 4.551e+02 2.872e+01 15.846 < 2e-16 ***
## Sub.CategoryEnvelopes -8.144e+00 1.607e+01 -0.507 0.612299
## Sub.CategoryFasteners -1.427e+01 1.708e+01 -0.835 0.403522
## Sub.CategoryFurnishings -3.247e+00 1.056e+01 -0.307 0.758492
## Sub.CategoryLabels -1.601e+01 1.389e+01 -1.153 0.249085
## Sub.CategoryMachines -2.551e+02 2.253e+01 -11.324 < 2e-16 ***
## Sub.CategoryPaper -1.081e+01 1.006e+01 -1.074 0.282779
## Sub.CategoryPhones -1.315e+01 1.083e+01 -1.214 0.224659
## Sub.CategoryStorage -3.016e+01 1.091e+01 -2.765 0.005701 **
## Sub.CategorySupplies -6.184e+01 1.738e+01 -3.558 0.000376 ***
## Sub.CategoryTables -1.473e+02 1.489e+01 -9.893 < 2e-16 ***
## Gross.Margin 5.774e-01 1.100e-01 5.250 1.56e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 194.8 on 7965 degrees of freedom
## Multiple R-squared: 0.3634, Adjusted R-squared: 0.3611
## F-statistic: 156.8 on 29 and 7965 DF, p-value: < 2.2e-16
To test the relationship between predictor and response variables we will use hypothesis testing.
\[H_{0} : \beta_{1} = \beta_{2} = ··· = \beta_{10} = 0\] \[H_{a} \text{: at least one } \beta_{j} \text{ is non-zero.}\]
The F-statistic of 156.8 (>> 1) together with a p-value of nearly zero gives clear evidence against the null hypothesis: at least one of the selected features must be related to Profit.
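For reference, the F-statistic reported in the summary above compares the full model against the intercept-only model,
\[F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}\]
where \(TSS\) is the total sum of squares, \(RSS\) the residual sum of squares, \(p\) the number of predictors and \(n\) the number of observations.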
Judging by the p-values of the predictor variables we can determine which ones are more significant: the smaller the p-value, the more significant the variable. Features such as Processing.Time, Ship.Mode and Segment are not significant for our model.
The multiple \(R^2\) indicates how much variation is captured by the model. Our value of 0.36 is rather low, so the amount of variance in the dependent variable that the independent variables explain collectively is not very large.
The \(R^2\) tends to estimate the fit of the regression optimistically and always increases when more predictors are added. The adjusted \(R^2\), on the other hand, only increases when the added predictors improve the model more than would be expected by chance. We will therefore use the adjusted \(R^2\) as the criterion of goodness of fit. In the model trained above, the adjusted \(R^2\) is very close to the standard \(R^2\).
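For completeness, the adjusted \(R^2\) penalizes the number of predictors \(p\) relative to the number of observations \(n\):
\[\bar{R}^{2} = 1 - \left(1 - R^{2}\right)\frac{n-1}{n-p-1}\]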
To check the multicollinearity in our data we will look at the Variance Inflation Factors (VIF):
vif(lm_mult$finalModel)
## Sales Quantity SegmentCorporate
## 1.345397 1.056995 1.110577
## `SegmentHome Office` Discount Processing.Time
## 1.108745 5.713338 3.142746
## `Ship.ModeSame Day` `Ship.ModeSecond Class` `Ship.ModeStandard Class`
## 1.521895 2.002444 3.942991
## RegionEast RegionSouth RegionWest
## 1.699253 1.482735 1.771716
## Sub.CategoryAppliances Sub.CategoryArt Sub.CategoryBinders
## 1.578024 1.878397 2.974738
## Sub.CategoryBookcases Sub.CategoryChairs Sub.CategoryCopiers
## 1.296510 1.712193 1.208971
## Sub.CategoryEnvelopes Sub.CategoryFasteners Sub.CategoryFurnishings
## 1.320397 1.249844 2.039619
## Sub.CategoryLabels Sub.CategoryMachines Sub.CategoryPaper
## 1.467097 1.269015 2.518917
## Sub.CategoryPhones Sub.CategoryStorage Sub.CategorySupplies
## 1.981297 1.968884 1.240843
## Sub.CategoryTables Gross.Margin
## 1.425779 5.420298
The VIFs for Discount and Gross Margin are already moderately high. Since Gross Margin was calculated from profit and sales, we choose to keep Discount, as it could provide more relevant and unique information.
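Recall that the VIF of predictor \(j\) is obtained from the \(R_{j}^{2}\) of regressing that predictor on all the other predictors, so values much larger than 1 signal redundancy:
\[VIF_{j} = \frac{1}{1 - R_{j}^{2}}\]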
Let’s now have a look at the diagnostic plots.
model = lm(Profit ~ Sales + Quantity + Segment + Discount + Processing.Time + Ship.Mode + Region + Sub.Category, data = train)
par(mfrow=c(2,2))
plot(model)
Fitted vs Residual graph
The red line is close to zero (except slightly for the rightmost values), which means the assumption of linearity roughly holds. However, the spread of the residuals is not approximately the same across the x axis, meaning there is heteroskedasticity.
Normal Q-Q Plot
Most of the points lie on the line as expected, except for those at both ends (left and right tails). There is some room for improvement here. Nevertheless, this does not represent a big problem, since given our sample size the Central Limit Theorem can be invoked.
Scale-Location
The red line is far from horizontal, showing the presence of heteroskedasticity.
Residuals vs Leverage
The spread of standardized residuals changes as a function of leverage, being more spread for high leverage points. This is another proof of heteroskedasticity.
Let’s also try to perform feature selection using Lasso regression. For that, we are going to take the initial model and determine which features are less important for it.
profit <- train$Profit
predictors <- data.matrix(train[, c("Sales", "Quantity", "Segment", "Discount", "Processing.Time", "Ship.Mode", "Region", "Sub.Category", "Gross.Margin")])
First, perform k-fold cross-validation to find optimal lambda value. The parameter alpha = 1 corresponds to lasso regression while alpha = 0 corresponds to ridge regression.
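For reference, for a Gaussian response glmnet minimizes the elastic-net objective
\[\min_{\beta_{0},\beta}\;\frac{1}{2n}\sum_{i=1}^{n}\left(y_{i}-\beta_{0}-x_{i}^{T}\beta\right)^{2}+\lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_{2}^{2}+\alpha\lVert\beta\rVert_{1}\right]\]
so \(\alpha = 1\) keeps only the \(\ell_{1}\) (lasso) penalty and \(\alpha = 0\) only the \(\ell_{2}\) (ridge) penalty.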
cv_model <- cv.glmnet(predictors, profit, alpha = 1)
#find optimal lambda value that minimizes test MSE
best_lambda <- cv_model$lambda.min
best_lambda
## [1] 1.86399
#produce plot of test MSE by lambda value
plot(cv_model)
Once the best lambda is determined we can use it to find the coefficients of the model.
best_model <- glmnet(predictors, profit, alpha = 1, lambda = best_lambda)
coef(best_model)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 44.0016113
## Sales 0.1856149
## Quantity -1.8452988
## Segment .
## Discount -64.1884105
## Processing.Time .
## Ship.Mode .
## Region -6.0120307
## Sub.Category -4.1847286
## Gross.Margin 0.9213551
We observe that the coefficients of Ship.Mode, Processing.Time and Segment are shrunk to zero, so these features are not important for the model; this confirms the results obtained previously.
For the sake of completeness we can also check what results we get with Ridge regression on our initial set of features.
profit <- train$Profit
predictors <- data.matrix(train[, c("Sales", "Quantity", "Segment", "Discount", "Processing.Time", "Ship.Mode", "Region", "Sub.Category", "Gross.Margin")])
As with lasso regression, we find the best value of lambda:
#perform k-fold cross-validation to find optimal lambda value (alpha = 0 is ridge, alpha = 1 is lasso)
cv_model <- cv.glmnet(predictors, profit, alpha = 0)
#find optimal lambda value that minimizes test MSE
best_lambda <- cv_model$lambda.min
best_lambda
## [1] 12.26381
#produce plot of test MSE by lambda value
plot(cv_model)
Find coefficients of the ridge model:
best_model <- glmnet(predictors, profit, alpha = 0, lambda = best_lambda)
coef(best_model)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 56.3400048
## Sales 0.1795266
## Quantity -2.2066568
## Segment 1.5221017
## Discount -87.1680050
## Processing.Time 0.3327554
## Ship.Mode -1.2538208
## Region -7.3759030
## Sub.Category -4.3348805
## Gross.Margin 0.8413572
Here we must look for features whose coefficients are close to zero. Ship.Mode, Processing.Time and Segment are, but so are Gross.Margin and, even more so, Sales. However, before discarding Sales we will check interaction and nonlinear terms.
Next we check some interaction terms that could potentially improve the performance of the model. To decide whether an interaction term should be included, we require that its p-value be lower than 0.05, that its VIF not exceed 3 and that it increase the adjusted \(R^2\). Finally, we test with ANOVA whether including the term results in a better model.
# Interactions to check:
# Sales:Quantity -----> No improvement in the model
# Sales:Discount -------------> Increases a lot the adjusted R squared, good. Vif also okay
# Ship.Mode:Processing.Time --------------- No improvement
# Sales:Sub.Category -------- Adjusted R squared increases but vif factors are not good.
# Sales:Segment -----------------------> Slight increase in R^2 and vif factors, but still okay. Won't take, since doesn't improve R so much.
# Region:Sales ----> Increases a bit the adjusted R squared but also vif factors, so better not to take.
lm_mult.fit <- lm(Profit ~ Sales + Quantity + Segment + Gross.Margin + Processing.Time + Sales:Discount + Ship.Mode + Region + Sub.Category , data = train)
summary(lm_mult.fit)
##
## Call:
## lm(formula = Profit ~ Sales + Quantity + Segment + Gross.Margin +
## Processing.Time + Sales:Discount + Ship.Mode + Region + Sub.Category,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4434.6 -14.6 2.7 26.4 2854.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.887e+00 7.285e+00 1.357 0.17473
## Sales 3.833e-01 2.823e-03 135.787 < 2e-16 ***
## Quantity -5.556e+00 6.047e-01 -9.187 < 2e-16 ***
## SegmentCorporate 6.439e-01 3.004e+00 0.214 0.83028
## SegmentHome Office 4.331e+00 3.621e+00 1.196 0.23176
## Gross.Margin 5.414e-01 3.398e-02 15.933 < 2e-16 ***
## Processing.Time -8.140e-01 1.332e+00 -0.611 0.54118
## Ship.ModeSame Day -5.144e+00 7.171e+00 -0.717 0.47325
## Ship.ModeSecond Class -1.254e+01 4.701e+00 -2.668 0.00764 **
## Ship.ModeStandard Class -3.029e+00 5.321e+00 -0.569 0.56922
## RegionEast -9.965e+00 3.785e+00 -2.633 0.00848 **
## RegionSouth -1.378e+01 4.367e+00 -3.155 0.00161 **
## RegionWest -1.049e+01 3.746e+00 -2.800 0.00512 **
## Sub.CategoryAppliances 3.206e-01 7.795e+00 0.041 0.96720
## Sub.CategoryArt 1.111e+01 6.650e+00 1.670 0.09495 .
## Sub.CategoryBinders 4.246e+01 5.950e+00 7.137 1.04e-12 ***
## Sub.CategoryBookcases -7.083e+01 1.001e+01 -7.079 1.58e-12 ***
## Sub.CategoryChairs -4.353e+01 7.209e+00 -6.038 1.63e-09 ***
## Sub.CategoryCopiers 3.166e+02 1.724e+01 18.365 < 2e-16 ***
## Sub.CategoryEnvelopes 1.144e+01 9.621e+00 1.190 0.23424
## Sub.CategoryFasteners 1.172e+01 1.030e+01 1.138 0.25495
## Sub.CategoryFurnishings 5.641e+00 6.360e+00 0.887 0.37515
## Sub.CategoryLabels 7.911e+00 8.318e+00 0.951 0.34154
## Sub.CategoryMachines -1.017e+02 1.353e+01 -7.516 6.25e-14 ***
## Sub.CategoryPaper 1.009e+01 5.974e+00 1.689 0.09118 .
## Sub.CategoryPhones -9.228e+00 6.503e+00 -1.419 0.15595
## Sub.CategoryStorage -3.661e+01 6.534e+00 -5.603 2.17e-08 ***
## Sub.CategorySupplies -6.427e+01 1.047e+01 -6.140 8.63e-10 ***
## Sub.CategoryTables -1.093e+02 8.950e+00 -12.210 < 2e-16 ***
## Sales:Discount -1.166e+00 9.843e-03 -118.439 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 117.5 on 7965 degrees of freedom
## Multiple R-squared: 0.7683, Adjusted R-squared: 0.7675
## F-statistic: 910.8 on 29 and 7965 DF, p-value: < 2.2e-16
vif(lm_mult.fit)
## GVIF Df GVIF^(1/(2*Df))
## Sales 1.999385 1 1.413996
## Quantity 1.058654 1 1.028909
## Segment 1.008097 2 1.002018
## Gross.Margin 1.421714 1 1.192356
## Processing.Time 3.141738 1 1.772495
## Ship.Mode 3.176082 3 1.212408
## Region 1.110950 3 1.017691
## Sub.Category 1.723923 16 1.017164
## Sales:Discount 1.708649 1 1.307153
Based on the results for the different models we keep the interaction between Sales and Discount: it increases the adjusted \(R^2\) significantly while keeping a good p-value and VIF. After including this term, we also find that using Gross.Margin instead of Discount as a main effect further improves the adjusted \(R^2\), so we use it instead.
We can use an ANOVA test to study whether the inclusion of Sales:Discount results in a better model than the previous one.
The null hypothesis is that the two models fit the data equally well; the alternative hypothesis is that the full model is superior.
lm_mult.fit2 <- lm(Profit ~ Sales + Sales:Discount + Quantity + Segment + Gross.Margin + Processing.Time + Ship.Mode + Region + Sub.Category , data = train)
lm_mult.fit <- lm(Profit ~ Sales + Quantity + Segment + Discount + Processing.Time + Ship.Mode + Region + Sub.Category , data = train)
anova(lm_mult.fit , lm_mult.fit2)
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
7966 | 303294573 | NA | NA | NA | NA |
7965 | 110009295 | 1 | 193285278 | 13994.43 | 0 |
The large F-statistic and near-zero p-value are strong evidence that including the interaction term between Sales and Discount improves the model.
The next step is to check nonlinear terms. As in the previous case, we go over different nonlinear term candidates and base the choice on the p-value, the VIF and the adjusted \(R^2\).
# Quantity^2 -----> Good pvalue but high vif factor and no improvement of R
# Discount^2 ------------> high pvalue
# Poly(Sales, 3) ------------> good pvalue and increase in R, no problem with vif
# Poly(Quantity,5) ------------> Nope, only Quantity^2 p-value is good, and we have seen before vif are high
# log(sales) ------------> adding this term to Sales has a good p value, but no improvement on R. Replacing the Sales by log(Sales) very bad R^2
# sqrt(Sales) ------------> same as with log(sales)
# log(Quantity) ------------> same problem with log(sales) but anova shows is not that good
# sqrt(Quantity) ------------> same as log(quantity) but with pvalue in anova 3%
#sqrt(Gross.Margin) ------------> good p-value, good increase in adjusted R, good vif
lm_mult.fit <- lm(Profit ~ poly(Sales, 3) + Sales:Discount + Segment + Gross.Margin + Processing.Time + Ship.Mode + Region + Sub.Category , data = train)
summary(lm_mult.fit)
##
## Call:
## lm(formula = Profit ~ poly(Sales, 3) + Sales:Discount + Segment +
## Gross.Margin + Processing.Time + Ship.Mode + Region + Sub.Category,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3665.2 -14.0 -1.3 16.1 2158.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.280e+01 6.012e+00 13.772 < 2e-16 ***
## poly(Sales, 3)1 2.354e+04 1.532e+02 153.673 < 2e-16 ***
## poly(Sales, 3)2 6.204e+03 1.185e+02 52.365 < 2e-16 ***
## poly(Sales, 3)3 3.719e+02 1.221e+02 3.045 0.002334 **
## SegmentCorporate -3.400e-01 2.603e+00 -0.131 0.896086
## SegmentHome Office 4.285e+00 3.139e+00 1.365 0.172272
## Gross.Margin 5.318e-01 2.964e-02 17.941 < 2e-16 ***
## Processing.Time -1.347e+00 1.154e+00 -1.167 0.243180
## Ship.ModeSame Day -2.119e+00 6.213e+00 -0.341 0.733124
## Ship.ModeSecond Class -1.086e+01 4.073e+00 -2.666 0.007695 **
## Ship.ModeStandard Class -2.146e+00 4.610e+00 -0.465 0.641642
## RegionEast -7.443e+00 3.280e+00 -2.269 0.023287 *
## RegionSouth -1.287e+01 3.784e+00 -3.402 0.000673 ***
## RegionWest -9.183e+00 3.245e+00 -2.830 0.004672 **
## Sub.CategoryAppliances 3.748e+00 6.755e+00 0.555 0.579030
## Sub.CategoryArt -7.214e+00 5.798e+00 -1.244 0.213408
## Sub.CategoryBinders 3.395e+01 5.171e+00 6.566 5.47e-11 ***
## Sub.CategoryBookcases -2.817e+01 8.721e+00 -3.231 0.001240 **
## Sub.CategoryChairs -6.822e-01 6.322e+00 -0.108 0.914074
## Sub.CategoryCopiers 3.853e+02 1.493e+01 25.801 < 2e-16 ***
## Sub.CategoryEnvelopes -1.812e+00 8.356e+00 -0.217 0.828351
## Sub.CategoryFasteners -1.047e+01 8.951e+00 -1.170 0.242108
## Sub.CategoryFurnishings -4.643e+00 5.523e+00 -0.841 0.400585
## Sub.CategoryLabels -1.059e+01 7.240e+00 -1.463 0.143459
## Sub.CategoryMachines -9.509e-01 1.184e+01 -0.080 0.935989
## Sub.CategoryPaper -5.266e+00 5.212e+00 -1.010 0.312323
## Sub.CategoryPhones 1.390e+01 5.653e+00 2.459 0.013940 *
## Sub.CategoryStorage -3.018e+01 5.665e+00 -5.328 1.02e-07 ***
## Sub.CategorySupplies -6.596e+01 9.075e+00 -7.269 3.98e-13 ***
## Sub.CategoryTables -4.441e+01 7.872e+00 -5.641 1.75e-08 ***
## Sales:Discount -1.379e+00 1.066e-02 -129.351 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 101.8 on 7964 degrees of freedom
## Multiple R-squared: 0.8261, Adjusted R-squared: 0.8254
## F-statistic: 1261 on 30 and 7964 DF, p-value: < 2.2e-16
vif(lm_mult.fit)
## GVIF Df GVIF^(1/(2*Df))
## poly(Sales, 3) 3.664349 3 1.241651
## Segment 1.008911 2 1.002220
## Gross.Margin 1.441088 1 1.200453
## Processing.Time 3.141824 1 1.772519
## Ship.Mode 3.174908 3 1.212333
## Region 1.110712 3 1.017654
## Sub.Category 2.101800 16 1.023484
## Sales:Discount 2.667644 1 1.633292
Including second and third order terms of Sales increases the adjusted \(R^2\). These terms also show low p-values and no multicollinearity problems.
Let's use an ANOVA test to study poly(Sales, 3). The null hypothesis is that the two models fit the data equally well; the alternative hypothesis is that the model including the nonlinear term is superior.
lm_mult.fit2 <- lm(Profit ~ poly(Sales, 3) + Quantity + Segment + Gross.Margin + Processing.Time + Ship.Mode + Region + Sub.Category , data = train)
lm_mult.fit <- lm(Profit ~ Sales + Quantity + Segment + Gross.Margin + Processing.Time + Ship.Mode + Region + Sub.Category , data = train)
anova(lm_mult.fit , lm_mult.fit2)
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
7966 | 303755009 | NA | NA | NA | NA |
7964 | 255905682 | 2 | 47849327 | 744.5557 | 0 |
There is clear evidence that the model including the nonlinearity in Sales performs better than the one that does not.
Based on the previous analysis of lasso and ridge regression, interaction terms and nonlinear terms, let's create a new model with the most important variables and train it using 5-fold cross-validation.
lm_mult = train(
form = Profit ~ poly(Sales,3) + Sales:Discount + Sub.Category + Gross.Margin,
data = train,
trControl = trainControl(method = "cv", number = 5),
method = "lm")
summary(lm_mult)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3672.2 -12.9 -2.8 15.5 2158.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.808e+01 4.151e+00 16.401 < 2e-16 ***
## `poly(Sales, 3)1` 2.355e+04 1.532e+02 153.690 < 2e-16 ***
## `poly(Sales, 3)2` 6.208e+03 1.186e+02 52.351 < 2e-16 ***
## `poly(Sales, 3)3` 3.695e+02 1.222e+02 3.025 0.002496 **
## Sub.CategoryAppliances 2.878e+00 6.755e+00 0.426 0.670130
## Sub.CategoryArt -7.441e+00 5.802e+00 -1.283 0.199703
## Sub.CategoryBinders 3.314e+01 5.168e+00 6.413 1.51e-10 ***
## Sub.CategoryBookcases -2.885e+01 8.719e+00 -3.308 0.000943 ***
## Sub.CategoryChairs -1.019e+00 6.327e+00 -0.161 0.872121
## Sub.CategoryCopiers 3.862e+02 1.495e+01 25.838 < 2e-16 ***
## Sub.CategoryEnvelopes -1.719e+00 8.357e+00 -0.206 0.837063
## Sub.CategoryFasteners -9.356e+00 8.957e+00 -1.045 0.296275
## Sub.CategoryFurnishings -5.155e+00 5.527e+00 -0.933 0.351021
## Sub.CategoryLabels -9.934e+00 7.245e+00 -1.371 0.170370
## Sub.CategoryMachines -9.702e-01 1.184e+01 -0.082 0.934701
## Sub.CategoryPaper -4.545e+00 5.214e+00 -0.872 0.383408
## Sub.CategoryPhones 1.360e+01 5.657e+00 2.404 0.016248 *
## Sub.CategoryStorage -3.049e+01 5.668e+00 -5.379 7.69e-08 ***
## Sub.CategorySupplies -6.734e+01 9.080e+00 -7.416 1.33e-13 ***
## Sub.CategoryTables -4.536e+01 7.873e+00 -5.762 8.62e-09 ***
## Gross.Margin 5.024e-01 2.833e-02 17.735 < 2e-16 ***
## `Sales:Discount` -1.379e+00 1.066e-02 -129.363 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 102 on 7973 degrees of freedom
## Multiple R-squared: 0.8254, Adjusted R-squared: 0.825
## F-statistic: 1795 on 21 and 7973 DF, p-value: < 2.2e-16
The feature Region has finally been dropped because it was found to have a high p-value in the new model. The F-statistic of this model is more than an order of magnitude bigger than that of the initial model, which means that the predictors in our final model are strongly related to the response. The final adjusted \(R^2\) is 0.83, a fairly high value that reflects the power of the model.
Check for multicollinearity:
vif(lm_mult$finalModel)
## `poly(Sales, 3)1` `poly(Sales, 3)2` `poly(Sales, 3)3`
## 2.257828 1.352624 1.435578
## Sub.CategoryAppliances Sub.CategoryArt Sub.CategoryBinders
## 1.561027 1.901363 2.657537
## Sub.CategoryBookcases Sub.CategoryChairs Sub.CategoryCopiers
## 1.307698 1.744446 1.194897
## Sub.CategoryEnvelopes Sub.CategoryFasteners Sub.CategoryFurnishings
## 1.303731 1.254610 2.040373
## Sub.CategoryLabels Sub.CategoryMachines Sub.CategoryPaper
## 1.458071 1.279394 2.467364
## Sub.CategoryPhones Sub.CategoryStorage Sub.CategorySupplies
## 1.973849 1.940746 1.235896
## Sub.CategoryTables Gross.Margin `Sales:Discount`
## 1.455063 1.313200 2.663086
There is no problematic multicollinearity among the finally chosen features. Let's then analyze the diagnostic plots.
par(mfrow=c(2,2))
plot(lm_mult$finalModel)
Fitted vs Residual graph
The red line is a little further from zero than in the initial model, indicating somewhat less linearity, especially at the extreme fitted values. The spread of the residuals is not approximately the same across the x axis, indicating the presence of heteroskedasticity.
Normal Q-Q Plot
Very similar to the plot for the initial model. As said before, this does not represent a big problem, since given our sample size the Central Limit Theorem can be invoked.
Scale-Location
The red line is far from horizontal, confirming the presence of heteroskedasticity.
Residuals vs Leverage
The spread of the standardized residuals changes as a function of leverage, being larger for high-leverage points, which is another sign of heteroskedasticity. Three outliers lie beyond Cook's distance.
In the face of heteroskedasticity we still have unbiased parameter estimates, but we cannot trust their estimated variances, and hence the reported standard errors.
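One common remedy, not applied in this report, is to compute heteroskedasticity-consistent (robust) standard errors. The following is a minimal sketch using car::hccm (car is already loaded for the VIF calculations), refitting the final specification with a plain lm() call so the function can be applied directly:
# Refit the final specification with lm() (same formula as the caret model above)
final_lm <- lm(Profit ~ poly(Sales, 3) + Sales:Discount + Sub.Category + Gross.Margin,
               data = train)
# HC3 robust standard errors vs. the usual (homoskedastic) ones
robust_se  <- sqrt(diag(hccm(final_lm, type = "hc3")))
classic_se <- sqrt(diag(vcov(final_lm)))
round(cbind(classic_se, robust_se), 3)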
Finally let’s use the final model to predict on the test set and calculate the RMSE.
pred <- predict(lm_mult, newdata = test)
rmse <- sqrt(sum((pred - test$Profit)^2)/length(test$Profit))
normalized_rmse <- 100 * rmse/(max(train$Profit)-min(train$Profit))
cat(paste("RMSE =", rmse))
## RMSE = 91.742018621247
cat(paste("\nNormalized RMSE =", normalized_rmse, "%"))
##
## Normalized RMSE = 0.611615333095335 %
Even though an RMSE of 92 may seem high, we need to look at the range of our response variable and normalize the RMSE to get a better sense of the size of the error. The normalized RMSE is 0.6%, which is a low enough error for the purposes of our model.
Finally, let's have a look at the Profit vs Prediction plot. As expected, the values lie around a line of slope 1.
Profit <- test$Profit
Prediction <- pred
plot(Profit, Prediction)
In this section we will address a binary classification problem. The idea is to make the Profit column binary, so that all positive profits get the label Positive and all negative profits get the label Negative. This way we can train a model that predicts whether a purchase will result in a positive or negative profit based on features such as Region, Sub-Category, Segment, Ship.Mode, Sales, Quantity, Discount, Order.Month and Processing.Time.
As a first step let us define a function to compute the most typical evaluation metrics: accuracy, precision, recall and F1 score.
calc_metrics <- function(cm) {
n = sum(cm) # number of instances
nc = nrow(cm) # number of classes
diag = diag(cm) # number of correctly classified instances per class
rowsums = rowSums(cm) # number of instances per class
colsums = colSums(cm) # number of predictions per class
p = rowsums / n # distribution of instances over the actual classes
q = colsums / n # distribution of instances over the predicted classes
accuracy = sum(diag) / n
precision = diag / colsums
recall = diag / rowsums
f_1 = 2 * precision * recall / (precision + recall)
print(paste("Accuracy", accuracy))
print("Precision: ")
print(precision)
print("Recall: ")
print(recall)
print("F1 score: ")
print(f_1)
}
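In the binary case these per-class quantities reduce to the usual expressions in terms of true/false positives and negatives:
\[Precision = \frac{TP}{TP+FP}, \qquad Recall = \frac{TP}{TP+FN}, \qquad F_{1} = 2\cdot\frac{Precision\cdot Recall}{Precision+Recall}\]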
The Profit column is then modified and the categorical features are converted to factors.
df_log <- df[, c("Profit","Discount", "Quantity", "Sales", "Segment", "Ship.Mode", "Region", "Order.Month", "Processing.Time", "Sub.Category", "Order.Year")]
df_log["Profit"]<-replace(df_log["Profit"], df["Profit"] < 0, "Negative")
df_log["Profit"]<-replace(df_log["Profit"], df["Profit"] >= 0, "Positive")
df_log[,"Profit"]<- as.factor(df_log[,"Profit"])
df_log[,"Segment"]<- as.factor(df_log[,"Segment"])
df_log[,"Ship.Mode"]<- as.factor(df_log[,"Ship.Mode"])
df_log[,"Region"]<- as.factor(df_log[,"Region"])
df_log[,"Sub.Category"]<- as.factor(df_log[,"Sub.Category"])
Let's check for class imbalance in the data.
table(df_log$Profit)
##
## Negative Positive
## 1871 8123
prop.table(table(df_log$Profit))
##
## Negative Positive
## 0.1872123 0.8127877
contrasts(df_log$Profit)
## Positive
## Negative 0
## Positive 1
As we can see, there is a moderate class imbalance in our data. We will tackle this problem later by undersampling and oversampling accordingly, but first let's see what results we get with the imbalanced data. The train-test split is done by taking as training set all purchases from 2014, 2015 and 2016, leaving 2017 as the test set.
train <- (df_log$Order.Year < 2017)
df_trn <- df_log[train, ]
df_tst <- df_log[!train, ]
dim(df_tst)
## [1] 3312 11
Profit.2017 <- df_log$Profit[!train]
Before training the model feature selection is done to avoid overfitting and thus ensure better results. The methodology followed to select the best features is based on the comparison of the AIC and the McFadden’s \(R^{2}\) for different trained linear models obtained by suppressing features one at a time.
The Akaike Information Criterion (AIC) is a method for evaluating how well a model fits the data it was generated from. We use it to compare different models and determine which one fits the data better. The best-fit model according to the AIC is the one that explains the greatest amount of variation using the fewest independent variables. It is based on the number of independent variables K and the maximized value of the likelihood function L according to the expression:
\[ \mathrm{AIC} = 2K - 2\ln(L) \]
Usually, when comparing two models, a difference of more than 2 in their AIC value is enough to say the model with the lower AIC is better.
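As an illustrative sketch of this rule of thumb, two arbitrary candidate fits (chosen here only for illustration, not the models compared below) can be contrasted directly:
# Sketch: compare two arbitrary candidate fits via the AIC rule of thumb
fit_a <- glm(Profit ~ Discount + Sub.Category, data = df_trn, family = binomial)
fit_b <- glm(Profit ~ Discount + Sub.Category + Sales, data = df_trn, family = binomial)
abs(extractAIC(fit_a)[2] - extractAIC(fit_b)[2]) > 2 # TRUE suggests preferring the lower-AIC model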
Since GLMs are fitted by maximum likelihood, there is no minimization of the squared error and hence no ordinary \(R^{2}\) is calculated. There is a large number of pseudo-\(R^{2}\) measures for GLMs, among which the most popular is McFadden’s \(R^{2}\). It is calculated as:
\[R^{2} = 1 - \frac{\text{Residual Deviance}}{\text{Null Deviance}}\] We will therefore look for models with a higher McFadden’s \(R^{2}\).
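The computation used repeatedly below can be wrapped in a small helper function (a convenience sketch, not part of the original pipeline):
# Sketch: McFadden's pseudo-R^2 for a fitted glm object
mcfadden_r2 <- function(fit) {
  with(summary(fit), 1 - deviance / null.deviance)
}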
# Let's now try the same with fewer variables (dropping those with the highest p-values in the full fitted model)
# glm.fits1 <- glm(Profit ~ Discount + Quantity + Sales + Ship.Mode + Sub.Category , data = df_trn, family = binomial)
# glm.fits2 <- glm(Profit ~ Discount + Quantity + Sales + Ship.Mode + Processing.Time , data = df_trn, family = binomial)
# glm.fits3 <- glm(Profit ~ Discount + Sales + Sub.Category , data = df_trn, family = binomial)
# glm.fits4 <- glm(Profit ~ Discount + Quantity + Sales + Ship.Mode + Region + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits5 <- glm(Profit ~ Discount + Sales + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits6 <- glm(Profit ~ Discount + Quantity + Sales + Region + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits7 <- glm(Profit ~ Discount + Quantity + Sales + Ship.Mode + Region + Order.Month + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits8 <- glm(Profit ~ Discount + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits9 <- glm(Profit ~ Discount + Sales + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits10 <- glm(Profit ~ Quantity + Sales + Ship.Mode + Region + Order.Month + Processing.Time + Sub.Category , data = df_trn, family = binomial)
glm.fits1 <- glm(Profit ~ Discount + Quantity + Sales + Segment + Ship.Mode + Region + Order.Month + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits2 <- glm(Profit ~ Discount + Sales + Ship.Mode + Region + Processing.Time , data = df_trn, family = binomial) # Subcategory is important
glm.fits3 <- glm(Profit ~ Discount + Sales + Ship.Mode + Region + Sub.Category , data = df_trn, family = binomial)
glm.fits4 <- glm(Profit ~ Discount + Sales + Ship.Mode + Region + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits5 <- glm(Profit ~ Discount + Sales + Ship.Mode + Processing.Time + Sub.Category , data = df_trn, family = binomial) # Region seems to be important
# glm.fits6 <- glm(Profit ~ Discount + Sales + Region + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits7 <- glm(Profit ~ Discount + Sales + Ship.Mode + Region + Processing.Time + Sub.Category , data = df_trn, family = binomial)
glm.fits8 <- glm(Profit ~ Discount + Ship.Mode + Region + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits9 <- glm(Profit ~ Discount + Sales + Ship.Mode + Region + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits10 <- glm(Profit ~ Quantity + Sales + Ship.Mode + Region + Order.Month + Processing.Time + Sub.Category , data = df_trn, family = binomial) # Discount is very important
print(paste("AIC1: ", extractAIC(glm.fits1)[2]))
## [1] "AIC1: 1540.52892268126"
print(paste("McFadden's R^2 ", with(summary(glm.fits1), 1 - deviance/null.deviance)))
## [1] "McFadden's R^2 0.770234940945986"
# print(paste("AIC2: ", extractAIC(glm.fits2)[2]))
# print(paste("McFadden's R^2 ", with(summary(glm.fits2), 1 - deviance/null.deviance)))
print(paste("AIC3: ", extractAIC(glm.fits3)[2]))
## [1] "AIC3: 1537.68458296899"
print(paste("McFadden's R^2 ", with(summary(glm.fits3), 1 - deviance/null.deviance)))
## [1] "McFadden's R^2 0.769124445444416"
print(paste("AIC4: ", extractAIC(glm.fits4)[2]))
## [1] "AIC4: 1534.90359508884"
print(paste("McFadden's R^2 ", with(summary(glm.fits4), 1 - deviance/null.deviance)))
## [1] "McFadden's R^2 0.769866412687196"
# print(paste("AIC5: ", extractAIC(glm.fits5)[2]))
# print(paste("McFadden's R^2 ", with(summary(glm.fits5), 1 - deviance/null.deviance)))
# print(paste("AIC6: ", extractAIC(glm.fits6)[2]))
# print(paste("McFadden's R^2 ", with(summary(glm.fits6), 1 - deviance/null.deviance)))
# print(paste("AIC7: ", extractAIC(glm.fits7)[2]))
# print(paste("McFadden's R^2 ", with(summary(glm.fits7), 1 - deviance/null.deviance)))
print(paste("AIC8: ", extractAIC(glm.fits8)[2]))
## [1] "AIC8: 1534.24293430454"
print(paste("McFadden's R^2 ", with(summary(glm.fits8), 1 - deviance/null.deviance)))
## [1] "McFadden's R^2 0.769658559028093"
# print(paste("AIC9: ", extractAIC(glm.fits9)[2]))
# print(paste("McFadden's R^2 ", with(summary(glm.fits9), 1 - deviance/null.deviance)))
# print(paste("AIC10: ", extractAIC(glm.fits10)[2]))
# print(paste("McFadden's R^2 ", with(summary(glm.fits10), 1 - deviance/null.deviance)))
The best features found are Discount, Sales, Sub.Category, Processing.Time and Region. We choose only these five to keep the problem simple and because adding other features does not improve the performance of the model.
Logistic regression is a standard baseline algorithm for binary classification and works well when the classes are approximately linearly separable, so we will use it first to train our model. Furthermore, 5-fold cross-validation will give us more reliable estimates of the model’s performance.
glm_mod = train(
form = Profit ~ Discount + Sales + Sub.Category + Region + Processing.Time,
data = df_trn,
trControl = trainControl(method = "cv", number = 5),
method = "glm",
family = "binomial"
)
glm_mod
## Generalized Linear Model
##
## 6682 samples
## 5 predictor
## 2 classes: 'Negative', 'Positive'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 5346, 5345, 5346, 5345, 5346
## Resampling results:
##
## Accuracy Kappa
## 0.9446272 0.8067708
summary(glm_mod)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.83125 0.00000 0.00118 0.05005 2.60590
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.057e+00 4.490e-01 15.720 < 2e-16 ***
## Discount -3.012e+01 1.846e+00 -16.316 < 2e-16 ***
## Sales 1.110e-04 9.633e-05 1.153 0.24906
## Sub.CategoryAppliances 8.530e+00 8.265e+00 1.032 0.30205
## Sub.CategoryArt 1.845e+01 6.644e+02 0.028 0.97784
## Sub.CategoryBinders 7.067e+00 1.686e+00 4.191 2.78e-05 ***
## Sub.CategoryBookcases -9.599e-01 3.090e-01 -3.107 0.00189 **
## Sub.CategoryChairs 1.937e-01 2.369e-01 0.818 0.41338
## Sub.CategoryCopiers 2.325e+01 1.964e+03 0.012 0.99056
## Sub.CategoryEnvelopes 1.851e+01 1.116e+03 0.017 0.98677
## Sub.CategoryFasteners 9.761e-01 3.757e-01 2.598 0.00937 **
## Sub.CategoryFurnishings 1.167e+00 2.866e-01 4.073 4.63e-05 ***
## Sub.CategoryLabels 1.840e+01 9.519e+02 0.019 0.98458
## Sub.CategoryMachines 3.804e+00 7.249e-01 5.248 1.54e-07 ***
## Sub.CategoryPaper 1.843e+01 4.990e+02 0.037 0.97054
## Sub.CategoryPhones 2.036e+00 2.892e-01 7.039 1.93e-12 ***
## Sub.CategoryStorage -9.076e-01 2.056e-01 -4.413 1.02e-05 ***
## Sub.CategorySupplies -6.440e-01 3.170e-01 -2.031 0.04223 *
## Sub.CategoryTables -2.378e-01 3.331e-01 -0.714 0.47539
## RegionEast -2.676e-01 1.915e-01 -1.398 0.16222
## RegionSouth 6.827e-02 2.199e-01 0.310 0.75620
## RegionWest -5.004e-01 2.048e-01 -2.443 0.01456 *
## Processing.Time -2.360e-02 3.751e-02 -0.629 0.52921
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6443.7 on 6681 degrees of freedom
## Residual deviance: 1495.2 on 6659 degrees of freedom
## AIC: 1541.2
##
## Number of Fisher Scoring iterations: 19
Check McFadden’s \(R^2\).
print("McFadden's R^2")
## [1] "McFadden's R^2"
with(summary(glm_mod), 1 - deviance/null.deviance)
## [1] 0.7679639
Predict and display confusion matrix.
glm.probs <- predict(glm_mod, df_tst)
levels = c("Positive", "Negative")
cm <- as.matrix(table(glm.probs , Profit.2017)[levels,levels])
cm
## Profit.2017
## glm.probs Positive Negative
## Positive 2647 146
## Negative 45 474
Calculate evaluation metrics:
calc_metrics(cm)
## [1] "Accuracy 0.942330917874396"
## [1] "Precision: "
## Positive Negative
## 0.9832838 0.7645161
## [1] "Recall: "
## Positive Negative
## 0.9477265 0.9132948
## [1] "F1 score: "
## Positive Negative
## 0.9651778 0.8323090
roc.curve(Profit.2017, glm.probs, plotit = F)
## Area under the curve (AUC): 0.874
Now we’ll balance the dataset to determine if the classification improves for balanced data. Two approaches are used: a synthetic method and a sampling method.
The library ROSE (Random Over-Sampling Examples) is used to generate artificial data based on sampling methods and a smoothed bootstrap approach, and it has been shown to yield better results than traditional sampling methods in many situations. It is therefore a synthetic oversampling technique.
On the other hand, the function ovun.sample (also from the ROSE package) is used to perform oversampling and undersampling in one go. Its results will be compared to those of the synthetic approach.
balanced_data <- ROSE(Profit ~ ., data = df_trn, seed = 1)$data
table(balanced_data$Profit)
##
## Positive Negative
## 3368 3314
contrasts(balanced_data$Profit)
## Negative
## Positive 0
## Negative 1
We can see the data is now balanced, with roughly the same number of examples in each class. Now we train the model using 5-fold cross-validation.
glm_fits = train(
form = Profit ~ Discount + Sales + Sub.Category + Region + Processing.Time,
data = balanced_data,
trControl = trainControl(method = "cv", number = 5),
method = "glm",
family = "binomial"
)
glm_fits
## Generalized Linear Model
##
## 6682 samples
## 5 predictor
## 2 classes: 'Positive', 'Negative'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 5346, 5347, 5345, 5345, 5345
## Resampling results:
##
## Accuracy Kappa
## 0.9033228 0.8066274
summary(glm_fits)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.05236 -0.22592 -0.00001 0.19759 3.03605
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.533e+00 2.029e-01 -12.485 < 2e-16 ***
## Discount 1.518e+01 4.529e-01 33.516 < 2e-16 ***
## Sales -7.874e-05 5.664e-05 -1.390 0.164424
## Sub.CategoryAppliances -4.022e+00 6.470e-01 -6.217 5.08e-10 ***
## Sub.CategoryArt -1.897e+01 5.137e+02 -0.037 0.970548
## Sub.CategoryBinders -3.173e+00 2.493e-01 -12.725 < 2e-16 ***
## Sub.CategoryBookcases 1.346e+00 2.583e-01 5.212 1.86e-07 ***
## Sub.CategoryChairs 5.621e-01 1.745e-01 3.220 0.001280 **
## Sub.CategoryCopiers -2.062e+01 1.486e+03 -0.014 0.988929
## Sub.CategoryEnvelopes -1.912e+01 8.761e+02 -0.022 0.982589
## Sub.CategoryFasteners -1.091e+00 3.314e-01 -3.293 0.000991 ***
## Sub.CategoryFurnishings -1.109e+00 2.212e-01 -5.014 5.32e-07 ***
## Sub.CategoryLabels -1.915e+01 7.622e+02 -0.025 0.979959
## Sub.CategoryMachines -1.377e+00 3.750e-01 -3.674 0.000239 ***
## Sub.CategoryPaper -1.904e+01 3.979e+02 -0.048 0.961834
## Sub.CategoryPhones -1.073e+00 1.820e-01 -5.898 3.69e-09 ***
## Sub.CategoryStorage 9.205e-01 1.725e-01 5.335 9.54e-08 ***
## Sub.CategorySupplies 4.291e-01 2.827e-01 1.518 0.128974
## Sub.CategoryTables 1.313e+00 2.505e-01 5.244 1.57e-07 ***
## RegionEast 9.319e-02 1.345e-01 0.693 0.488519
## RegionSouth -5.819e-01 1.617e-01 -3.598 0.000320 ***
## RegionWest -4.679e-01 1.336e-01 -3.501 0.000463 ***
## Processing.Time 4.122e-02 2.522e-02 1.634 0.102155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9262.8 on 6681 degrees of freedom
## Residual deviance: 2810.6 on 6659 degrees of freedom
## AIC: 2856.6
##
## Number of Fisher Scoring iterations: 18
Check McFadden’s \(R^2\).
with(summary(glm_fits), 1 - deviance/null.deviance)
## [1] 0.6965676
Predict and display confusion matrix.
glm.probs <- predict(glm_fits , df_tst)
levels = c("Positive", "Negative")
cm <- as.matrix(table(glm.probs, Profit.2017)[levels,levels])
cm
## Profit.2017
## glm.probs Positive Negative
## Positive 2470 28
## Negative 222 592
Calculate evaluation metrics:
calc_metrics(cm)
## [1] "Accuracy 0.92451690821256"
## [1] "Precision: "
## Positive Negative
## 0.9175334 0.9548387
## [1] "Recall: "
## Positive Negative
## 0.9887910 0.7272727
## [1] "F1 score: "
## Positive Negative
## 0.9518304 0.8256625
roc.curve(Profit.2017, glm.probs, plotit = F)
## Area under the curve (AUC): 0.936
Let’s now see how the function ovun.sample performs compared to the previous approach. Here we can choose to do only oversampling, only undersampling, or a mix of the two. We choose a mix by setting p = 0.5, so that the positive class makes up roughly 50% of the resulting balanced data.
balanced_data <- ovun.sample(Profit ~ ., data = df_trn, method = "both", p=0.5, N = nrow(df_trn), seed = 1)$data # Can play with parameter p: probability of positive class in newly generated sample
table(balanced_data$Profit)
##
## Positive Negative
## 3368 3314
Train the model using 5-fold cross-validation.
glm_fits = train(
form = Profit ~ Discount + Sales + Sub.Category + Region + Processing.Time,
data = balanced_data,
trControl = trainControl(method = "cv", number = 5),
method = "glm",
family = "binomial"
)
glm_fits
## Generalized Linear Model
##
## 6682 samples
## 5 predictor
## 2 classes: 'Positive', 'Negative'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 5346, 5346, 5345, 5345, 5346
## Resampling results:
##
## Accuracy Kappa
## 0.9313078 0.8626549
summary(glm_fits)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2821 -0.0394 0.0000 0.0419 2.3534
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.873e+00 3.919e-01 -14.986 < 2e-16 ***
## Discount 3.131e+01 1.602e+00 19.544 < 2e-16 ***
## Sales -2.208e-04 6.687e-05 -3.302 0.000959 ***
## Sub.CategoryAppliances -9.352e+00 9.397e+00 -0.995 0.319657
## Sub.CategoryArt -1.884e+01 4.906e+02 -0.038 0.969369
## Sub.CategoryBinders -8.130e+00 1.682e+00 -4.832 1.35e-06 ***
## Sub.CategoryBookcases 1.065e+00 3.027e-01 3.518 0.000435 ***
## Sub.CategoryChairs -2.451e-01 2.008e-01 -1.220 0.222317
## Sub.CategoryCopiers -2.307e+01 1.334e+03 -0.017 0.986201
## Sub.CategoryEnvelopes -1.890e+01 8.524e+02 -0.022 0.982307
## Sub.CategoryFasteners -8.131e-01 3.111e-01 -2.614 0.008952 **
## Sub.CategoryFurnishings -1.136e+00 2.357e-01 -4.819 1.44e-06 ***
## Sub.CategoryLabels -1.887e+01 7.421e+02 -0.025 0.979708
## Sub.CategoryMachines -3.661e+00 5.618e-01 -6.517 7.17e-11 ***
## Sub.CategoryPaper -1.877e+01 3.826e+02 -0.049 0.960866
## Sub.CategoryPhones -1.744e+00 2.066e-01 -8.438 < 2e-16 ***
## Sub.CategoryStorage 1.111e+00 1.928e-01 5.764 8.21e-09 ***
## Sub.CategorySupplies 4.545e-01 3.105e-01 1.464 0.143225
## Sub.CategoryTables -4.067e-02 3.001e-01 -0.136 0.892200
## RegionEast 3.674e-01 1.818e-01 2.021 0.043282 *
## RegionSouth -2.246e-01 1.889e-01 -1.189 0.234271
## RegionWest 5.845e-01 1.772e-01 3.298 0.000972 ***
## Processing.Time 2.264e-02 3.320e-02 0.682 0.495262
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9262.8 on 6681 degrees of freedom
## Residual deviance: 1914.9 on 6659 degrees of freedom
## AIC: 1960.9
##
## Number of Fisher Scoring iterations: 18
Check McFadden’s \(R^2\).
with(summary(glm_fits), 1 - deviance/null.deviance)
## [1] 0.7932703
Predict and display confusion matrix.
glm.probs <- predict(glm_fits , df_tst)
levels = c("Positive", "Negative")
cm <- as.matrix(table(glm.probs, Profit.2017)[levels,levels])
cm
## Profit.2017
## glm.probs Positive Negative
## Positive 2472 33
## Negative 220 587
Calculate evaluation metrics:
calc_metrics(cm)
## [1] "Accuracy 0.923611111111111"
## [1] "Precision: "
## Positive Negative
## 0.9182764 0.9467742
## [1] "Recall: "
## Positive Negative
## 0.9868263 0.7273854
## [1] "F1 score: "
## Positive Negative
## 0.9513181 0.8227050
roc.curve(Profit.2017, glm.probs, plotit = F)
## Area under the curve (AUC): 0.933
Let’s do a quick comparison of the results for the unbalanced logistic regression and the two approaches followed to balance the data.
Metric | Unbalanced | ROSE | Ovun.sample |
---|---|---|---|
AIC | 1541.20 | 2856.60 | 1960.90 |
McFadden’s R^2 | 0.77 | 0.70 | 0.79 |
Accuracy | 0.94 | 0.92 | 0.92 |
Precision | 0.98 | 0.92 | 0.92 |
Recall | 0.95 | 0.99 | 0.98 |
F1 score | 0.97 | 0.95 | 0.95 |
AUC | 0.87 | 0.94 | 0.93 |
We can see that the model trained on the unbalanced data actually performs very well compared to the other models. Its AIC is the lowest, and its evaluation metrics are close to those of the models trained on balanced data. However, the AUC score is higher for the balanced data.
Ovun.sample and ROSE perform similarly to each other, but the lower AIC and higher McFadden’s \(R^{2}\) of the former would make us choose it over the latter.
Next we will try a naive Bayes classifier to compare to our logistic regression model. We use the unbalanced data since it proved to perform well with logistic regression.
nb_mod = train(
form = Profit ~ Discount + Processing.Time + Sales + Region,
data = df_trn,
trControl = trainControl(method = "cv", number = 5),
method = "nb"
)
nb_mod
## Naive Bayes
##
## 6682 samples
## 4 predictor
## 2 classes: 'Negative', 'Positive'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 5346, 5346, 5345, 5345, 5346
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.9319076 0.7527527
## TRUE 0.9413348 0.7865320
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
## = 1.
Unlike in the previous models, here we have dropped the feature Sub.Category because it caused an error during training. We believe the reason is the scarcity of data for this feature: some Sub-Categories contain only one or very few observations.
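This can be verified quickly by tabulating the training examples per Sub-Category and profit class (a diagnostic check, not part of the original pipeline):
# Diagnostic check: number of training examples per Sub-Category and profit class
table(df_trn$Sub.Category, df_trn$Profit)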
Predict and display confusion matrix:
nb.probs <- predict(nb_mod, df_tst)
levels = c("Positive", "Negative")
cm <- as.matrix(table(nb.probs , Profit.2017)[levels,levels])
cm
## Profit.2017
## nb.probs Positive Negative
## Positive 2681 181
## Negative 11 439
calc_metrics(cm)
## [1] "Accuracy 0.942028985507246"
## [1] "Precision: "
## Positive Negative
## 0.9959138 0.7080645
## [1] "Recall: "
## Positive Negative
## 0.9367575 0.9755556
## [1] "F1 score: "
## Positive Negative
## 0.9654303 0.8205607
roc.curve(Profit.2017, nb.probs, plotit = F)
## Area under the curve (AUC): 0.852
The evaluation metrics are quite similar to those of logistic regression. Based on the slightly higher AUC score of logistic regression, we would choose that model over the naive Bayes classifier.
Here we will use linear discriminant analysis to classify and model the unbalanced data. We also want to determine if it performs better than logistic regression and naive Bayes.
lda_mod = train(
form = Profit ~ Discount + Sales + Region + Processing.Time + Sub.Category,
data = df_trn,
trControl = trainControl(method = "cv", number = 5),
method = "lda"
)
lda_mod
## Linear Discriminant Analysis
##
## 6682 samples
## 5 predictor
## 2 classes: 'Negative', 'Positive'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 5344, 5346, 5346, 5346, 5346
## Resampling results:
##
## Accuracy Kappa
## 0.9429813 0.7962111
Predict and display confusion matrix.
lda.probs <- predict(lda_mod, df_tst)
levels = c("Positive", "Negative")
cm <- as.matrix(table(lda.probs , Profit.2017)[levels,levels])
cm
## Profit.2017
## lda.probs Positive Negative
## Positive 2665 161
## Negative 27 459
Calculate evaluation metrics:
calc_metrics(cm)
## [1] "Accuracy 0.943236714975845"
## [1] "Precision: "
## Positive Negative
## 0.9899703 0.7403226
## [1] "Recall: "
## Positive Negative
## 0.9430290 0.9444444
## [1] "F1 score: "
## Positive Negative
## 0.9659297 0.8300181
roc.curve(Profit.2017, lda.probs, plotit = F)
## Area under the curve (AUC): 0.865
The evaluation metrics are as good as those of logistic regression and the naive Bayes classifier on the unbalanced data.
Finally, for the sake of completeness, we use quadratic discriminant analysis to train the classification model.
qda_mod = train(
form = Profit ~ Discount + Sales + Region + Processing.Time,
data = df_trn,
trControl = trainControl(method = "cv", number = 5),
method = "qda"
)
qda_mod
## Quadratic Discriminant Analysis
##
## 6682 samples
## 4 predictor
## 2 classes: 'Negative', 'Positive'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 5345, 5346, 5346, 5345, 5346
## Resampling results:
##
## Accuracy Kappa
## 0.9346014 0.7655044
As with the naive Bayes classifier, we have dropped the feature Sub.Category because it caused an error during training.
Predict and display confusion matrix:
qda.probs <- predict(qda_mod, df_tst)
levels = c("Positive", "Negative")
cm <- as.matrix(table(qda.probs , Profit.2017)[levels,levels])
cm
## Profit.2017
## qda.probs Positive Negative
## Positive 2656 179
## Negative 36 441
Calculate evaluation metrics:
calc_metrics(cm)
## [1] "Accuracy 0.935084541062802"
## [1] "Precision: "
## Positive Negative
## 0.9866270 0.7112903
## [1] "Recall: "
## Positive Negative
## 0.9368607 0.9245283
## [1] "F1 score: "
## Positive Negative
## 0.9611001 0.8040109
roc.curve(Profit.2017, qda.probs, plotit = F)
## Area under the curve (AUC): 0.849
Again, this model behaves similarly to the models previously trained on the unbalanced data. The AUC score is high enough to say the model achieves good classification performance.