Analyzing a Superstore dataset

The aim of this work is to analyze a dataset of purchases in an anonymous online store. The analysis will consist of data cleaning, exploratory data analysis (EDA), a simple case of linear regression, a more complete study of multiple linear regression and finally a binary classification problem.

Installs and libraries

# Installs
#install.packages("caret")
#install.packages("car")
#install.packages("ggplot2")
#install.packages("devtools")
#install.packages("ggcorrplot")
#devtools::install_github("laresbernardo/lares")
#devtools::install_github("ProcessMiner/nlcor")
#install.packages("ISLR2")
#devtools::install_github("kassambara/ggpubr")
#install.packages("zoo")
#install.packages("ROSE")
#install.packages("e1071")
#install.packages("klaR")
#install.packages("treemapify")
#install.packages("glmnet")

# Libraries
library(knitr)
library(tidyverse)
library(ggplot2)
library(forcats)
library(dplyr)
library(ggcorrplot)
library(lares)
library(treemapify)
library(MASS)
library(devtools) 
library(ISLR2)
library(ggpubr)
library(zoo)
library(reshape2)
library(ROSE) # for undersampling/oversampling
library(e1071) # for naive Bayes
library(caret)
library(klaR)
library(car) # for calculation of VIF in multiple linear regression
library(glmnet)
library(choroplethr)
library(choroplethrMaps)
library(kableExtra)
library(magrittr)

Importing Dataset

The dataset has been obtained from Kaggle, the famous data science platform. It is publicly available for educational purposes and has been downloaded as csv file for ease of use. It consists of 9994 purchases made from an anonymous online store and contains the following features:

Row ID => Unique ID for each row.
Order ID => Unique Order ID for the purchase.
Order Date => Order Date of the product.
Ship Date => Shipping Date of the product.
Ship Mode=> Shipping Mode specified by the customer.
Customer ID => Unique ID to identify the customer.
Customer Name => Name of the customer.
Segment => The segment to which the customer belongs.
Country => Country where the purchase was made.
City => City where the purchase was made.
State => State where the purchase was made.
Postal Code => Postal Code of the customer.
Region => Region where the customer belongs.
Product ID => Unique ID of the product.
Category => Category of the product ordered.
Sub-Category => Sub-Category of the product ordered.
Product Name => Name of the product.
Sales => Sales of the product.
Quantity => Quantity of the product.
Discount => Discount provided.
Profit => Profit/Loss incurred.

# Read the dataset
df <- read.csv("~/Luis/Universidad/Master_DS/Second_semester/Statistical_Learning/Project/Sample-Superstore.csv",header=TRUE)
#kable(df[1:5,])

df[1:15,] %>%
  kable(format = "html", col.names = colnames(df)) %>%
  column_spec(c(2:21), width_min = "0.8in")  %>%
  kable_styling() %>%
  kableExtra::scroll_box(width = "100%", height = "300px")

Row.ID	Order.ID	Order.Date	Ship.Date	Ship.Mode	Customer.ID	Customer.Name	Segment	Country	City	State	Postal.Code	Region	Product.ID	Category	Sub.Category	Product.Name	Sales	Quantity	Discount	Profit
1	CA-2016-152156	11/8/2016	11/11/2016	Second Class	CG-12520	Claire Gute	Consumer	United States	Henderson	Kentucky	42420	South	FUR-BO-10001798	Furniture	Bookcases	Bush Somerset Collection Bookcase	261.9600	2	0.00	41.9136
2	CA-2016-152156	11/8/2016	11/11/2016	Second Class	CG-12520	Claire Gute	Consumer	United States	Henderson	Kentucky	42420	South	FUR-CH-10000454	Furniture	Chairs	Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back	731.9400	3	0.00	219.5820
3	CA-2016-138688	6/12/2016	6/16/2016	Second Class	DV-13045	Darrin Van Huff	Corporate	United States	Los Angeles	California	90036	West	OFF-LA-10000240	Office Supplies	Labels	Self-Adhesive Address Labels for Typewriters by Universal	14.6200	2	0.00	6.8714
4	US-2015-108966	10/11/2015	10/18/2015	Standard Class	SO-20335	Sean O’Donnell	Consumer	United States	Fort Lauderdale	Florida	33311	South	FUR-TA-10000577	Furniture	Tables	Bretford CR4500 Series Slim Rectangular Table	957.5775	5	0.45	-383.0310
5	US-2015-108966	10/11/2015	10/18/2015	Standard Class	SO-20335	Sean O’Donnell	Consumer	United States	Fort Lauderdale	Florida	33311	South	OFF-ST-10000760	Office Supplies	Storage	Eldon Fold ’N Roll Cart System	22.3680	2	0.20	2.5164
6	CA-2014-115812	6/9/2014	6/14/2014	Standard Class	BH-11710	Brosina Hoffman	Consumer	United States	Los Angeles	California	90032	West	FUR-FU-10001487	Furniture	Furnishings	Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood	48.8600	7	0.00	14.1694
7	CA-2014-115812	6/9/2014	6/14/2014	Standard Class	BH-11710	Brosina Hoffman	Consumer	United States	Los Angeles	California	90032	West	OFF-AR-10002833	Office Supplies	Art	Newell 322	7.2800	4	0.00	1.9656
8	CA-2014-115812	6/9/2014	6/14/2014	Standard Class	BH-11710	Brosina Hoffman	Consumer	United States	Los Angeles	California	90032	West	TEC-PH-10002275	Technology	Phones	Mitel 5320 IP Phone VoIP phone	907.1520	6	0.20	90.7152
9	CA-2014-115812	6/9/2014	6/14/2014	Standard Class	BH-11710	Brosina Hoffman	Consumer	United States	Los Angeles	California	90032	West	OFF-BI-10003910	Office Supplies	Binders	DXL Angle-View Binders with Locking Rings by Samsill	18.5040	3	0.20	5.7825
10	CA-2014-115812	6/9/2014	6/14/2014	Standard Class	BH-11710	Brosina Hoffman	Consumer	United States	Los Angeles	California	90032	West	OFF-AP-10002892	Office Supplies	Appliances	Belkin F5C206VTEL 6 Outlet Surge	114.9000	5	0.00	34.4700
11	CA-2014-115812	6/9/2014	6/14/2014	Standard Class	BH-11710	Brosina Hoffman	Consumer	United States	Los Angeles	California	90032	West	FUR-TA-10001539	Furniture	Tables	Chromcraft Rectangular Conference Tables	1706.1840	9	0.20	85.3092
12	CA-2014-115812	6/9/2014	6/14/2014	Standard Class	BH-11710	Brosina Hoffman	Consumer	United States	Los Angeles	California	90032	West	TEC-PH-10002033	Technology	Phones	Konftel 250 Conference<a0>phone<a0>- Charcoal black	911.4240	4	0.20	68.3568
13	CA-2017-114412	4/15/2017	4/20/2017	Standard Class	AA-10480	Andrew Allen	Consumer	United States	Concord	North Carolina	28027	South	OFF-PA-10002365	Office Supplies	Paper	Xerox 1967	15.5520	3	0.20	5.4432
14	CA-2016-161389	12/5/2016	12/10/2016	Standard Class	IM-15070	Irene Maddox	Consumer	United States	Seattle	Washington	98103	West	OFF-BI-10003656	Office Supplies	Binders	Fellowes PB200 Plastic Comb Binding Machine	407.9760	3	0.20	132.5922
15	US-2015-118983	11/22/2015	11/26/2015	Standard Class	HP-14815	Harold Pawlan	Home Office	United States	Fort Worth	Texas	76106	Central	OFF-AP-10002311	Office Supplies	Appliances	Holmes Replacement Filter for HEPA Air Cleaner, Very Large Room, HEPA Filter	68.8100	5	0.80	-123.8580

Data cleaning

In this section we will carry out the cleaning of the data. First of all we will check basic information, such as missing values (in the form of NaN, NA and Null), duplicates, number of unique values and determine which columns can be removed and which can be modified or created. After, we will focus our efforts on determining outliers.

Basic data cleaning

With a summary of the data we can know important information about each feature, such as its class, its mean, quartiles, maximum and minimum.

summary(df)

##      Row.ID       Order.ID          Order.Date         Ship.Date        
##  Min.   :   1   Length:9994        Length:9994        Length:9994       
##  1st Qu.:2499   Class :character   Class :character   Class :character  
##  Median :4998   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :4998                                                           
##  3rd Qu.:7496                                                           
##  Max.   :9994                                                           
##   Ship.Mode         Customer.ID        Customer.Name        Segment         
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Country              City              State            Postal.Code   
##  Length:9994        Length:9994        Length:9994        Min.   : 1040  
##  Class :character   Class :character   Class :character   1st Qu.:23223  
##  Mode  :character   Mode  :character   Mode  :character   Median :56431  
##                                                           Mean   :55190  
##                                                           3rd Qu.:90008  
##                                                           Max.   :99301  
##     Region           Product.ID          Category         Sub.Category      
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Product.Name           Sales              Quantity        Discount     
##  Length:9994        Min.   :    0.444   Min.   : 1.00   Min.   :0.0000  
##  Class :character   1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000  
##  Mode  :character   Median :   54.490   Median : 3.00   Median :0.2000  
##                     Mean   :  229.858   Mean   : 3.79   Mean   :0.1562  
##                     3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000  
##                     Max.   :22638.480   Max.   :14.00   Max.   :0.8000  
##      Profit         
##  Min.   :-6599.978  
##  1st Qu.:    1.729  
##  Median :    8.666  
##  Mean   :   28.657  
##  3rd Qu.:   29.364  
##  Max.   : 8399.976

Check if there are Na and NAN:

is.null(df)

## [1] FALSE

sum(is.na(df))

## [1] 0

sum(is.nan(as.matrix(df)))

## [1] 0

There are no missing values of any type. This is certainly not very surprising since most datasets available in Kaggle for educational use have usually gone through some sort of cleaning beforehand. Therefore, this means that for now we still have the 9994 examples available.

Number of duplicates:

sum(duplicated(df))

## [1] 0

No duplicates present. Same reasoning as for missing values apply.

Let’s have a look at the unique values in each column.

cat(paste("Sub.Category: ", length(unique(df[["Sub.Category"]])) ), sep="\n")

## Sub.Category:  17

cat(paste("Category: ", length(unique(df[["Category"]]))), sep="\n")

## Category:  3

cat(paste("Country: ", length(unique(df[["Country"]]))), sep="\n")

## Country:  1

cat(paste("Region: ", length(unique(df[["Region"]]))), sep="\n")

## Region:  4

cat(paste("State: ", length(unique(df[["State"]]))), sep="\n")

## State:  49

cat(paste("City: ", length(unique(df[["City"]]))), sep="\n")

## City:  531

cat(paste("Ship.Mode: ", length(unique(df[["Ship.Mode"]]))), sep="\n")

## Ship.Mode:  4

cat(paste("Customer.ID: ", length(unique(df[["Customer.ID"]]))), sep="\n")

## Customer.ID:  793

cat(paste("Order.ID: ", length(unique(df[["Order.ID"]]))), sep="\n")

## Order.ID:  5009

It’s shown that there are 17 different sub-categories grouped in 3 categories, 532 cities belonging to 49 states and 4 regions, 4 ship modes and 793 different customers. Having 9994 orders and 793 customers means a good quantity of those customers (if not all) have purchased more than one subcategory or made more than one order. Different subcategory products bought together will be found in different rows of the dataset, even though they share the same order number. This conclusion can be drawn from the number of unique orders, which is slightly bigger than half of the number of purchases and the fact that there are no duplicates . This is all good information that we can use later in the Exploratory Data Analysis section.

We can get rid of some columns. The name of the customer is not very useful when we already have a unique customer ID for each client. This is also a good practice to maintain the anonymity of customers. The Country column is also not relevant since there is only data for the USA. And finally the postal code column won’t be needed since we will not carry our analysis in such detail and with the city and state columns it will be enough.

df = df[-c(1,7,9,12)]

Convert date columns to date format:

df$Order.Date <- as.Date(df$Order.Date, format = "%m/%d/%Y")
df$Ship.Date <- as.Date(df$Ship.Date, format = "%m/%d/%Y")

Next, we can create some useful columns. For example, if we subtract the Order.Date from the Ship.Date we get the Processing.Time, this is, how much time the store needed to ship the product since it was ordered. The gross margin is also useful to have, it is calculated as the ratio of Profit to Sales multiplied by 100 (in %). Finally, we can separate the Order.Date column into three other columns for the year, the month and the information concerning the year and month.

df$Order.Year <- format(df$Order.Date,"%Y")
df$Order.Year <- as.numeric(df$Order.Year)                                
df$Order.Month <- format(df$Order.Date,"%m")
df$Order.Month <- as.numeric(df$Order.Month)
df$Order.Year_month <- format(df$Order.Date,"%Y-%m")
df$Order.Year_month <- as.Date(paste(df$Order.Year_month,"-15",sep=""))
df$Order.Date <- as.numeric(df$Order.Date)
df$Processing.Time <- df$Ship.Date - df$Order.Date
df$Processing.Time <- as.numeric(df$Processing.Time)
df$Gross.Margin <- (df$Profit/df$Sales)*100

Let’s have a look at the final structure of the data after doing the aforementioned modifications.

summary(df)

##    Order.ID           Order.Date      Ship.Date           Ship.Mode        
##  Length:9994        Min.   :16073   Min.   :2014-01-07   Length:9994       
##  Class :character   1st Qu.:16578   1st Qu.:2015-05-27   Class :character  
##  Mode  :character   Median :16978   Median :2016-06-29   Mode  :character  
##                     Mean   :16921   Mean   :2016-05-03                     
##                     3rd Qu.:17300   3rd Qu.:2017-05-18                     
##                     Max.   :17530   Max.   :2018-01-05                     
##  Customer.ID          Segment              City              State          
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     Region           Product.ID          Category         Sub.Category      
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Product.Name           Sales              Quantity        Discount     
##  Length:9994        Min.   :    0.444   Min.   : 1.00   Min.   :0.0000  
##  Class :character   1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000  
##  Mode  :character   Median :   54.490   Median : 3.00   Median :0.2000  
##                     Mean   :  229.858   Mean   : 3.79   Mean   :0.1562  
##                     3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000  
##                     Max.   :22638.480   Max.   :14.00   Max.   :0.8000  
##      Profit            Order.Year    Order.Month    Order.Year_month    
##  Min.   :-6599.978   Min.   :2014   Min.   : 1.00   Min.   :2014-01-15  
##  1st Qu.:    1.729   1st Qu.:2015   1st Qu.: 5.00   1st Qu.:2015-05-15  
##  Median :    8.666   Median :2016   Median : 9.00   Median :2016-06-15  
##  Mean   :   28.657   Mean   :2016   Mean   : 7.81   Mean   :2016-04-29  
##  3rd Qu.:   29.364   3rd Qu.:2017   3rd Qu.:11.00   3rd Qu.:2017-05-15  
##  Max.   : 8399.976   Max.   :2017   Max.   :12.00   Max.   :2017-12-15  
##  Processing.Time  Gross.Margin    
##  Min.   :0.000   Min.   :-275.00  
##  1st Qu.:3.000   1st Qu.:   7.50  
##  Median :4.000   Median :  27.00  
##  Mean   :3.958   Mean   :  12.03  
##  3rd Qu.:5.000   3rd Qu.:  36.25  
##  Max.   :7.000   Max.   :  50.00

Outliers detection

Let’s have a look again at the summary of columns that could potentially have some outliers.

summary(df$Sales)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     0.444    17.280    54.490   229.858   209.940 22638.480

summary(df$Quantity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    3.00    3.79    5.00   14.00

summary(df$Discount)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.2000  0.1562  0.2000  0.8000

summary(df$Profit)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -6599.978     1.729     8.666    28.657    29.364  8399.976

summary(df$Processing.Time)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   4.000   3.958   5.000   7.000

summary(df$Order.Year)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2014    2015    2016    2016    2017    2017

summary(df$Gross.Margin)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -275.00    7.50   27.00   12.03   36.25   50.00

summary(df$Order.Date)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16073   16578   16978   16921   17300   17530

By checking the maximum and minimum values it is hard to tell if there could be outliers, since, even though in some cases the maxima are a bit far from the 3rd quartile, they are all possible values for the quantities they describe.

Let’s plot the numerical variables to have a look at their distribution.

hst_Sales <- ggplot(df) + aes(x = Sales) + 
geom_histogram(color="white", fill="#80b1d3", bins = 100) + 
labs(title="", x="Sales", y="Frequency") +
theme_minimal()+ 
theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28), 
  axis.title.y = element_text(size = 28), title = element_text(size = 16))

hst_Quantity <- ggplot(df) + aes(x = Quantity) + 
geom_histogram(color="white", fill="#80b1d3", bins = 14) + 
labs(title="", x="Quantity", y="Frequency") + 
theme_minimal()+ 
theme(axis.text=element_text(size=24), 
  axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
  title = element_text(size = 16))

hst_Discount <- ggplot(df) + aes(x = Discount) + 
geom_histogram(color="white", fill="#80b1d3", bins = 9) + 
labs(title="", x="Discount", y="Frequency") + 
theme_minimal()+ 
theme(axis.text=element_text(size=24), 
  axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
  title = element_text(size = 16))

hst_Profit <- ggplot(df) + aes(x = Profit) + 
geom_histogram(color="white", fill="#80b1d3", bins = 120) + 
labs(title="", x="Profit", y="Frequency") + 
theme_minimal()+ 
theme(axis.text=element_text(size=24), 
  axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
  title = element_text(size = 16))

hst_Processing <- ggplot(df) + aes(x = Processing.Time) + 
geom_histogram(color="white", fill="#80b1d3", bins = 8)  + 
labs(title="", x="Processing time", y="Frequency") +
theme_minimal()+  
theme(axis.text=element_text(size=24), 
  axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
  title = element_text(size = 16))

hst_Year <- ggplot(df) + aes(x = Order.Year) + 
geom_histogram(color="white", fill="#80b1d3", bins = 4)  + 
labs(title="", x="Years", y="Frequency") + 
theme_minimal()+ 
theme(axis.text=element_text(size=24), 
  axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
  title = element_text(size = 16))

################################################################################

require(gridExtra)
grid.arrange(hst_Sales, hst_Quantity, hst_Year, hst_Discount, hst_Processing, hst_Profit, ncol = 2, nrow = 3)

Processing time, Years, Discount and Quantity look fine. The latter presents some right skewness but for our criteria is not a case of possible outliers. The Sales and profit plots require some transformation before being able to draw a conclusion. A log transformation should do the work.

df_positive_profit <- df %>% filter(Profit >= 0)
df_negative_profit <- df %>% filter(Profit < 0)

plot1 <- ggplot(df_positive_profit) + aes(x = log(Profit)) + 
  geom_histogram(color="white", fill="#80b1d3",  bins = 12) + 
  labs(title="", x="Log(Pos. Profit)", y="Frequency") + 
  theme_minimal()+ 
  theme(axis.text=element_text(size=24), 
    axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28), title = element_text(size = 16))

plot2 <- ggplot(df_negative_profit) + aes(x = log(-Profit)) + 
  geom_histogram(color="white", fill="#80b1d3", bins = 12) + 
  labs(title="", x="Log(Neg. Profit)", y="Frequency") + 
  theme_minimal()+ 
  theme(axis.text=element_text(size=24), 
    axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28), title = element_text(size = 16))


plot3 <- ggplot(df) + aes(x = log(Sales)) + 
  geom_histogram(color="white", fill="#80b1d3", bins = 30) + 
  labs(title="", x="log(Sales)", y="Frequency") + 
  theme_minimal()+ 
  theme(axis.text=element_text(size=24), 
    axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28), title = element_text(size = 16))


df_profit <- df[!(df$Profit < 600 & df$Profit >= -600 ),]
plot4 <- ggplot(df_profit) + aes(x = Profit) + 
  geom_histogram(color="white", fill="#80b1d3") + 
  labs(title="", x="Profit (omitting central values)", y="Frequency") + 
  theme_minimal()+ 
  theme(axis.text=element_text(size=24), 
    axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28), title = element_text(size = 16))

################################################################################


require(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, ncol = 2, nrow = 2)

The log of Sales looks pretty Gaussian to the eye and the skewness is gone. The variable Profit presents both negative and positive values, therefore we separate them both in two histograms, taking the absolute value of the negative one. We also show the Profit distribution without any transformation but hiding the central highest values so the the tails are easier to see. There seem to be no strong indications of the presence of outliers for any of the quantities.

Finally, we can check the box plots of all numeric variables. Box plots are one of the most useful tools in outliers identification.

plot1 <- ggplot(df, aes(y = Quantity)) + 
  geom_boxplot(color="black", fill="#80b1d3") + theme_minimal() +
  theme(axis.text=element_text(size=24), 
  axis.title.x = element_text(size = 28), 
  axis.title.y = element_text(size = 28), title = element_text(size = 16))

plot2 <- ggplot(df, aes(y = Discount)) + 
  geom_boxplot(color="black", fill="#80b1d3") + theme_minimal() +
  theme(axis.text=element_text(size=24), 
  axis.title.x = element_text(size = 28), 
  axis.title.y = element_text(size = 28), title = element_text(size = 16))

plot3 <- ggplot(df, aes(y = Sales)) + 
  geom_boxplot(color="black", fill="#80b1d3") + theme_minimal() +
  theme(axis.text=element_text(size=24), 
  axis.title.x = element_text(size = 28), 
  axis.title.y = element_text(size = 28), title = element_text(size = 16))

plot4 <- ggplot(df, aes(y = Profit)) + 
  geom_boxplot(color="black", fill="#80b1d3") + theme_minimal() +
  theme(axis.text=element_text(size=24), 
  axis.title.x = element_text(size = 28), 
  axis.title.y = element_text(size = 28), title = element_text(size = 16))

plot5 <- ggplot(df, aes(y = Processing.Time)) + 
  geom_boxplot(color="black", fill="#80b1d3") + theme_minimal() +
  theme(axis.text=element_text(size=24), 
  axis.title.x = element_text(size = 28), 
  axis.title.y = element_text(size = 28), title = element_text(size = 16))

# Alternatively we can also separate the quantitative variable per the qualitative one, for example:

plot6 <- ggplot(df, aes(x = Category, y = Discount)) + 
  geom_boxplot(color="black", fill="#80b1d3") + theme_minimal() +
  theme(axis.text=element_text(size=24), 
  axis.title.x = element_text(size = 28), 
  axis.title.y = element_text(size = 28), title = element_text(size = 16))

plot7 <- ggplot(df, aes(x = Region, y = Sales)) + 
  geom_boxplot(color="black", fill="#80b1d3") + theme_minimal() +
  theme(axis.text=element_text(size=24), 
  axis.title.x = element_text(size = 28), 
  axis.title.y = element_text(size = 28), title = element_text(size = 16))

plot8 <- ggplot(df, aes(x = Category, y = Profit)) + 
  geom_boxplot(color="black", fill="#80b1d3") + theme_minimal() +
  theme(axis.text=element_text(size=24), 
  axis.title.x = element_text(size = 28), 
  axis.title.y = element_text(size = 28), title = element_text(size = 16))

################################################################################

require(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, plot7, plot8, ncol = 4, nrow = 2)

Except for Processing Time, there are potential outliers in almost all variables. Sales and Profits stand out. Quantity and Discount present a few possible outliers. As seen at the beginning, the maximum discount corresponds to 0.8, which although rare, it is a value that can occur on specific occasions. The maximum value of Quantity is 14, which is also perfectly possible, especially in the case of items that are purchased in large quantities and also given that we have several types of customers. Those buying for their business or company would be expected to buy in larger quantities.

For these reasons and those stated above, we believe that we do not have enough information about the store and its functioning to be able to really determine if some of the purchases made can be classified as outliers. In the absence of a safe and definitive criterion, we chose not to eliminate any of the potential outlier points.

Relevant questions

As a starting point to develop the following sections of the work, we will try to answer the following questions.

Which subcategories are more popular? Are those the ones that generate more profit?
Do different segments prefer different ship modes?
In which states are there more orders? And in which cities? Do they correspond to the ones that generate more profit?
Which is the top-10 clients according to purchases and to which segment do they belong? Are they the ones that generate more profit?
Do different segments buy some categories more than others?
Is there a difference between the profit generated by each segment in the different categories? And between the profit generated in each region in each category?
Does the quantity of products bought change in each category when there is a discount?
How does the profit change for different discount ranges? What would be a threshold discount for positive/negative profits?
How do orders behave throughout the year? Can we identify some patterns?
Do sales and profits behave similarly for every month?
Which segment is generating more sales? In which region?
Do we observe a different behavior for holiday periods (November and December in the USA)?
What is the relation between profit and sales? Is it linear?
How accurately can we estimate this relation?
How accurately can we predict future profits based on sales?
What features are more relevant when determining the profit?
Is there any interaction among the different features?
Is it possible to predict if a purchase will result on positive or negative profit? Which features are more relevant for this task? How accurate is the prediction?

Exploratory Data Analysis

summary(df)

##    Order.ID           Order.Date      Ship.Date           Ship.Mode        
##  Length:9994        Min.   :16073   Min.   :2014-01-07   Length:9994       
##  Class :character   1st Qu.:16578   1st Qu.:2015-05-27   Class :character  
##  Mode  :character   Median :16978   Median :2016-06-29   Mode  :character  
##                     Mean   :16921   Mean   :2016-05-03                     
##                     3rd Qu.:17300   3rd Qu.:2017-05-18                     
##                     Max.   :17530   Max.   :2018-01-05                     
##  Customer.ID          Segment              City              State          
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     Region           Product.ID          Category         Sub.Category      
##  Length:9994        Length:9994        Length:9994        Length:9994       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Product.Name           Sales              Quantity        Discount     
##  Length:9994        Min.   :    0.444   Min.   : 1.00   Min.   :0.0000  
##  Class :character   1st Qu.:   17.280   1st Qu.: 2.00   1st Qu.:0.0000  
##  Mode  :character   Median :   54.490   Median : 3.00   Median :0.2000  
##                     Mean   :  229.858   Mean   : 3.79   Mean   :0.1562  
##                     3rd Qu.:  209.940   3rd Qu.: 5.00   3rd Qu.:0.2000  
##                     Max.   :22638.480   Max.   :14.00   Max.   :0.8000  
##      Profit            Order.Year    Order.Month    Order.Year_month    
##  Min.   :-6599.978   Min.   :2014   Min.   : 1.00   Min.   :2014-01-15  
##  1st Qu.:    1.729   1st Qu.:2015   1st Qu.: 5.00   1st Qu.:2015-05-15  
##  Median :    8.666   Median :2016   Median : 9.00   Median :2016-06-15  
##  Mean   :   28.657   Mean   :2016   Mean   : 7.81   Mean   :2016-04-29  
##  3rd Qu.:   29.364   3rd Qu.:2017   3rd Qu.:11.00   3rd Qu.:2017-05-15  
##  Max.   : 8399.976   Max.   :2017   Max.   :12.00   Max.   :2017-12-15  
##  Processing.Time  Gross.Margin    
##  Min.   :0.000   Min.   :-275.00  
##  1st Qu.:3.000   1st Qu.:   7.50  
##  Median :4.000   Median :  27.00  
##  Mean   :3.958   Mean   :  12.03  
##  3rd Qu.:5.000   3rd Qu.:  36.25  
##  Max.   :7.000   Max.   :  50.00

Analysis of discrete variables

First of all let’s see the proportion that each subcategory represents in general and inside each category. We can see that Binders, Paper, Furnishing and Phones (in that order) are the more popular subcategories. The less popular ones are Copiers and Machines, followed by Bookcases, Envelopes, Fasteners and Supplies. When talking about categories, Office Supplies is the most represented one. Both Technology and Furniture have a similar number of items.

plotdata <- df %>% count(Sub.Category) %>% arrange(desc(Sub.Category)) %>%
      mutate(prop = round(n*100/sum(n), 1), lab.ypos = cumsum(prop) - 0.5*prop)

plotdata$label <- paste0(plotdata$Sub.Category, "\n", round(plotdata$prop), "%")

plot1 <- ggplot(plotdata, aes(x = "", y = prop, fill = Sub.Category)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  coord_polar(theta = "y", start = 1.4, direction = -1) +
  theme(legend.position = "FALSE", axis.text=element_text(size=32))+
  scale_y_continuous(breaks=cumsum(plotdata$prop) - plotdata$prop / 2, 
  labels= plotdata$label) + labs(y = "", x = "", title = "") + theme_minimal()+   theme(axis.text=element_text(size=24), axis.title.x = element_text(size =28),   axis.title.y = element_text(size = 28), legend.text = element_text(size =24),
  legend.title = element_text(size =28))


################################################################################

df_cat <- df %>% group_by(Category, Sub.Category) %>% summarize(count=n())

plot2 <- ggplot(df_cat, aes(area=count, label=Sub.Category, fill=Category, subgroup=Category))+
    geom_treemap()+
    geom_treemap_subgroup_border(colour = "white", size = 5)+
    geom_treemap_subgroup_text(color='white', alpha=0.3, size=48,
    place='center', fontface='bold')+
    geom_treemap_text(color='white', place='center', size=34) +
    theme(legend.text = element_text(size = 24), 
    legend.title = element_text(size = 28))

################################################################################

require(gridExtra)
grid.arrange(plot1, plot2, ncol = 2, nrow = 1)

Now we will construct a set of useful barplots to understand better some relations between the features in the dataset. First, we plot the top-15 states and cities where more orders are made. California, New York and Texas are the states that make the most orders. Most of these states belong to the top-15 most populated states in the USA for the years of interest. Regarding the top-15 cities, they are mostly, as expected, cities of the states where more orders have been made. Los Angeles and New York City are the top-2 since they are also the most populated cities in the USA. The most bought category for all states and cities is Office Supplies.

The most popular shipping mode is Standard Class for all segments, followed by Second Class and First Class. This is also true for each category. The amount of orders belonging to the Office Supplies category is more than the sum of the orders of the two other categories.

The south region presents the lowest amount of orders while the west region the most. Consumer is the type of client that more orders make for each region and Home Office the one that makes less.

Regarding the profit, East and West are the regions that more profit generate. Consumer type of clients represent the most profit but also the most losses.

top_states <- tail(names(sort(table(df$State))),15)
df_top_states <- df %>% filter(df$State %in% top_states)
plot1 <- ggplot(df_top_states, aes(x = State, fill = Category)) + 
  geom_bar(position = "stack", bins = 15) + 
  labs(title="", x="State", y="Orders") + 
  theme_minimal()+ 
  theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45),
  axis.title.x = element_text(size =28),   axis.title.y=element_text(size=28),
  plot.title = element_text(size = 24, hjust = 0.5), 
  legend.text = element_text(size = 24), legend.title = element_text(size =28)) +
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))

################################################################################

top_cities <- tail(names(sort(table(df$City))),15)
df_top_cities <- df %>% filter(df$City %in% top_cities)
plot2 <- ggplot(df_top_cities, aes(x = City, fill = Category)) + 
  geom_bar(position = "stack", bins = 15) + 
  labs(title="", x = "City", y="Orders") + 
  theme_minimal() + 
  theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45), 
    axis.title.x = element_text(size = 28), axis.title.y = element_text(size=28),
    plot.title = element_text(size = 24, hjust = 0.5), 
    legend.text = element_text(size = 24), legend.title = element_text(size=28))+
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))


################################################################################

df2 <- df %>% group_by(Category)
plot3 <- ggplot(data = df2,aes(Category, fill=`Ship.Mode`))+
  geom_bar(position = 'stack')+
  labs(title='', x='Category', y='Orders') +
  theme_minimal() + 
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28),
    plot.title = element_text(size = 24, hjust = 0.5),  
    legend.text = element_text(size = 24), legend.title = element_text(size =28)) +
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))

################################################################################

df2 <- df %>% group_by(Segment)
plot4 <- ggplot(data = df2, aes(Segment, fill=`Ship.Mode`))+
  geom_bar(position = 'stack')+
  labs(title='', x='Segment', y='Orders') +
  theme_minimal() + 
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28),  
    plot.title = element_text(size = 24, hjust = 0.5),  
    legend.text = element_text(size = 24), legend.title = element_text(size=28))+
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))

################################################################################

df2 <-df %>% group_by(Region)
plot5 <- ggplot(data = df2, aes(Region, fill=`Segment`))+
  geom_bar(position = 'stack')+
  theme_minimal() + 
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 24, hjust = 0.5),  
    legend.text = element_text(size = 24), legend.title = element_text(size=28))+
  labs(title='', x='Region', y='Orders') +
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))

  
################################################################################

plot6 <- ggplot(df, aes(x=Region, y=Profit, fill = Segment)) + 
  geom_bar(stat = "identity") + labs(x="Region", y="Profit") + theme_minimal() + 
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 16, hjust = 0.5),
    legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))+
  scale_y_continuous(label = scales::dollar)+
  geom_hline(yintercept=0, linetype="dashed")


################################################################################

require(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, ncol = 2, nrow = 3)

In the next plots we represent the total number of orders, sales and profit generated in each month. During the last months of the year both sales and orders are higher than at the beginning. Up to three times more orders and sales are made in November and December as compared to January and February. Also, both positive and negative profits increase towards the last months of the year. The number of orders also increase throughout the years and the proportions of the shipping mode used remain the same.

# Purchases, sales and benefits for years and months. In what months do people usually buy more? 
# Are the sales generally increasing over time? Are there more sales on Christmas, Black Friday...?

plot1 <- ggplot(df, aes(x=Order.Month, fill = Category))+ 
  geom_bar(position = "stack", bins = 12)+ 
  labs(title="", x = "Month", y="Orders")+ 
  theme_minimal() +  
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 24, hjust = 0.5),  
    legend.text = element_text(size = 24), legend.title = element_text(size=28))+
  scale_x_discrete(limits= seq(1,12))+ 
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))

################################################################################

plot2 <- ggplot(df, aes(x=Order.Month, y=Sales, fill = Category))+ 
  geom_bar(stat = "identity", bins = 12) + labs(title="", x="Month", y="Sales") + 
  theme_minimal() + theme(axis.text=element_text(size=24), 
    axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
    plot.title = element_text(size = 16, hjust = 0.5), 
    legend.text = element_text(size = 24), legend.title = element_text(size=28)) + 
  scale_x_discrete(limits= seq(1,12)) + scale_y_continuous(label = scales::dollar) +
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))

################################################################################

plot3 <- ggplot(df, aes(x=Order.Month, y= Profit, fill = Category))+
  geom_bar(stat = "identity") + labs(title="", x="Month", y="Profit") + 
  theme_minimal() + theme(axis.text=element_text(size=24), 
    axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
    plot.title = element_text(size = 16, hjust = 0.5), 
    legend.text = element_text(size = 24), legend.title = element_text(size=28)) + 
  scale_x_discrete(limits= seq(1,12)) + scale_y_continuous(label = scales::dollar) +
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))+
  geom_hline(yintercept=0, linetype="dashed")

################################################################################
# Is there any preference on shipment methods over the years?

plot4 <- ggplot(df, aes(x=Order.Year, fill = Ship.Mode))+ 
  geom_bar(position = "stack",  bins = 4)+ 
  labs(title="", x = "Year", y="Orders")+ 
  theme_minimal()+ 
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 24, hjust = 0.5),  
    legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))

require(gridExtra)
grid.arrange(plot2, plot1, plot3, plot4, ncol = 2, nrow = 2)

Next off, we study which subcategories represent the highest profits (sum of positive and negative) and which the highest losses (sum of positive and negative). Copiers belong to the top-5 subcategories that generate positive profit, but it is one of the least popular subcategories as we saw at the beginning of this section. Tables and Machines are subcategories that need to be closely watched given their losses are very high.

df_subcat <- df %>% group_by(Sub.Category) %>% summarise(prof = sum(Profit))
df_subcat <- df_subcat[order(df_subcat$prof),]
top_subcat_neg <- head(df_subcat$Sub.Category, 5)
top_subcat_pos <- tail(df_subcat$Sub.Category, 5)
df_top_pos <- df %>% filter(Sub.Category %in% top_subcat_pos)

plot_top_pos <- ggplot(df_top_pos, aes(x = Sub.Category, y = Profit, fill = Segment)) + 
  geom_bar(stat="identity", position = "stack", bins = 5) + 
  labs(x = "Subcategory", y= "Profit") + 
  theme_minimal() + scale_y_continuous(label = scales::dollar) +
  theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45), 
    axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 38, hjust = 0.5), 
    legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
  ggtitle("Top subcategories with positive profit") +
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))+
  geom_hline(yintercept=0, linetype="dashed")
  
  
################################################################################

df_top_neg <- df %>% filter(Sub.Category %in% top_subcat_neg)

plot_top_neg <- ggplot(df_top_neg, aes(x = Sub.Category, y = Profit, fill = Segment)) + 
  geom_bar(stat = "identity", position = "stack", bins = 5) + 
  labs(x = "Subcategory", y= "Profit") + scale_y_continuous(label = scales::dollar) +
  theme_minimal() +
  theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45), 
    axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 38, hjust = 0.5),
    legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
  ggtitle("Top subcategories with negative profit") +
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))+
  geom_hline(yintercept=0, linetype="dashed")

################################################################################

require(gridExtra)
grid.arrange(plot_top_pos, plot_top_neg, ncol = 2, nrow = 1)

Total orders by state

In the map presented below we can see which of the states are associated with a highest number of orders. This information completes the one given earlier about the top-10 states regarding the number of orders. As expected, the most populated states are those where more orders are made: California, Texas, Florida, New York, Pennsylvania… Likewise, the most sparse are associated with a smaller number of purchases.

df_mapusa <- df %>% group_by(State)
df_mapusa <- df_mapusa %>% summarize(orders = n())
df_mapusa$region <- tolower(df_mapusa$State)
df_mapusa$value <- df_mapusa$orders
continental_us_states <- names(table(df_mapusa$region))

state_choropleth(df_mapusa, num_colors=9, zoom = continental_us_states) +
  scale_fill_brewer(palette="YlOrBr") +
  labs(fill = "Total orders")

Total profit by state

We can now check which states generate the most and the least profit and study if they are related to the ones where more purchases were made. Indeed, some states as California and New York and Washington generate both big profit and number of orders. Nevertheless, states as Florida, Texas, Pennsylvania and Colorado despite being associated with high number of purchases generate very low or negative profit. Also remarkable is that states like Nevada and Montana generate a decent amount of profits although they are states where not many purchases are made.

df_mapusa <- df %>% group_by(State) %>% summarize(profitss = sum(Profit))
df_mapusa$region <- tolower(df_mapusa$State)
df_mapusa$value <- df_mapusa$profitss
continental_us_states <- names(table(df_mapusa$region))

state_choropleth(df_mapusa, num_colors=9, zoom = continental_us_states) +
  scale_fill_brewer(palette="YlOrBr") +
  labs(fill = "Total profit")

Analysis of holiday and preholiday periods

Next we see how the store has done in the four years our data covers, by studying how sales and profits have behaved (left barplot below). During the first two years the number of sales remained constant, while the profit increased slightly, proving the store was able to improve their gross margin. Next two years both sales and profits grew to a greater extent.

We are also interested in analyzing what impact the holiday season in the USA (last two months of the year) has on the number of orders (right barplot below). First, there is an increase in the number of orders throughout the years for both periods. Furthermore, the number of orders during holiday season is almost half of the number for the rest of the year; maintaining this constant relationship over the years. Notice that we are comparing two months against 10. Therefore holiday season has a big impact on the number of orders that are made in the store.

Sales_per_year <- df %>%  group_by(Order.Year) %>% summarise(s = sum(Sales))

Profit_per_year <- df %>% group_by(Order.Year) %>% summarise(p = sum(Profit))

Sales <- Sales_per_year$s
Profit <- Profit_per_year$p
Years <- c("2014", "2015", "2016", "2017")

df_growthsales_profit <- data.frame(Sales, Profit, Years)
df_growthsales_profit2 <- melt(df_growthsales_profit, id.vars="Years")

plot_year <- ggplot(df_growthsales_profit2, aes(x=Years, y=value, fill=variable))+
  geom_bar(stat="identity", position="dodge") + 
  labs(title="Sales and Profit per year", x = "Years", y="USD") + 
  theme_minimal() +
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 38, hjust = 0.5),
    legend.text = element_text(size = 24), legend.title = element_text(size=28)) + 
  scale_fill_manual(values = c("#da6474", "#80b1d3", "#fdb462", "#7fc97f"))


###############################################################################

Order_preholiday <- df %>% filter(Order.Month %in% 1:10) %>% 
                    group_by(Order.Year) %>% summarise(ord = n())

Order_holiday <- df %>% filter(Order.Month %in% 11:12) %>% 
                 group_by(Order.Year) %>% summarise(ord1= n())  

Pre_holiday <- Order_preholiday$ord
Holiday <- Order_holiday$ord1
Years <- c("2014", "2015", "2016", "2017")

df_growthorder <- data.frame(Pre_holiday, Holiday, Years)
df_growthorder2 <- melt(df_growthorder, id.vars='Years')

plot_hol <- ggplot(df_growthorder2, aes(x=Years, y=value, fill=variable)) +
    geom_bar(stat='identity', position='dodge') + 
    labs(title="Orders Pre-Holiday vs Holiday", x = "Years", y="Orders") + 
    theme_minimal() +
    theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
      axis.title.y = element_text(size = 28), 
      plot.title = element_text(size = 38, hjust = 0.5),
      legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
    scale_fill_manual(values = c("#da6474", "#80b1d3", "#fdb462", "#7fc97f" ))

################################################################################

require(gridExtra)
grid.arrange(plot_year, plot_hol, ncol = 2, nrow = 1)

Similar to the analysis carried out before, we can check how sales and profits generated by each category behave over the years. The first thing that catches our attention is the fact that the profit has increased over the years for the Office Supply and Technology categories, but not for Furniture, for which it has fluctuated around the same value. However, sales generally show a tendency to increase over the years. Both Office Supplies and Technology have very similar behavior.

Technology_sales <- df %>% filter(str_detect(Category, 'Technology')) %>%
  group_by(Order.Year) %>% summarise(Sales_technology = sum(Sales))

 Technology_profit <- df %>% filter(str_detect(Category, "Technology")) %>%
   group_by(Order.Year) %>% summarise(Profit_technology = sum(Profit))

Years <- c("2014", "2015", "2016", "2017")
Sales <- Technology_sales$Sales_technology
Profit <- Technology_profit$Profit_technology

df_technology_sp <- data.frame(Sales, Profit, Years)
df_technology_sp2 <- melt(df_technology_sp, id.vars="Years")

plot_technology <- ggplot(df_technology_sp2, aes(x=Years, y=value, fill=variable)) +
  geom_bar(stat="identity", position="dodge") + 
  labs(title="Sales and Profit for Technology", x = "Years", y="USD") + 
  theme_minimal() +
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 38, hjust = 0.5),
    legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
  scale_fill_manual(values = c("#da6474", "#80b1d3", "#fdb462", "#7fc97f" ))

############################################################################

Furniture_sales <- df %>% filter(str_detect(Category, 'Furniture')) %>%
  group_by(Order.Year) %>% summarise(Sales_furniture = sum(Sales))

Furniture_profit <- df %>% filter(str_detect(Category, "Furniture")) %>%
  group_by(Order.Year) %>% summarise(Profit_furniture = sum(Profit))

Profit <- Furniture_profit$Profit_furniture
Sales <- Furniture_sales$Sales_furniture

df_furniture_sp <- data.frame(Sales, Profit, Years)
df_furniture_sp2 <- melt(df_furniture_sp, id.vars="Years")

plot_furniture <- ggplot(df_furniture_sp2, aes(x=Years, y=value, fill=variable)) +
  geom_bar(stat="identity", position="dodge") + 
  labs(title="Sales and Profit for Furniture", x = "Years", y="USD") + 
  theme_minimal() +
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 38, hjust = 0.5),
    legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
  scale_fill_manual(values = c("#da6474", "#80b1d3", "#fdb462", "#7fc97f" ))

#############################################################################

Office_supplies_sales <- df %>% filter(str_detect(Category, 'Office Supplies')) %>%
  group_by(Order.Year) %>% summarise(Sales_office_supplies = sum(Sales))

Office_supplies_profit <- df %>% filter(str_detect(Category, "Office Supplies")) %>%
   group_by(Order.Year) %>% summarise(Profit_office_supplies = sum(Profit))

Profit <- Office_supplies_profit$Profit_office_supplies
Sales <- Office_supplies_sales$Sales_office_supplies

df_office_supplies_sp <- data.frame(Sales, Profit, Years)
df_office_supplies_sp2 <- melt(df_office_supplies_sp, id.vars="Years")

plot_office <- ggplot(df_office_supplies_sp2, aes(x=Years, y=value, fill=variable)) +
   geom_bar(stat="identity", position="dodge") + 
   labs(title="Sales and Profit for Office Supplies", x = "Years", y="USD") +
   theme_minimal() +
   theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 38, hjust = 0.5),
    legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
   scale_fill_manual(values = c("#da6474", "#80b1d3", "#fdb462", "#7fc97f" ))

################################################################################

require(gridExtra)
grid.arrange(plot_technology, plot_furniture, plot_office, ncol = 3, nrow = 1)

[cell below]

Following we will analyze four plots that would help us understand how the profit changes for different categories, regions and segments as well as the relation between discount and quantity bought.

df1 <- df %>% group_by(Segment,Category) %>% summarise(n=median(Profit))
plot1 <- ggplot(df1, aes(Segment, Category, fill=n))+
  scale_fill_distiller( direction = 1)+ geom_tile(color='white')+
  geom_text(aes(label=paste(round(n,0),'$')), color = 'black', size=10)+
  labs(title='', fill='Median Profit') + theme_minimal() +
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 16, hjust = 0.5),
    legend.text = element_text(size = 24), legend.title = element_text(size=28))

################################################################################

df2 <- df %>% group_by(Region,Category) %>% summarise(n=median(Profit))
plot2 <- ggplot(df2, aes(Region, Category, fill=n))+
  scale_fill_distiller( direction = 1)+ geom_tile(color='white')+
  geom_text(aes(label=paste(round(n,0),'$')), color = 'black', size=10)+
  labs(title='', fill='Median Profit') + theme_minimal() +
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 16, hjust = 0.5),
    legend.text = element_text(size = 24), legend.title = element_text(size=28))

################################################################################

# df3 <- df %>% mutate(Discount = cut_width(Discount,0.2,boundary=0))
df3 <- df %>% mutate(Discount = cut(Discount, c(0, 0.1, 0.2, 0.8), include.lowest = TRUE))
df3 <- df3 %>% group_by(Discount, Sub.Category) %>% summarise(n = mean(Quantity))

plot3 <- ggplot(df3, aes(Discount, Sub.Category, fill= n))+
  scale_fill_distiller( direction = 1) + geom_tile(color='white')+
  geom_text(aes(label=paste(round(n,2))), color = 'black', size=10)+
  labs(title='', x = "Discount", fill='Avg. Quantity') + theme_minimal() + 
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 16, hjust = 0.5),
    legend.text = element_text(size = 24), legend.title = element_text(size=28))


################################################################################
df4 <- df %>% mutate(Discount = cut_width(Discount,0.15,boundary=0))
df4 <- df4 %>% group_by(Discount)

plot4 <- ggplot(df4, aes(x=Discount, y= Profit, fill = Category))+
  geom_bar(stat = "identity", position = "stack") + labs(title="", x="Discount", y="Profit") + 
  theme_minimal()+
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 16, hjust = 0.5),
    legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
  scale_y_continuous(label = scales::dollar) +
  scale_fill_manual(values = c("#da6474", "#80b1d3", "#fdb462", "#7fc97f" )) +
  geom_hline(yintercept=0, linetype="dashed")

################################################################################

require(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, ncol = 2, nrow = 2)

Top left
The median profit generated by each segment oscillates around the same value for each category, and is much bigger for Technology than for Office Supplies and Furniture.

Top right
West is the region where the median profit is highest, although in South both Technology and Furniture present the same median values. For the Central region, Furniture has a negative median profit and Office Supplies shows the smallest positive median profit across all regions and categories.

Bottom left
For some subcategories there is no difference in the quantity bought whether there is a discount or not; clients buy the same quantity on average. For other subcategories, like Binders, Bookcases and Fasteners the quantity bought tends to increase with an increasing discount.

Bottom right
Profit changes when a discount is applied and we can observe that from 30% discount the profit starts turning negative.

Top 10 clients analysis

A study of the best and worst clients for the store is important to understand how to maximize the profits and take care of the clients, but also to know how to minimize the losses. The first barplot represents the top-10 clients according to purchases over the four years. The amount of orders they made is around 30, being the customer with client ID WB-21850 the one that more orders has made. These clients make most of their purchases on the office supplies category. Also, 6 out of 10 of these clients are Consumer type of client.

Next step is to analyze the top-10 clients that generate more profit and more losses (middle left and right barplots respectively). The “most profitable” clients generate around 5000 USD each, being TC-20980 (Corporate) the best client according to profit generation. Most of their purchases are on the technology category for some clients and office supplies for others. Six out of ten of these clients belong to the Consumer segment.

Clients that generate the largest negative profit cause losses that range from 2000 to 6500 USD approximately. By far, the client that causes the more losses is CS-12505 (Consumer). As with the most profitable clients, the losses are usually related to the technology and office supplies categories. Only two of these clients are from the Home Office segment, four from the Consumer segment and four from the Corporate segment.

We can observe that the top-10 clients according to orders made, generate in general less than 5 times the profit that the most “profitable” clients generate. In fact, some of them generate negative profit. This is important to check, because it’s not profitable to have a client that makes many purchases but at the end causes losses.

Finally, we plot the top-10 clients that more money spend. It is remarkable to see that none of the clients that generate the more profit is in the list of clients that more money spend. There is one client in this list however, that is also in the top-10 clients that more purchases have made. This client is the one who more money has spent over the four years and he/she belongs to the Corporate segment.

top_buyers <- tail(names(sort(table(df$Customer.ID))), 10)
df_top_buyers <- df %>% filter(df$Customer.ID %in% top_buyers)

segment_purch <- df_top_buyers %>% group_by(Customer.ID, Segment) %>% 
  summarize(Sales = sum(Sales), Profit = sum(Profit), Number.Orders = n())
#kable(segment_purch[1:10,])
segment_purch[1:10,] %>%
  kable() %>%
  kable_styling(full_width = F)

Customer.ID	Segment	Sales	Profit	Number.Orders
CK-12205	Consumer	3154.855	141.2831	32
EH-13765	Corporate	10310.880	1393.5154	32
EP-13915	Consumer	5478.061	144.9578	31
JD-15895	Corporate	7610.864	1050.2668	32
JL-15835	Consumer	9799.923	228.9070	34
MA-17560	Home Office	4299.161	1240.2266	34
PP-18955	Home Office	7252.608	1495.0854	34
SV-20365	Consumer	11470.950	1199.4242	32
WB-21850	Consumer	6160.102	714.3311	37
ZC-21910	Consumer	8025.707	-1032.1490	31

plot_top <- ggplot(df_top_buyers, aes(x = Customer.ID, fill = Category)) + 
  geom_bar(position = "stack",bins = 10) + 
  labs(x = "Clients", y="Orders") +
  theme_minimal() + 
  theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45), 
    axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),     plot.title = element_text(size = 38, hjust = 0.5),
    legend.text = element_text(size = 24), legend.title = element_text(size=28)) + 
  ggtitle("Top 10 clients according to purchases") +
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f")) 
  

################################################################################

plot_top_profit <- ggplot(df_top_buyers, aes(x = Customer.ID, y = Profit, fill = Category)) + 
  geom_bar(stat="identity", position = "stack", bins = 10) + 
  labs(x = "Clients", y="Profit") + 
  theme_minimal() + 
  theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45), 
    axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
    plot.title = element_text(size = 38, hjust = 0.5),
    legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
  ggtitle("Top 10 clients according to purchases") +
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))+
  scale_y_continuous(label = scales::dollar)+
  geom_hline(yintercept=0, linetype="dashed")

################################################################################

df_profit <- df %>% group_by(Customer.ID) %>% summarise(prof = sum(Profit))
df_profit <- df_profit[order(df_profit$prof),]
top_profit_neg <- head(df_profit$Customer.ID, 10)
top_profit_pos <- tail(df_profit$Customer.ID, 10)

df_top_pos <- df %>% filter(Customer.ID %in% top_profit_pos)

segment_prof_pos <- df_top_pos %>% group_by(Customer.ID, Segment) %>% 
  summarize(Sales = sum(Sales), Profit = sum(Profit), Number.Orders = n())
segment_prof_pos[1:10,] %>%
  kable() %>%
  kable_styling(full_width = F)

Customer.ID	Segment	Sales	Profit	Number.Orders
AB-10105	Consumer	14473.571	5444.806	20
AR-10540	Consumer	6608.448	2884.621	9
CM-12385	Consumer	8954.020	3899.890	10
DR-12940	Home Office	8350.868	2869.076	13
HL-15040	Consumer	12873.298	5622.429	11
KD-16495	Corporate	8181.256	3038.625	28
RB-19360	Consumer	15117.339	6976.096	18
SC-20095	Consumer	14142.334	5757.412	22
TA-21385	Home Office	14595.620	4703.788	10
TC-20980	Corporate	19052.218	8981.324	12

plot_top_pos <- ggplot(df_top_pos, aes(x = Customer.ID, y = Profit, fill = Category)) + 
  geom_bar(stat="identity", position = "stack", bins = 10) + 
  labs(x = "Clients", y="Profit") + 
  theme_minimal() +
  theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45), 
    axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
    plot.title = element_text(size = 38, hjust = 0.5),
    legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
  ggtitle("Top 10 clients with positive profit") +
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))+
  scale_y_continuous(label = scales::dollar)+
  geom_hline(yintercept=0, linetype="dashed")

################################################################################

df_top_neg <- df %>% filter(df$Customer.ID %in% top_profit_neg)

segment_prof_neg <- df_top_neg %>% group_by(Customer.ID, Segment) %>% 
  summarize(Sales = sum(Sales), Profit = sum(Profit), Number.Orders = n())
segment_prof_neg[1:10,] %>%
  kable() %>%
  kable_styling(full_width = F)

Customer.ID	Segment	Sales	Profit	Number.Orders
CP-12340	Corporate	5888.275	-1850.303	15
CS-12505	Consumer	5690.055	-6626.390	9
GT-14635	Corporate	9351.212	-4108.659	6
HG-14965	Corporate	3247.642	-2797.963	17
LF-17185	Consumer	3930.509	-3583.977	16
NC-18415	Consumer	2218.990	-2204.807	14
NF-18385	Consumer	8322.826	-1695.971	14
SB-20290	Corporate	8057.891	-2082.745	17
SM-20320	Home Office	25043.050	-1980.739	15
SR-20425	Home Office	3233.481	-3333.914	9

plot_top_neg <- ggplot(df_top_neg, aes(x = Customer.ID, y = Profit, fill = Category)) + 
  geom_bar(stat = "identity", position = "stack", bins = 10) + 
  labs(x = "Clients", y="Profit") + 
  theme_minimal() +
  theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45), 
    axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
    plot.title = element_text(size = 38, hjust = 0.5),
    legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
  ggtitle("Top 10 clients with negative profit") +
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))+
  scale_y_continuous(label = scales::dollar)+
  geom_hline(yintercept=0, linetype="dashed")

################################################################################

df_sales <- df %>% group_by(Customer.ID) %>% summarise(sales = sum(Sales))
df_sales <- df_profit[order(df_sales$sales),]
top_sales <- tail(df_sales$Customer.ID, 10)

df_top_sales <- df %>% filter(Customer.ID %in% top_sales)

segment_sales <- df_top_sales %>% group_by(Customer.ID, Segment) %>% 
  summarize(Sales = sum(Sales), Profit = sum(Profit), Number.Orders = n())
segment_sales[1:10,] %>%
  kable() %>%
  kable_styling(full_width = F)

Customer.ID	Segment	Sales	Profit	Number.Orders
EH-13765	Corporate	10310.880	1393.5154	32
HG-15025	Consumer	3690.284	804.6113	11
JF-15565	Consumer	4198.332	1073.2088	16
KM-16225	Corporate	2260.958	635.1057	19
MR-17545	Home Office	639.178	162.9363	6
RH-19510	Home Office	6979.180	1289.4535	12
SB-20290	Corporate	8057.891	-2082.7451	17
SM-20005	Consumer	244.494	-26.5927	6
TB-21355	Corporate	834.328	268.9670	13
TS-21205	Corporate	5371.090	862.8877	19

plot_sales <- ggplot(df_top_sales, aes(x = Customer.ID, y = Sales, fill = Category)) + 
  geom_bar(stat="identity", position = "stack", bins = 10) + 
  labs(x = "Clients", y="Sales") + 
  theme_minimal() +
  theme(axis.text=element_text(size=24), axis.text.x = element_text(angle=45), 
    axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
    plot.title = element_text(size = 38, hjust = 0.5),
    legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
  ggtitle("Top 10 clients according to sales") +
  scale_fill_manual(values = c("#da6474","#80b1d3", "#fdb462", "#7fc97f"))+
  scale_y_continuous(label = scales::dollar)

################################################################################

require(gridExtra)
grid.arrange(plot_top, plot_top_profit, plot_top_pos, plot_top_neg, plot_sales, ncol = 2, nrow = 3)

Analysis of continuous variables

In this section we intend to analyze how the different continuous variables behave with respect to the profit. We will see the distribution of Profit with respect to Sales, Time, Month and Quantity. Also we will plot Quantity vs Discount. Therefore, this section will be heavy on scatter plots, but their analysis is fundamental to find important relations in the continuous variables.

But first, let’s start by looking at how profit and sales change over time. The two plots on the left of the next figure shows how the profit and sales tend to increase over time. The two plots on the right show a moving average of the last three months. Now, it’s possible to see much better how there are cycles for sales and profit. They start low at the beginning of each year and keep increasing until November and December, after which they drop again.

by_yearmonth <- df %>% group_by(Order.Year_month)
sales_year_tot <- by_yearmonth %>% summarize(Sales_year_tot=sum(Sales))

plot1 <- ggplot(sales_year_tot, aes(x = Order.Year_month, y = Sales_year_tot)) +
  geom_line(size = 1.5, color = "lightgrey") + geom_smooth() + 
  geom_point(size = 6, color = "#80b1d3") + theme_minimal() +
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 16, hjust = 0.5)) +
  geom_point(size = 5, color = "#80b1d3") + labs(y = "Total sales", x = "Time")+
  scale_y_continuous(label = scales::dollar)

################################################################################


df_fullyear <- df %>% group_by(Order.Year_month) %>%
  summarise(m2 = mean(Sales), mx2 = max(Sales), mn2 = min(Sales) ,su2 = sum(Sales))

df_fullyear$Moving_average <- rollapply(df_fullyear$su2, 3, mean, align = "right", fill = NA)

df_fullyear[1,"Moving_average"] = df_fullyear[1,"su2"]
df_fullyear[2,"Moving_average"] = sum(df_fullyear[1,"su2"], df_fullyear[2,"su2"])/2 
#since the moving average is with the past 3 months, the first two row are NA.
# the above lines find a workaround

plot2 <- ggplot(df_fullyear, aes(x = Order.Year_month, y = Moving_average)) +
  geom_line(size = 1.5, color = "lightgrey") + geom_smooth() +
  geom_point(size = 6, color = "#80b1d3") + theme_minimal() +
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 16, hjust = 0.5)) +
  labs(y = "Avg. sales of last 3 months", x = "Time") + 
  scale_y_continuous(label = scales::dollar)

################################################################################

by_yearmonth <- df %>% group_by(Order.Year_month)
profit_year_tot <- by_yearmonth %>% summarize(Profit_year_tot=sum(Profit))

plot3 <- ggplot(profit_year_tot, aes(x = Order.Year_month, y = Profit_year_tot)) +
  geom_line(size = 1.5, color = "lightgrey") + geom_smooth() +
  geom_point(size = 6, color = "#80b1d3") + theme_minimal() +
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 16, hjust = 0.5)) +
  labs(y = "Total profit", x = "Time")+
  scale_y_continuous(label = scales::dollar)

################################################################################


df_fullyear <- df %>% group_by(Order.Year_month) %>%
  summarise(m2 = mean(Profit), mx2 = max(Profit), mn2 = min(Profit) ,su2 = sum(Profit))

df_fullyear$Moving_average <- rollapply(df_fullyear$su2, 3, mean, align = "right", fill = NA)

df_fullyear[1,"Moving_average"] = df_fullyear[1,"su2"]
df_fullyear[2,"Moving_average"] = sum(df_fullyear[1,"su2"], df_fullyear[2,"su2"])/2 
#since the moving average is with the past 3 months, the first two row are NA.
# the above lines find a workaround

plot4 <- ggplot(df_fullyear, aes(x = Order.Year_month, y = Moving_average)) +
  geom_line(size = 1.5, color = "lightgrey") + geom_smooth() +
  geom_point(size = 6, color = "#80b1d3") + theme_minimal() + 
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
    axis.title.y = element_text(size = 28), 
    plot.title = element_text(size = 16, hjust = 0.5)) +
  labs(y = "Avg. profit of last 3 months", x = "Time")+
  scale_y_continuous(label = scales::dollar)

################################################################################

require(gridExtra)
grid.arrange(plot1, plot2, plot3, plot4, ncol = 2, nrow = 2)

Profit vs Sales

The Profit vs Sale distribution has a fan shape, indicating that more sales does not necessarily implies more profit. Sometimes the profit tends to decrease. This behavior is observed for the three categories.

ggplot(df, aes(x = Sales,  y = Profit, color=Category)) +
  geom_point(size = 2, alpha=.8) +
  scale_x_continuous(label = scales::dollar) +
  scale_y_continuous(label = scales::dollar) +
  labs(x = "Sales", y = "Profit")+
  theme_minimal() + theme(axis.text=element_text(size=6), 
  axis.title.x = element_text(size = 10), axis.title.y = element_text(size = 10),
  legend.text = element_text(size = 6), legend.title = element_text(size=10))

If we now analyze the same distribution but independently for each sub category we will see that some still preserve the fan shape, but others such as Art, Copiers, Envelopes, Paper and Labels show an always increasing profit when the sales increase. Furthermore we can see that, when classifying by discount it is possible to discern for which discount ranges the profit decreases when the sales increase. Based on this, a number of suggestions we can make to the store are:

Avoid discounts over 40% for Furnishings.
Avoid discounts over 30% for Phones.
Avoid discounts over 60% for Binders.
Maximum discount recommended for Supplies, Tables and Bookcases is 10%.

# Cut width of Discount
df_disc <- df %>% mutate(Discount=cut_width(Discount,0.10,boundary=0))

df1 <- df_disc 
plot1 <- ggplot(df1, aes(x = Sales,  y = Profit, color=Discount)) +
  geom_point(size = 5, alpha=.8) +
  facet_wrap(~Sub.Category, scales = "free") +
  scale_x_continuous(label = scales::dollar) +
  scale_y_continuous(label = scales::dollar) +
  labs(x = "Sales", y = "Profit") + theme_minimal() +
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28),
  axis.text.x = element_text(angle=45),
  axis.title.y = element_text(size = 28), strip.text = element_text(size = 28),
  legend.text = element_text(size = 24), legend.title = element_text(size=28))

require(gridExtra)
grid.arrange(plot1, ncol = 1, nrow = 1)

Profit vs Time

The Profit vs Time analysis shows there is in general a positive relation for both orders and profit. With orders we can also see there is a cycle. For the first two months of the year the orders are at their lowest level. Then there is an increase in the number of orders, that is kept for the following seven months. Finally, it increases again for the last three months of the year, being November the month where the peak is reached every year. Profit, however does not show this cycle so clearly.

# Profit over time
df_plot <- df %>% group_by(Order.Year_month)
df_plot <- df_plot %>% summarize(profitt = sum(Profit))

plot1 <- ggplot(df_plot, aes(x = Order.Year_month, y = profitt)) +
  geom_point(color="#80b1d3", size = 6, alpha=.8) +
  scale_y_continuous(label = scales::dollar) +
  labs(x = "Time", y = "Total profit")+
  geom_smooth(method = "lm") + theme_minimal() +
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28), 
  axis.title.y = element_text(size = 28), plot.title = element_text(size = 38, 
  hjust = 0.5), legend.text = element_text(size = 24), 
  legend.title = element_text(size=28)) + stat_cor(method = "pearson", size = 10)

######################################################################

# Orders over time
df_plot <- df %>% group_by(Order.Year_month)
df_plot <- df_plot %>% summarize(orders = n())

plot2 <- ggplot(df_plot, aes(x = Order.Year_month, y = orders)) +
  geom_point(color="#80b1d3", size = 6, alpha=.8) +
  scale_y_continuous(label = scales::dollar) +
  labs(x = "Time", y = "Total orders")+
  geom_smooth(method = "lm") + theme_minimal() +
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28), 
  axis.title.y = element_text(size = 28), plot.title = element_text(size = 38, 
  hjust = 0.5), legend.text = element_text(size = 24), 
  legend.title = element_text(size=28)) + stat_cor(method = "pearson", size = 10)

#######################################################################

require(gridExtra)
grid.arrange(plot1, plot2, ncol = 2, nrow = 1)

If we take this analysis to each subcategory, it can be seen that Supplies, Machines and Tables show in general a decrease in the profit. In some of them it is possible to sense the cycle that we talked about before, such is the case of Appliances, Fasteners and Phones.

#BY SUBCATEGORIES
df_plot_SC <- df %>% group_by(Sub.Category, Order.Year_month)
df_plot_SC <- df_plot_SC %>% summarize(profit = sum(Profit))

ggplot(df_plot_SC, aes(x = Order.Year_month,  y = profit)) +
  geom_point(color="#80b1d3", size = 5, alpha=.8) +
  facet_wrap(~Sub.Category, scales = "free") +
  scale_y_continuous(label = scales::dollar) +
  #geom_hline(yintercept=0, color="#da6474", size=0.5)+
  labs(x = "Time", y = "Profit")+
  geom_smooth(method = "lm")+
  theme_minimal()+ theme(axis.text=element_text(size=24), 
  axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
  strip.text = element_text(size = 28), 
  legend.text = element_text(size = 24), legend.title = element_text(size=28)) +
  stat_cor(method = "pearson", size = 7)

#BY STATE
df_plot_SC <- df %>% group_by(State, Order.Year_month)
df_plot_SC <- df_plot_SC %>% summarize(profit = sum(Profit))

ggplot(df_plot_SC, aes(x = Order.Year_month,  y = profit)) +
  geom_point(color="#80b1d3", size = 5, alpha=.8) +
  facet_wrap(~State, scales = "free") +
  scale_y_continuous(label = scales::dollar) +
  #geom_hline(yintercept=0, color="#da6474", size=0.5)+
  labs(x = "Time", y = "Profit")+
  geom_smooth(method = "lm")+
  theme_minimal() + theme(axis.text=element_text(size=24), 
  axis.title.x = element_text(size = 28), axis.title.y = element_text(size = 28),
  strip.text = element_text(size = 28), axis.text.x = element_text(angle=45),
  legend.text = element_text(size = 24),  legend.title = element_text(size=28))

Grouping by state we can observe that some of the states don’t have enough data to draw conclusions from (DC, N. Dakota, W. Virginia), others show a decreasing behavior that should be payed attention to (Arkansas, Iowa, Minnesota) and others show a good positive increase of the profit over time (California, New York).

Profit vs Month

This analysis would help us confirm the cycle observed before. First months of the year come along with lower profits, which increase towards the end of the year. This is indeed what is observed for all subcategories except for Machines and possibly Supplies and Tables. Most of the states show this behavior as well, except for Ordegon, Tennessee and Georgia. Knowing this could potentially help the store address better the clients of each state, to know when to launch campaigns of offers and discounts for instance.

#BY SUBCATEGORIES
df_plot_SC <- df %>% group_by(Sub.Category, Order.Month)
df_plot_SC <- df_plot_SC %>% summarize(profit = sum(Profit))

ggplot(df_plot_SC, aes(x = Order.Month,  y = profit)) +
  geom_point(color="#80b1d3", size = 5, alpha=.8) +
  facet_wrap(~Sub.Category, scale = "free") +
  scale_y_continuous(label = scales::dollar) +
  labs(x = "Month", y = "Profit") +
  geom_smooth(method = "lm") + theme_minimal() +
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28), 
  axis.title.y = element_text(size = 28), strip.text = element_text(size = 28))+
  stat_cor(method = "pearson", size = 7)+
  scale_x_discrete(limits= seq(1,12), breaks = function(x){x[c(TRUE, FALSE)]})

#BY STATE
df_plot_SC <- df %>% group_by(State, Order.Month)
df_plot_SC <- df_plot_SC %>% summarize(profit = sum(Profit))

ggplot(df_plot_SC, aes(x = Order.Month,  y = profit)) +
  geom_point(color="#80b1d3", size = 5, alpha=.8) +
  facet_wrap(~State, scale = "free") +
  scale_y_continuous(label = scales::dollar) +
  labs(x = "Month", y = "Profit") +
  geom_smooth(method = "lm") + theme_minimal() +
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28), 
  axis.title.y = element_text(size = 28), strip.text = element_text(size = 28))+
  scale_x_discrete(limits= seq(1,12), breaks = function(x){x[c(TRUE, FALSE)]})

Profit vs Quantity

This is also an interesting relation to see. For most of the sub categories an increase in the amount of items (of the same product) bought leads to a decrease in profit. One of the reasons could be that buyers take advantage of discounts to buy in bigger quantities, hence the profit would be less when compared to single item orders when no discount was present. Or simply the store offers more discounts when the quantity bought is bigger. Next section will try to answer to this.

#BY SUBCATEGORIES
df_plot_SC <- df %>% group_by(Sub.Category, Quantity)
df_plot_SC <- df_plot_SC %>% summarize(profit = sum(Profit))

ggplot(df_plot_SC, aes(x = Quantity,  y = profit)) +
  geom_point(color="#80b1d3", size = 5, alpha=.8) +
  facet_wrap(~Sub.Category, scale = "free") +
  scale_y_continuous(label = scales::dollar) +
  #geom_hline(yintercept=0, color="#da6474", size=0.5)+
  labs(x = "Quantity", y = "Profit") +
  geom_smooth(method = "lm") + theme_minimal() + 
    theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28), 
  axis.title.y = element_text(size = 28), strip.text = element_text(size = 28)) +  stat_cor(method = "pearson", size = 7)

Quantity vs Discount

There is not enough data to draw conclusions properly, since the number of discounts is rather small. However, for Binders and Bookcases bigger discounts imply bigger quantities. For Tables the opposite behavior seems to be the case.

#BY SUBCATEGORIES
df_plot_SC <- df %>% group_by(Sub.Category, Discount)
df_plot_SC <- df_plot_SC %>% summarize(quantity = mean(Quantity))

ggplot(df_plot_SC, aes(x = Discount,  y = quantity)) +
  geom_point(color="#80b1d3", size = 5, alpha=.8) +
  facet_wrap(~Sub.Category, scale = "free") +
  labs(x = "Discount", y = "Quantity") +
  geom_smooth(method = "lm") + theme_minimal() +
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28), 
  axis.title.y = element_text(size = 28), strip.text = element_text(size = 28)) +
  stat_cor(method = "pearson", size = 7)

Finally we can see in general lines that two types of customers benefit from discounts when they buy more items or they buy bigger quantities because they are in discount. Those are Corporate and Home Office.

 #BY SEGMENT
df_plot_SC <- df %>% group_by(Segment, Discount)
df_plot_SC <- df_plot_SC %>% summarize(quantity = mean(Quantity))

ggplot(df_plot_SC, aes(x = Discount,  y = quantity)) +
  geom_point(color="#80b1d3", size = 6, alpha=.8) +
  facet_wrap(~Segment, scale = "free") +
  labs(x = "Discount", y = "Quantity") +
  geom_smooth(method = "lm") + theme_minimal() +
  theme(axis.text=element_text(size=24), axis.title.x = element_text(size = 28), 
  axis.title.y = element_text(size = 28), strip.text = element_text(size = 28)) + stat_cor(method = "pearson", size = 7)

Correlation analysis

The correlation matrix can be seen below. The highest correlation is given between Gross Margin and Discount and it is a negative correlation. It makes sense that the more discounts the store offers the lower the gross margin is since it is proportional to profit and inversely proportional to sales. The biggest the discount, the lower the profit. Same analysis explains the negative correlation between Discount and Profit. Sales and Profit have a 0.48 positive correlation, indicating that profit tends to increase with sales in general.

Also noticeable are the positive correlations between Profit and Gross Margin and between Quantity and Sales. The more items are bought of a certain product, the more sales it generates.

dat <- df[,c("Sales","Profit","Gross.Margin" , "Discount", "Quantity", "Processing.Time", "Order.Year")]
colnames(dat) <- c("Sales", "Profit","Gross.Margin", "Discount", "Quantity", "Processing.Time", "Year")

df_corr <- dplyr::select_if(dat, is.numeric)
corr_coef <- cor(df_corr, use="complete.obs")
round(corr_coef,2)

##                 Sales Profit Gross.Margin Discount Quantity Processing.Time
## Sales            1.00   0.48         0.00    -0.03     0.20           -0.01
## Profit           0.48   1.00         0.22    -0.22     0.07            0.00
## Gross.Margin     0.00   0.22         1.00    -0.86    -0.01           -0.01
## Discount        -0.03  -0.22        -0.86     1.00     0.01            0.00
## Quantity         0.20   0.07        -0.01     0.01     1.00            0.02
## Processing.Time -0.01   0.00        -0.01     0.00     0.02            1.00
## Year            -0.01   0.00         0.00     0.00    -0.01           -0.02
##                  Year
## Sales           -0.01
## Profit           0.00
## Gross.Margin     0.00
## Discount         0.00
## Quantity        -0.01
## Processing.Time -0.02
## Year             1.00

ggcorrplot(corr_coef, hc.order = TRUE, type = "lower", lab = TRUE)

We also use another library to output the 8 biggest correlations with a p-value < 0.05. The results agree with what was explained before.

dat <- df[, c("Sales","Profit","Gross.Margin" , "Discount", "Quantity", "Processing.Time", "Order.Year")]
colnames(dat) <- c("Sales", "Profit","Gross.Margin", "Disc.", "Quantity", "Processing.Time", "Year")

corr_cross(dat, max_pvalue = 0.05, top = 8, color=c("#da6474", "#80b1d3"))

Linear Regression

In this section we will address a simple linear regression problem before we get into a deeper analysis with multiple linear regression.

The original idea was to do linear regression on Profit vs Sales in general. However they are not very correlated and the shape of the distribution clearly indicates there is no linear relation. Hence, based on the EDA we realized that there seems to be a good relation between these to features for sub categories such as Envelopes, Copiers, Paper and Art. We have chosen Envelopes to model its behavior with linear regression.

First, we do a train-test split with a train set size of 80%.

df_envelop <- df %>% filter(`Sub.Category`=='Labels')

row.number <- sample(1:nrow(df_envelop), 0.8*nrow(df_envelop))
train = df_envelop[row.number,]
test = df_envelop[-row.number,]
dim(train)

## [1] 291  22

dim(test)

## [1] 73 22

Proceed to fit the model and print summary information.

lm.fit <- lm(Profit ~ Sales, data = train)
summary(lm.fit)

## 
## Call:
## lm(formula = Profit ~ Sales, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.483  -0.424   0.483   0.870  29.581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.365057   0.295019  -1.237    0.217    
## Sales        0.452852   0.003638 124.465   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.604 on 289 degrees of freedom
## Multiple R-squared:  0.9817, Adjusted R-squared:  0.9816 
## F-statistic: 1.549e+04 on 1 and 289 DF,  p-value: < 2.2e-16

confint(lm.fit)

##                  2.5 %    97.5 %
## (Intercept) -0.9457144 0.2156012
## Sales        0.4456910 0.4600133

The \(R^2\) equals 0.96, which means the independent variable is able to explain most of the variance in the dependent variable. Also, the F-statistic is quite bigger than one and its p-value very close to zero, meaning there is indeed evidence of a relation between both variables. The estimate of 0.46 gives us the expected change in Profit due to a unit change in Sales.

A plot of the fit can be seen next:

ggplot(train, aes(x = Sales, y = Profit)) +
  geom_point(color="#80b1d3", size = 2, alpha=.9) +
  scale_y_continuous(label = scales::dollar) +
  scale_x_continuous(label = scales::dollar) +
  labs(x = "Sales", y = "Profit") + theme_minimal() +
  theme(axis.text=element_text(size=8), axis.title.x = element_text(size = 12), 
  axis.title.y = element_text(size = 12)) + geom_abline(slope = coef(lm.fit)[["Sales"]], 
  intercept = coef(lm.fit)[["(Intercept)"]])

Let’s now have a look at the diagnostic plots.

par(mfrow = c(2, 2))
plot(lm.fit)

Fitted vs Residual
Residuals plots are useful to tell whether linearity and homoskedasticity hold. For linearity we expect the red line to be close to zero, while for the homoskedasticity the spread of the residuals should be approximately the same across the x axis.From the above plot, we can see that the red trend starts very close to zero but then diverges away from it. We are able to see some pattern in the data: for an increase of the fitted values some residuals increase while tend to decrease. So neither linearity nor homoskedasticity are really met. There are three potential outliers.

Normal Q-Q
In this plot we expect to see all points close to the dotted line. If this was not the case, the residuals and therefore the errors aren’t Gaussian. Thus for small sample sizes, it can’t be assumed that the estimator \(\hat{\beta}\) is Gaussian either, meaning the standard confidence intervals and significance tests are invalid. Our plot shows most points lying on the dotted line, however the rightmost and leftmost points are already far from it, suggesting strong right and left skewness.

Scale-Location
This kind of plot is ideal to check for homoskedasticity. We expect the red line to be approximately horizontal, and since this is not the case for our model, it means that the average magnitude of the standardized residuals is changing considerably as a function of the fitted values. The spread around the red line should not vary with the fitted values and again, this is happening in our plot. Therefore, we have another evidence for heteroskedasticity.

Residuals vs Leverage
Leverage measures how sensitive a fitted \(\hat{y_{i}}\) is to a sense in the true response \(y_{i}\). When we look at the Residuals vs Leverage plot we expect that the spread of standardized residuals won’t change as a function of leverage. In our plot however we see it increases. This is another proof of heteroskedasticity.Finally, points beyond the Cook’s distance have a large leverage and deleting them would have a big influence.

By the previous analysis, we can conclude that our data is not behaving linearly as we expected and it does not have homoskedasticity. We could tackle these problems by doing some transformations on the data, like taking the logarithm or the square root of the response, but in this case we have both positive and negative values, which would make the problem unnecessarily difficult and actually introduce more non-linearity and heteroskedasticity. We can’t assume however that the standard errors of the intercept and Sales are valid, given the aforementioned assumptions are not met.

To assert how the model generalizes we predict the values of the test set and calculate the RMSE. RMSE explains on an average how much of the predicted values will differ from the actual values. For a more meaningful comparison we calculated the normalized RMSE.

pred <- predict(lm.fit, newdata = test)
rmse <- sqrt(sum((pred - test$Profit)^2)/length(test$Profit))
normalized_rmse <- 100* rmse/(max(train$Profit)-min(train$Profit))
cat(paste("RMSE =", rmse))

## RMSE = 3.95568306022249

cat(paste("\nNormalized RMSE =", normalized_rmse, "%"))

## 
## Normalized RMSE = 1.02826046817739 %

The shown normalized RMSE is low enough to consider the fit as good.

Multiple linear regression

In this section we will train a regression model using multiple features. As part of the process we will select the most important features for the model using backward selection and also check how well it performs on a test set.

First, we separate our data in train and test sets, with a ratio of 80%-20% respectively.

set.seed(1)
row.number <- sample(1:nrow(df), 0.8*nrow(df))
train = df[row.number,]
test = df[-row.number,]
dim(train)

## [1] 7995   22

dim(test)

## [1] 1999   22

Next we carry out the construction of a first model, having into account 9 features that could potentially have a good impact on training and using 5-fold cross validation to enhance the performance.

lm_mult = train(
  form = Profit ~ Sales + Quantity + Segment + Discount + Processing.Time + Ship.Mode + Region + Sub.Category + Gross.Margin,
  data = train,
  trControl = trainControl(method = "cv", number = 5),
  method = "lm"
)

lm_mult

## Linear Regression 
## 
## 7995 samples
##    9 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 6397, 6395, 6395, 6396, 6397 
## Resampling results:
## 
##   RMSE      Rsquared   MAE    
##   203.5605  0.3422689  58.7417
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

summary(lm_mult)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7006.0   -24.3    -1.0    28.7  4567.9 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                4.882e+01  1.285e+01   3.799 0.000146 ***
## Sales                      1.897e-01  3.838e-03  49.437  < 2e-16 ***
## Quantity                  -2.643e+00  1.002e+00  -2.639 0.008338 ** 
## SegmentCorporate           3.606e+00  4.980e+00   0.724 0.469041    
## `SegmentHome Office`      -4.025e-01  6.002e+00  -0.067 0.946532    
## Discount                  -1.608e+02  2.552e+01  -6.301 3.12e-10 ***
## Processing.Time           -8.892e-01  2.208e+00  -0.403 0.687218    
## `Ship.ModeSame Day`       -8.838e+00  1.189e+01  -0.743 0.457207    
## `Ship.ModeSecond Class`   -1.196e+01  7.796e+00  -1.534 0.124972    
## `Ship.ModeStandard Class` -4.015e+00  8.820e+00  -0.455 0.648992    
## RegionEast                -2.073e+01  6.272e+00  -3.305 0.000954 ***
## RegionSouth               -2.883e+01  7.237e+00  -3.984 6.83e-05 ***
## RegionWest                -2.595e+01  6.215e+00  -4.176 3.00e-05 ***
## Sub.CategoryAppliances     1.415e+01  1.298e+01   1.091 0.275423    
## Sub.CategoryArt           -1.201e+01  1.102e+01  -1.090 0.275938    
## Sub.CategoryBinders        5.806e+01  1.045e+01   5.558 2.82e-08 ***
## Sub.CategoryBookcases     -7.888e+01  1.659e+01  -4.756 2.01e-06 ***
## Sub.CategoryChairs        -4.639e+01  1.198e+01  -3.874 0.000108 ***
## Sub.CategoryCopiers        4.551e+02  2.872e+01  15.846  < 2e-16 ***
## Sub.CategoryEnvelopes     -8.144e+00  1.607e+01  -0.507 0.612299    
## Sub.CategoryFasteners     -1.427e+01  1.708e+01  -0.835 0.403522    
## Sub.CategoryFurnishings   -3.247e+00  1.056e+01  -0.307 0.758492    
## Sub.CategoryLabels        -1.601e+01  1.389e+01  -1.153 0.249085    
## Sub.CategoryMachines      -2.551e+02  2.253e+01 -11.324  < 2e-16 ***
## Sub.CategoryPaper         -1.081e+01  1.006e+01  -1.074 0.282779    
## Sub.CategoryPhones        -1.315e+01  1.083e+01  -1.214 0.224659    
## Sub.CategoryStorage       -3.016e+01  1.091e+01  -2.765 0.005701 ** 
## Sub.CategorySupplies      -6.184e+01  1.738e+01  -3.558 0.000376 ***
## Sub.CategoryTables        -1.473e+02  1.489e+01  -9.893  < 2e-16 ***
## Gross.Margin               5.774e-01  1.100e-01   5.250 1.56e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 194.8 on 7965 degrees of freedom
## Multiple R-squared:  0.3634, Adjusted R-squared:  0.3611 
## F-statistic: 156.8 on 29 and 7965 DF,  p-value: < 2.2e-16

To test the relationship between predictor and response variables we will use hypothesis testing.

\[H_{0} : \beta_{1} = \beta_{2} = ··· = \beta_{10} = 0\] \[H_{a} \text{: at least one } \beta_{j} \text{ is non-zero.}\]

The F stat of 156.8 (>> 1) and a p-value of nearly zero, gives clear evidence against the null hypothesis. At least one of the selected features must be related to Profit.

Judging by the p-value of the predictor variables we can determine which are more significant. The lesser the p-value the more significant the variable is. Features such as Processing.Time, Ship.Mode and Segment are not significant for our model.

The multiple \(R^2\) indicates how much variation is captured by the model. Our value of 0.36 seems low and therefore the amount of variance in the dependent variable that the independent variables explain collectively is not very large.

The \(R^2\) tends to optimistically estimate the fit of the regression and always increases when more predictors are added. The Adjusted \(R^2\) on the other hand is used to determine how reliable the correlation is and increases when the new added predictors improve the model more than would be expected by chance. We will therefore use the adjusted \(R^2\) as the criterion of goodness of the fit. In the previously trained model, the adjusted \(R^2\) is very close to the standard \(R^2\).

To check the multicollinearity in our data we will look at the Variance Inflation Factors (VIF):

vif(lm_mult$finalModel)

##                     Sales                  Quantity          SegmentCorporate 
##                  1.345397                  1.056995                  1.110577 
##      `SegmentHome Office`                  Discount           Processing.Time 
##                  1.108745                  5.713338                  3.142746 
##       `Ship.ModeSame Day`   `Ship.ModeSecond Class` `Ship.ModeStandard Class` 
##                  1.521895                  2.002444                  3.942991 
##                RegionEast               RegionSouth                RegionWest 
##                  1.699253                  1.482735                  1.771716 
##    Sub.CategoryAppliances           Sub.CategoryArt       Sub.CategoryBinders 
##                  1.578024                  1.878397                  2.974738 
##     Sub.CategoryBookcases        Sub.CategoryChairs       Sub.CategoryCopiers 
##                  1.296510                  1.712193                  1.208971 
##     Sub.CategoryEnvelopes     Sub.CategoryFasteners   Sub.CategoryFurnishings 
##                  1.320397                  1.249844                  2.039619 
##        Sub.CategoryLabels      Sub.CategoryMachines         Sub.CategoryPaper 
##                  1.467097                  1.269015                  2.518917 
##        Sub.CategoryPhones       Sub.CategoryStorage      Sub.CategorySupplies 
##                  1.981297                  1.968884                  1.240843 
##        Sub.CategoryTables              Gross.Margin 
##                  1.425779                  5.420298

VIFs for Discount and Gross Margin are already moderately high. Since Gross Margin was calculated using the profit and the sales, we choose to keep Discount since it could provide more relevant and unique information.

Let’s now have a look at the diagnostic plots.

model = lm(Profit ~ Sales + Quantity + Segment + Discount + Processing.Time + Ship.Mode + Region + Sub.Category, data = train)
par(mfrow=c(2,2))
plot(model)

Fitted vs Residual graph
The red line is close to zero (except slightly for the rightmost values), which means the assumption of linearity holds. However the spread of the residuals should be approximately the same across the x axis, but it is not, meaning there is heteroskedasticity.

Normal Q-Q Plot
Most of the points lie in the line as expected, except for the ones at the beginning and the end (left and right skewness). There is some room for improvement here. Nevertheless, this would not represent a big problem, since given the size of our test set, the Central Limit Theorem can be invoked.

Scale-Location
The red line is far from horizontal, showing the presence of heteroskedasticity.

Residuals vs Leverage
The spread of standardized residuals changes as a function of leverage, being more spread for high leverage points. This is another proof of heteroskedasticity.

Lasso regression

Let’s also try to perform feature selection using Lasso regression. For that, we are going to take the initial model and determine which features are less important for it.

profit <- train$Profit
predictors <- data.matrix(train[, c("Sales", "Quantity", "Segment", "Discount", "Processing.Time", "Ship.Mode", "Region", "Sub.Category", "Gross.Margin")])

First, perform k-fold cross-validation to find optimal lambda value. The parameter alpha = 1 corresponds to lasso regression while alpha = 0 corresponds to ridge regression.

cv_model <- cv.glmnet(predictors, profit, alpha = 1)

#find optimal lambda value that minimizes test MSE
best_lambda <- cv_model$lambda.min
best_lambda

## [1] 1.86399

#produce plot of test MSE by lambda value
plot(cv_model)

Once the best lambda is determined we can use it to find the coefficients of the model.

best_model <- glmnet(predictors, profit, alpha = 1, lambda = best_lambda)
coef(best_model)

## 10 x 1 sparse Matrix of class "dgCMatrix"
##                          s0
## (Intercept)      44.0016113
## Sales             0.1856149
## Quantity         -1.8452988
## Segment           .        
## Discount        -64.1884105
## Processing.Time   .        
## Ship.Mode         .        
## Region           -6.0120307
## Sub.Category     -4.1847286
## Gross.Margin      0.9213551

We observe that Ship.Mode, Processing.Time and Segment are not significant for the model, therefore, we can confirm the results obtained previously.

Ridge regression

For the sake of completeness we can also check what results we get with Ridge regression on our initial set of features.

profit <- train$Profit
predictors <- data.matrix(train[, c("Sales", "Quantity", "Segment", "Discount", "Processing.Time", "Ship.Mode", "Region", "Sub.Category", "Gross.Margin")])

As with lasso regression, we find the best value of lambda:

#perform k-fold cross-validation to find optimal lambda value (alpha = 0 is ridge, alpha = 1 is lasso)
cv_model <- cv.glmnet(predictors, profit, alpha = 0)

#find optimal lambda value that minimizes test MSE
best_lambda <- cv_model$lambda.min
best_lambda

## [1] 12.26381

#produce plot of test MSE by lambda value
plot(cv_model)

Find coefficients of the ridge model:

best_model <- glmnet(predictors, profit, alpha = 0, lambda = best_lambda)
coef(best_model)

## 10 x 1 sparse Matrix of class "dgCMatrix"
##                          s0
## (Intercept)      56.3400048
## Sales             0.1795266
## Quantity         -2.2066568
## Segment           1.5221017
## Discount        -87.1680050
## Processing.Time   0.3327554
## Ship.Mode        -1.2538208
## Region           -7.3759030
## Sub.Category     -4.3348805
## Gross.Margin      0.8413572

Here we must look for those features whose coefficients are close to zero. We see that Ship.Mode, Processing.Time and Segment are, but also Gross.Margin and even more Sales. However, before discarding Sales we will check interaction and nonlinear terms.

Interaction terms

Next we are going to check some interaction terms that could potentially have a positive impact on the performance of the model. To determine if an interaction term should be included or not we will check that its p-value is lower than 0.05, its VIF is not bigger than 3 and that there is an increase in the adjusted \(R^2\). Finally we will test with ANOVA if the inclusion of the term results in a better model.

# Interactions to check:
# Sales:Quantity -----> No improvement in the model
# Sales:Discount -------------> Increases a lot the adjusted R squared, good. Vif also okay
# Ship.Mode:Processing.Time --------------- No improvement
# Sales:Sub.Category -------- Adjusted R squared increases but vif factors are not good.
# Sales:Segment -----------------------> Slight increase in R^2 and vif factors, but still okay. Won't take, since doesn't improve R so much.
# Region:Sales ---->   Increases a bit the adjusted R squared but also vif factors, so better not to take.

lm_mult.fit <- lm(Profit ~ Sales + Quantity + Segment + Gross.Margin + Processing.Time + Sales:Discount + Ship.Mode + Region + Sub.Category , data = train)
summary(lm_mult.fit)

## 
## Call:
## lm(formula = Profit ~ Sales + Quantity + Segment + Gross.Margin + 
##     Processing.Time + Sales:Discount + Ship.Mode + Region + Sub.Category, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4434.6   -14.6     2.7    26.4  2854.9 
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)              9.887e+00  7.285e+00    1.357  0.17473    
## Sales                    3.833e-01  2.823e-03  135.787  < 2e-16 ***
## Quantity                -5.556e+00  6.047e-01   -9.187  < 2e-16 ***
## SegmentCorporate         6.439e-01  3.004e+00    0.214  0.83028    
## SegmentHome Office       4.331e+00  3.621e+00    1.196  0.23176    
## Gross.Margin             5.414e-01  3.398e-02   15.933  < 2e-16 ***
## Processing.Time         -8.140e-01  1.332e+00   -0.611  0.54118    
## Ship.ModeSame Day       -5.144e+00  7.171e+00   -0.717  0.47325    
## Ship.ModeSecond Class   -1.254e+01  4.701e+00   -2.668  0.00764 ** 
## Ship.ModeStandard Class -3.029e+00  5.321e+00   -0.569  0.56922    
## RegionEast              -9.965e+00  3.785e+00   -2.633  0.00848 ** 
## RegionSouth             -1.378e+01  4.367e+00   -3.155  0.00161 ** 
## RegionWest              -1.049e+01  3.746e+00   -2.800  0.00512 ** 
## Sub.CategoryAppliances   3.206e-01  7.795e+00    0.041  0.96720    
## Sub.CategoryArt          1.111e+01  6.650e+00    1.670  0.09495 .  
## Sub.CategoryBinders      4.246e+01  5.950e+00    7.137 1.04e-12 ***
## Sub.CategoryBookcases   -7.083e+01  1.001e+01   -7.079 1.58e-12 ***
## Sub.CategoryChairs      -4.353e+01  7.209e+00   -6.038 1.63e-09 ***
## Sub.CategoryCopiers      3.166e+02  1.724e+01   18.365  < 2e-16 ***
## Sub.CategoryEnvelopes    1.144e+01  9.621e+00    1.190  0.23424    
## Sub.CategoryFasteners    1.172e+01  1.030e+01    1.138  0.25495    
## Sub.CategoryFurnishings  5.641e+00  6.360e+00    0.887  0.37515    
## Sub.CategoryLabels       7.911e+00  8.318e+00    0.951  0.34154    
## Sub.CategoryMachines    -1.017e+02  1.353e+01   -7.516 6.25e-14 ***
## Sub.CategoryPaper        1.009e+01  5.974e+00    1.689  0.09118 .  
## Sub.CategoryPhones      -9.228e+00  6.503e+00   -1.419  0.15595    
## Sub.CategoryStorage     -3.661e+01  6.534e+00   -5.603 2.17e-08 ***
## Sub.CategorySupplies    -6.427e+01  1.047e+01   -6.140 8.63e-10 ***
## Sub.CategoryTables      -1.093e+02  8.950e+00  -12.210  < 2e-16 ***
## Sales:Discount          -1.166e+00  9.843e-03 -118.439  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 117.5 on 7965 degrees of freedom
## Multiple R-squared:  0.7683, Adjusted R-squared:  0.7675 
## F-statistic: 910.8 on 29 and 7965 DF,  p-value: < 2.2e-16

vif(lm_mult.fit)

##                     GVIF Df GVIF^(1/(2*Df))
## Sales           1.999385  1        1.413996
## Quantity        1.058654  1        1.028909
## Segment         1.008097  2        1.002018
## Gross.Margin    1.421714  1        1.192356
## Processing.Time 3.141738  1        1.772495
## Ship.Mode       3.176082  3        1.212408
## Region          1.110950  3        1.017691
## Sub.Category    1.723923 16        1.017164
## Sales:Discount  1.708649  1        1.307153

Based on the results for different models we keep the interaction between Sales and Discount. It makes the adjusted \(R^2\) increase significantly while keeping a good p-value and VIF. With the inclusion of this term, we realize that having Gross.Margin instead of Discount improves the adjusted \(R^2\), thus we use it instead.

We can use ANOVA test to study if the inclusion of Sales:Discount results in a superior model as compared to the previous one.

The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior.

lm_mult.fit2 <- lm(Profit ~ Sales + Sales:Discount + Quantity + Segment + Gross.Margin + Processing.Time + Ship.Mode + Region + Sub.Category , data = train)
lm_mult.fit <- lm(Profit ~ Sales + Quantity + Segment + Discount + Processing.Time + Ship.Mode + Region + Sub.Category , data = train)
anova(lm_mult.fit , lm_mult.fit2)

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
7966	303294573	NA	NA	NA	NA
7965	110009295	1	193285278	13994.43	0

The big F and zero p-value are strong evidences of the fact that including the interaction term between Sales and Discount makes the model better.

Nonlinear terms

Next step is to check nonlinear terms. Similar to the previous case, we go over different nonlinear term candidates to include in our model and base the choice on the p-value, VIF and the adjusted \(R^2\).

# Quantity^2 -----> Good pvalue but high vif factor and no improvement of R
# Discount^2 ------------> high pvalue
# Poly(Sales, 3) ------------>  good pvalue and increase in R, no problem with vif
# Poly(Quantity,5) ------------> Nope, only Quantity^2 p-value is good, and we have seen before vif are high
# log(sales) ------------> adding this term to Sales has a good p value, but no improvement on R. Replacing the Sales by log(Sales) very bad R^2
# sqrt(Sales) ------------> same as with log(sales)
# log(Quantity) ------------> same problem with log(sales) but anova shows is not that good
# sqrt(Quantity) ------------> same as log(quantity) but with pvalue in anova 3%
#sqrt(Gross.Margin) ------------> good p-value, good increase in adjusted R, good vif

lm_mult.fit <- lm(Profit ~ poly(Sales, 3) + Sales:Discount + Segment + Gross.Margin + Processing.Time + Ship.Mode + Region + Sub.Category , data = train)
summary(lm_mult.fit)

## 
## Call:
## lm(formula = Profit ~ poly(Sales, 3) + Sales:Discount + Segment + 
##     Gross.Margin + Processing.Time + Ship.Mode + Region + Sub.Category, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3665.2   -14.0    -1.3    16.1  2158.3 
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)              8.280e+01  6.012e+00   13.772  < 2e-16 ***
## poly(Sales, 3)1          2.354e+04  1.532e+02  153.673  < 2e-16 ***
## poly(Sales, 3)2          6.204e+03  1.185e+02   52.365  < 2e-16 ***
## poly(Sales, 3)3          3.719e+02  1.221e+02    3.045 0.002334 ** 
## SegmentCorporate        -3.400e-01  2.603e+00   -0.131 0.896086    
## SegmentHome Office       4.285e+00  3.139e+00    1.365 0.172272    
## Gross.Margin             5.318e-01  2.964e-02   17.941  < 2e-16 ***
## Processing.Time         -1.347e+00  1.154e+00   -1.167 0.243180    
## Ship.ModeSame Day       -2.119e+00  6.213e+00   -0.341 0.733124    
## Ship.ModeSecond Class   -1.086e+01  4.073e+00   -2.666 0.007695 ** 
## Ship.ModeStandard Class -2.146e+00  4.610e+00   -0.465 0.641642    
## RegionEast              -7.443e+00  3.280e+00   -2.269 0.023287 *  
## RegionSouth             -1.287e+01  3.784e+00   -3.402 0.000673 ***
## RegionWest              -9.183e+00  3.245e+00   -2.830 0.004672 ** 
## Sub.CategoryAppliances   3.748e+00  6.755e+00    0.555 0.579030    
## Sub.CategoryArt         -7.214e+00  5.798e+00   -1.244 0.213408    
## Sub.CategoryBinders      3.395e+01  5.171e+00    6.566 5.47e-11 ***
## Sub.CategoryBookcases   -2.817e+01  8.721e+00   -3.231 0.001240 ** 
## Sub.CategoryChairs      -6.822e-01  6.322e+00   -0.108 0.914074    
## Sub.CategoryCopiers      3.853e+02  1.493e+01   25.801  < 2e-16 ***
## Sub.CategoryEnvelopes   -1.812e+00  8.356e+00   -0.217 0.828351    
## Sub.CategoryFasteners   -1.047e+01  8.951e+00   -1.170 0.242108    
## Sub.CategoryFurnishings -4.643e+00  5.523e+00   -0.841 0.400585    
## Sub.CategoryLabels      -1.059e+01  7.240e+00   -1.463 0.143459    
## Sub.CategoryMachines    -9.509e-01  1.184e+01   -0.080 0.935989    
## Sub.CategoryPaper       -5.266e+00  5.212e+00   -1.010 0.312323    
## Sub.CategoryPhones       1.390e+01  5.653e+00    2.459 0.013940 *  
## Sub.CategoryStorage     -3.018e+01  5.665e+00   -5.328 1.02e-07 ***
## Sub.CategorySupplies    -6.596e+01  9.075e+00   -7.269 3.98e-13 ***
## Sub.CategoryTables      -4.441e+01  7.872e+00   -5.641 1.75e-08 ***
## Sales:Discount          -1.379e+00  1.066e-02 -129.351  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 101.8 on 7964 degrees of freedom
## Multiple R-squared:  0.8261, Adjusted R-squared:  0.8254 
## F-statistic:  1261 on 30 and 7964 DF,  p-value: < 2.2e-16

vif(lm_mult.fit)

##                     GVIF Df GVIF^(1/(2*Df))
## poly(Sales, 3)  3.664349  3        1.241651
## Segment         1.008911  2        1.002220
## Gross.Margin    1.441088  1        1.200453
## Processing.Time 3.141824  1        1.772519
## Ship.Mode       3.174908  3        1.212333
## Region          1.110712  3        1.017654
## Sub.Category    2.101800 16        1.023484
## Sales:Discount  2.667644  1        1.633292

Taking second and third order terms of sales increases the adjusted \(R^2\). It shows also a low p-value and no problem with multicollinearity.

Let’s use ANOVA test to study Poly(Sales, 3). The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the model including the nonlinear term is superior.

lm_mult.fit2 <- lm(Profit ~ poly(Sales, 3) + Quantity + Segment + Gross.Margin + Processing.Time + Ship.Mode + Region + Sub.Category , data = train)
lm_mult.fit <- lm(Profit ~ Sales + Quantity + Segment + Gross.Margin + Processing.Time + Ship.Mode + Region + Sub.Category , data = train)

anova(lm_mult.fit , lm_mult.fit2)

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
7966	303755009	NA	NA	NA	NA
7964	255905682	2	47849327	744.5557	0

There is clear evidence that the model which includes the nonlinearity on Sales performs better than that that doesn’t include it.

Final variable selection

Based on previous analysis of Lasso and Ridge regression, interaction terms and nonlinear terms let’s create a new model selecting the most important variables and train it using 5-fold cross validation.

lm_mult = train(
  form = Profit ~ poly(Sales,3) + Sales:Discount + Sub.Category + Gross.Margin,
  data = train,
  trControl = trainControl(method = "cv", number = 5),
  method = "lm")

summary(lm_mult)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3672.2   -12.9    -2.8    15.5  2158.2 
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)              6.808e+01  4.151e+00   16.401  < 2e-16 ***
## `poly(Sales, 3)1`        2.355e+04  1.532e+02  153.690  < 2e-16 ***
## `poly(Sales, 3)2`        6.208e+03  1.186e+02   52.351  < 2e-16 ***
## `poly(Sales, 3)3`        3.695e+02  1.222e+02    3.025 0.002496 ** 
## Sub.CategoryAppliances   2.878e+00  6.755e+00    0.426 0.670130    
## Sub.CategoryArt         -7.441e+00  5.802e+00   -1.283 0.199703    
## Sub.CategoryBinders      3.314e+01  5.168e+00    6.413 1.51e-10 ***
## Sub.CategoryBookcases   -2.885e+01  8.719e+00   -3.308 0.000943 ***
## Sub.CategoryChairs      -1.019e+00  6.327e+00   -0.161 0.872121    
## Sub.CategoryCopiers      3.862e+02  1.495e+01   25.838  < 2e-16 ***
## Sub.CategoryEnvelopes   -1.719e+00  8.357e+00   -0.206 0.837063    
## Sub.CategoryFasteners   -9.356e+00  8.957e+00   -1.045 0.296275    
## Sub.CategoryFurnishings -5.155e+00  5.527e+00   -0.933 0.351021    
## Sub.CategoryLabels      -9.934e+00  7.245e+00   -1.371 0.170370    
## Sub.CategoryMachines    -9.702e-01  1.184e+01   -0.082 0.934701    
## Sub.CategoryPaper       -4.545e+00  5.214e+00   -0.872 0.383408    
## Sub.CategoryPhones       1.360e+01  5.657e+00    2.404 0.016248 *  
## Sub.CategoryStorage     -3.049e+01  5.668e+00   -5.379 7.69e-08 ***
## Sub.CategorySupplies    -6.734e+01  9.080e+00   -7.416 1.33e-13 ***
## Sub.CategoryTables      -4.536e+01  7.873e+00   -5.762 8.62e-09 ***
## Gross.Margin             5.024e-01  2.833e-02   17.735  < 2e-16 ***
## `Sales:Discount`        -1.379e+00  1.066e-02 -129.363  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102 on 7973 degrees of freedom
## Multiple R-squared:  0.8254, Adjusted R-squared:  0.825 
## F-statistic:  1795 on 21 and 7973 DF,  p-value: < 2.2e-16

The feature Region has been finally dropped because it was found to have a high p-value in the new model. The F-statistic of this model is more than 10 orders of magnitudes bigger than that of the initial model which means that the predictors in our final model are very related to the response. The final adjusted \(R^2\) is equal to 0.83, which is a quite high value and reflects the power of the model.

Check for multicollinearity:

vif(lm_mult$finalModel)

##       `poly(Sales, 3)1`       `poly(Sales, 3)2`       `poly(Sales, 3)3` 
##                2.257828                1.352624                1.435578 
##  Sub.CategoryAppliances         Sub.CategoryArt     Sub.CategoryBinders 
##                1.561027                1.901363                2.657537 
##   Sub.CategoryBookcases      Sub.CategoryChairs     Sub.CategoryCopiers 
##                1.307698                1.744446                1.194897 
##   Sub.CategoryEnvelopes   Sub.CategoryFasteners Sub.CategoryFurnishings 
##                1.303731                1.254610                2.040373 
##      Sub.CategoryLabels    Sub.CategoryMachines       Sub.CategoryPaper 
##                1.458071                1.279394                2.467364 
##      Sub.CategoryPhones     Sub.CategoryStorage    Sub.CategorySupplies 
##                1.973849                1.940746                1.235896 
##      Sub.CategoryTables            Gross.Margin        `Sales:Discount` 
##                1.455063                1.313200                2.663086

There is no multicollinearity among the final chosen features. Let’s then analyze the diagnostic plots.

par(mfrow=c(2,2))
plot(lm_mult$finalModel)

Fitted vs Residual graph
The red line is a little further from zero than in the initial model, indicating that there is less linearity, especially for values far to the right and far to the left. The spread of the residuals is not approximately the same across the x axis, indicating the presence of heteroskedasticity.

Normal Q-Q Plot
Very similar to the plot for the initial model. But as said before, this would not represent a big problem, since given the size of our test set, the Central Limit Theorem can be invoked.

Scale-Location
The red line is far from horizontal, confirming the presence of heteroskedasticity.

Residuals vs Leverage
The spread of standardized residuals changes as a function of leverage, being more spread for high leverage points. This is another proof of heteroskedasticity. There are three outliers that are found beyond Cook’s distance.

In the face of heteroskedasticity, we still have unbiased parameter estimates but we can’t trust the values of their variance.

Finally let’s use the final model to predict on the test set and calculate the RMSE.

pred <- predict(lm_mult, newdata = test)
rmse <- sqrt(sum((pred - test$Profit)^2)/length(test$Profit))
normalized_rmse <- 100 * rmse/(max(train$Profit)-min(train$Profit))
cat(paste("RMSE =", rmse))

## RMSE = 91.742018621247

cat(paste("\nNormalized RMSE =", normalized_rmse, "%"))

## 
## Normalized RMSE = 0.611615333095335 %

Even though a RMSE of 92 may seem high we need to look at the range of our response variable and normalize the RMSE to get a better sense of the meaning of the MSE. The normalized RMSE is 0.6%, which is a low enough error for the purposes of our model.

Finally let’s have a look at the Profit vs Prediction plot. As expected all values are around a slope 1 line.

Profit <- test$Profit
Prediction <- pred
plot(Profit, Prediction)

Classification

In this section we will address a binary classification problem. The idea is to make the column Profit a binary column, so that all positive profits get label Positive and all negative profits get label Negative. By doing this we can train a model that is able to predict if a purchase will result in positive or negative profit based on features such as Region, State, City, Sub-Category, Segment, Ship.Mode, Sales, Quantity, Discount and Process time.

As a first step let us define a function to compute the most typical evaluation metrics: accuracy, precision, recall and F1 score.

calc_metrics <- function(cm) {

  n = sum(cm) # number of instances
  nc = nrow(cm) # number of classes
  diag = diag(cm) # number of correctly classified instances per class 
  rowsums = rowSums(cm) # number of instances per class
  colsums = colSums(cm) # number of predictions per class
  p = rowsums / n # distribution of instances over the actual classes
  q = colsums / n # distribution of instances over the predicted classes

  accuracy = sum(diag) / n
  precision = diag / colsums 
  recall = diag / rowsums 
  f_1 = 2 * precision * recall / (precision + recall) 
   
  print(paste("Accuracy", accuracy))
  print("Precision: ")
  print(precision)
  print("Recall: ")
  print(recall)
  print("F1 score: ")
  print(f_1)

}

The column Profit is then modified and the discrete value features are converted to factors.

df_log <- df[, c("Profit","Discount", "Quantity", "Sales", "Segment", "Ship.Mode", "Region", "Order.Month", "Processing.Time", "Sub.Category", "Order.Year")]
df_log["Profit"]<-replace(df_log["Profit"], df["Profit"] < 0, "Negative")
df_log["Profit"]<-replace(df_log["Profit"], df["Profit"] >= 0, "Positive")
df_log[,"Profit"]<- as.factor(df_log[,"Profit"])
df_log[,"Segment"]<- as.factor(df_log[,"Segment"])
df_log[,"Ship.Mode"]<- as.factor(df_log[,"Ship.Mode"])
df_log[,"Region"]<- as.factor(df_log[,"Region"])
df_log[,"Sub.Category"]<- as.factor(df_log[,"Sub.Category"])

Let’s check for unbalance in the data.

table(df_log$Profit)

## 
## Negative Positive 
##     1871     8123

prop.table(table(df_log$Profit))

## 
##  Negative  Positive 
## 0.1872123 0.8127877

contrasts(df_log$Profit)

##          Positive
## Negative        0
## Positive        1

As we can see there is a moderate unbalance in our data. We are going to tackle this problem by undersampling and oversampling accordingly. But first let’s see what results we get with the unbalanced data. The train-test split is done taking as training all examples corresponding to the years 2014, 2015 and 2016, leaving 2017 as the test set.

train <- (df_log$Order.Year < 2017)
df_trn <- df_log[train, ]
df_tst <- df_log[!train, ]
dim(df_tst)

## [1] 3312   11

Profit.2017 <- df_log$Profit[!train]

Feature selection

Before training the model feature selection is done to avoid overfitting and thus ensure better results. The methodology followed to select the best features is based on the comparison of the AIC and the McFadden’s \(R^{2}\) for different trained linear models obtained by suppressing features one at a time.

The Akaike Information Criterion (AIC) is a mathematical method for evaluating how well a model fits the data it was generated from. We use it to compare different models and determine which one fits the data better. The best-fit model according to AIC is the one that explains the greatest amount of variation using the fewest possible independent variables. It is based on the number of independent variables K and the log-likelihood estimate L according to the expression:

\[ AIC = 2K - 2ln(L) \]

Usually, when comparing two models, a difference of more than 2 in their AIC value is enough to say the model with the lower AIC is better.

Since GLM models use a maximum likelihood estimator, there is no minimization of the squared error and hence no R is calculated. There are a large number of pseudo-\(R^{2}\) for GLMs, among which the most popular is McFadden’s \(R^2\). It is calculated as:

\[R^{2} = 1 - \frac{Residual Deviance}{Null Deviance}\] We will therefore look for models that present a bigger McFadden’s \(R^{2}\).

# Let's try now doing the same but with less variables (The ones with higher p value in the full fitted model)

# glm.fits1 <- glm(Profit ~  Discount + Quantity + Sales + Ship.Mode + Sub.Category , data = df_trn, family = binomial)
# glm.fits2 <- glm(Profit ~  Discount + Quantity + Sales + Ship.Mode +  Processing.Time , data = df_trn, family = binomial)
# glm.fits3 <- glm(Profit ~  Discount + Sales +  Sub.Category , data = df_trn, family = binomial)
# glm.fits4 <- glm(Profit ~  Discount + Quantity + Sales + Ship.Mode + Region + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits5 <- glm(Profit ~  Discount + Sales  + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits6 <- glm(Profit ~  Discount + Quantity + Sales + Region + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits7 <- glm(Profit ~  Discount + Quantity + Sales + Ship.Mode + Region + Order.Month + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits8 <- glm(Profit ~  Discount +  Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits9 <- glm(Profit ~  Discount + Sales + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits10 <- glm(Profit ~ Quantity + Sales + Ship.Mode + Region + Order.Month + Processing.Time + Sub.Category , data = df_trn, family = binomial)


glm.fits1 <- glm(Profit ~  Discount + Quantity + Sales + Segment + Ship.Mode + Region + Order.Month + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits2 <- glm(Profit ~  Discount +  Sales + Ship.Mode + Region + Processing.Time , data = df_trn, family = binomial) # Subcategory is important
glm.fits3 <- glm(Profit ~  Discount + Sales + Ship.Mode + Region +  Sub.Category , data = df_trn, family = binomial)
glm.fits4 <- glm(Profit ~  Discount + Sales +  Ship.Mode + Region  + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits5 <- glm(Profit ~  Discount +  Sales +  Ship.Mode + Processing.Time + Sub.Category , data = df_trn, family = binomial) # Region seems to be important
# glm.fits6 <- glm(Profit ~  Discount + Sales +  Region + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits7 <- glm(Profit ~  Discount +  Sales + Ship.Mode + Region + Processing.Time + Sub.Category , data = df_trn, family = binomial)
glm.fits8 <- glm(Profit ~  Discount +  Ship.Mode + Region + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits9 <- glm(Profit ~  Discount + Sales +  Ship.Mode + Region + Processing.Time + Sub.Category , data = df_trn, family = binomial)
# glm.fits10 <- glm(Profit ~  Quantity + Sales +  Ship.Mode + Region + Order.Month + Processing.Time + Sub.Category , data = df_trn, family = binomial) # Discount is very important

print(paste("AIC1: ", extractAIC(glm.fits1)[2]))

## [1] "AIC1:  1540.52892268126"

print(paste("McFadden's R^2  ", with(summary(glm.fits1), 1 - deviance/null.deviance)))

## [1] "McFadden's R^2   0.770234940945986"

# print(paste("AIC2: ", extractAIC(glm.fits2)[2]))
# print(paste("McFadden's R^2  ", with(summary(glm.fits2), 1 - deviance/null.deviance)))

print(paste("AIC3: ", extractAIC(glm.fits3)[2]))

## [1] "AIC3:  1537.68458296899"

print(paste("McFadden's R^2  ", with(summary(glm.fits3), 1 - deviance/null.deviance)))

## [1] "McFadden's R^2   0.769124445444416"

print(paste("AIC4: ", extractAIC(glm.fits4)[2]))

## [1] "AIC4:  1534.90359508884"

print(paste("McFadden's R^2  ", with(summary(glm.fits4), 1 - deviance/null.deviance)))

## [1] "McFadden's R^2   0.769866412687196"

# print(paste("AIC5: ", extractAIC(glm.fits5)[2]))
# print(paste("McFadden's R^2  ", with(summary(glm.fits5), 1 - deviance/null.deviance)))

# print(paste("AIC6: ", extractAIC(glm.fits6)[2]))
# print(paste("McFadden's R^2  ", with(summary(glm.fits6), 1 - deviance/null.deviance)))

# print(paste("AIC7: ", extractAIC(glm.fits7)[2]))
# print(paste("McFadden's R^2  ", with(summary(glm.fits7), 1 - deviance/null.deviance)))

print(paste("AIC8: ", extractAIC(glm.fits8)[2]))

## [1] "AIC8:  1534.24293430454"

print(paste("McFadden's R^2  ", with(summary(glm.fits8), 1 - deviance/null.deviance)))

## [1] "McFadden's R^2   0.769658559028093"

# print(paste("AIC9: ", extractAIC(glm.fits9)[2]))
# print(paste("McFadden's R^2  ", with(summary(glm.fits9), 1 - deviance/null.deviance)))

# print(paste("AIC10: ", extractAIC(glm.fits10)[2]))
# print(paste("McFadden's R^2  ", with(summary(glm.fits10), 1 - deviance/null.deviance)))

Best features found are: Discount, Sales, Sub.Category, Processing.Time and Region. We choose only those five to keep the problem simple and because the addition of other features does not improve the performance of the model.

Logistic regression

Logistic regression is the standard base classification algorithm when having linearly separable data. That’s why we’ll use it first to train our model. Furthermore, the use of 5-fold cross validation will give us better estimations of the performance of the model.

Unbalanced Dataset and Crossvalidation

glm_mod = train(
  form = Profit ~ Discount + Sales + Sub.Category + Region + Processing.Time,
  data = df_trn,
  trControl = trainControl(method = "cv", number = 5),
  method = "glm",
  family = "binomial"
)

glm_mod

## Generalized Linear Model 
## 
## 6682 samples
##    5 predictor
##    2 classes: 'Negative', 'Positive' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 5346, 5345, 5346, 5345, 5346 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9446272  0.8067708

summary(glm_mod)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.83125   0.00000   0.00118   0.05005   2.60590  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              7.057e+00  4.490e-01  15.720  < 2e-16 ***
## Discount                -3.012e+01  1.846e+00 -16.316  < 2e-16 ***
## Sales                    1.110e-04  9.633e-05   1.153  0.24906    
## Sub.CategoryAppliances   8.530e+00  8.265e+00   1.032  0.30205    
## Sub.CategoryArt          1.845e+01  6.644e+02   0.028  0.97784    
## Sub.CategoryBinders      7.067e+00  1.686e+00   4.191 2.78e-05 ***
## Sub.CategoryBookcases   -9.599e-01  3.090e-01  -3.107  0.00189 ** 
## Sub.CategoryChairs       1.937e-01  2.369e-01   0.818  0.41338    
## Sub.CategoryCopiers      2.325e+01  1.964e+03   0.012  0.99056    
## Sub.CategoryEnvelopes    1.851e+01  1.116e+03   0.017  0.98677    
## Sub.CategoryFasteners    9.761e-01  3.757e-01   2.598  0.00937 ** 
## Sub.CategoryFurnishings  1.167e+00  2.866e-01   4.073 4.63e-05 ***
## Sub.CategoryLabels       1.840e+01  9.519e+02   0.019  0.98458    
## Sub.CategoryMachines     3.804e+00  7.249e-01   5.248 1.54e-07 ***
## Sub.CategoryPaper        1.843e+01  4.990e+02   0.037  0.97054    
## Sub.CategoryPhones       2.036e+00  2.892e-01   7.039 1.93e-12 ***
## Sub.CategoryStorage     -9.076e-01  2.056e-01  -4.413 1.02e-05 ***
## Sub.CategorySupplies    -6.440e-01  3.170e-01  -2.031  0.04223 *  
## Sub.CategoryTables      -2.378e-01  3.331e-01  -0.714  0.47539    
## RegionEast              -2.676e-01  1.915e-01  -1.398  0.16222    
## RegionSouth              6.827e-02  2.199e-01   0.310  0.75620    
## RegionWest              -5.004e-01  2.048e-01  -2.443  0.01456 *  
## Processing.Time         -2.360e-02  3.751e-02  -0.629  0.52921    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6443.7  on 6681  degrees of freedom
## Residual deviance: 1495.2  on 6659  degrees of freedom
## AIC: 1541.2
## 
## Number of Fisher Scoring iterations: 19

Check McFadden’s \(R^2\).

print("McFadden's R^2")

## [1] "McFadden's R^2"

with(summary(glm_mod), 1 - deviance/null.deviance)

## [1] 0.7679639

Predict and display confusion matrix.

glm.probs <- predict(glm_mod, df_tst)
levels = c("Positive", "Negative")
cm <- as.matrix(table(glm.probs , Profit.2017)[levels,levels])
cm

##           Profit.2017
## glm.probs  Positive Negative
##   Positive     2647      146
##   Negative       45      474

Calculate evaluation metrics:

calc_metrics(cm)

## [1] "Accuracy 0.942330917874396"
## [1] "Precision: "
##  Positive  Negative 
## 0.9832838 0.7645161 
## [1] "Recall: "
##  Positive  Negative 
## 0.9477265 0.9132948 
## [1] "F1 score: "
##  Positive  Negative 
## 0.9651778 0.8323090

roc.curve(Profit.2017, glm.probs, plotit = F)

## Area under the curve (AUC): 0.874

Balanced Dataset and Crossvalidation

Now we’ll balance the dataset to determine if the classification improves for balanced data. Two approaches are used: a synthetic method and a sampling method.

Library ROSE ( Random Over Sampling Examples) is used to generate artificial data based on sampling methods and smoothed bootstrap approach and has been proved to yield better results than traditional sampling methods in many situations. It is therefore a type of oversampling technique.

On the other hand, the package ovun.sample is used to perform oversampling and undersampling in one go. The results will be compared to the synthetic method approach.

balanced_data <- ROSE(Profit ~ ., data = df_trn, seed = 1)$data
table(balanced_data$Profit)

## 
## Positive Negative 
##     3368     3314

contrasts(balanced_data$Profit)

##          Negative
## Positive        0
## Negative        1

We can see the data is now balanced with more or less the same amount of examples in each class. Now we train the model using 5-fold crossvalidation.

glm_fits = train(
  form = Profit ~ Discount + Sales + Sub.Category + Region + Processing.Time,
  data = balanced_data,
  trControl = trainControl(method = "cv", number = 5),
  method = "glm",
  family = "binomial"
)

glm_fits

## Generalized Linear Model 
## 
## 6682 samples
##    5 predictor
##    2 classes: 'Positive', 'Negative' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 5346, 5347, 5345, 5345, 5345 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9033228  0.8066274

summary(glm_fits)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.05236  -0.22592  -0.00001   0.19759   3.03605  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -2.533e+00  2.029e-01 -12.485  < 2e-16 ***
## Discount                 1.518e+01  4.529e-01  33.516  < 2e-16 ***
## Sales                   -7.874e-05  5.664e-05  -1.390 0.164424    
## Sub.CategoryAppliances  -4.022e+00  6.470e-01  -6.217 5.08e-10 ***
## Sub.CategoryArt         -1.897e+01  5.137e+02  -0.037 0.970548    
## Sub.CategoryBinders     -3.173e+00  2.493e-01 -12.725  < 2e-16 ***
## Sub.CategoryBookcases    1.346e+00  2.583e-01   5.212 1.86e-07 ***
## Sub.CategoryChairs       5.621e-01  1.745e-01   3.220 0.001280 ** 
## Sub.CategoryCopiers     -2.062e+01  1.486e+03  -0.014 0.988929    
## Sub.CategoryEnvelopes   -1.912e+01  8.761e+02  -0.022 0.982589    
## Sub.CategoryFasteners   -1.091e+00  3.314e-01  -3.293 0.000991 ***
## Sub.CategoryFurnishings -1.109e+00  2.212e-01  -5.014 5.32e-07 ***
## Sub.CategoryLabels      -1.915e+01  7.622e+02  -0.025 0.979959    
## Sub.CategoryMachines    -1.377e+00  3.750e-01  -3.674 0.000239 ***
## Sub.CategoryPaper       -1.904e+01  3.979e+02  -0.048 0.961834    
## Sub.CategoryPhones      -1.073e+00  1.820e-01  -5.898 3.69e-09 ***
## Sub.CategoryStorage      9.205e-01  1.725e-01   5.335 9.54e-08 ***
## Sub.CategorySupplies     4.291e-01  2.827e-01   1.518 0.128974    
## Sub.CategoryTables       1.313e+00  2.505e-01   5.244 1.57e-07 ***
## RegionEast               9.319e-02  1.345e-01   0.693 0.488519    
## RegionSouth             -5.819e-01  1.617e-01  -3.598 0.000320 ***
## RegionWest              -4.679e-01  1.336e-01  -3.501 0.000463 ***
## Processing.Time          4.122e-02  2.522e-02   1.634 0.102155    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9262.8  on 6681  degrees of freedom
## Residual deviance: 2810.6  on 6659  degrees of freedom
## AIC: 2856.6
## 
## Number of Fisher Scoring iterations: 18

Check McFadden’s \(R^2\).

with(summary(glm_fits), 1 - deviance/null.deviance)

## [1] 0.6965676

Predict and display confusion matrix.

glm.probs <- predict(glm_fits , df_tst)
levels = c("Positive", "Negative")
cm <- as.matrix(table(glm.probs, Profit.2017)[levels,levels])
cm

##           Profit.2017
## glm.probs  Positive Negative
##   Positive     2470       28
##   Negative      222      592

Calculate evaluation metrics:

calc_metrics(cm)

## [1] "Accuracy 0.92451690821256"
## [1] "Precision: "
##  Positive  Negative 
## 0.9175334 0.9548387 
## [1] "Recall: "
##  Positive  Negative 
## 0.9887910 0.7272727 
## [1] "F1 score: "
##  Positive  Negative 
## 0.9518304 0.8256625

roc.curve(Profit.2017, glm.probs, plotit = F)

## Area under the curve (AUC): 0.936

Let’s now see how the package ovun.sample performs compared to the previous approach. Here, we can choose to do only oversampling, only undersampling or a mix of the two. We choose a mix of the two by imposing p = 0.5, so that we end up with a 50% probability of positive class in the resulting balanced data.

balanced_data <- ovun.sample(Profit ~ ., data = df_trn, method = "both", p=0.5, N = nrow(df_trn), seed = 1)$data # Can play with parameter p: probability of positive class in newly generated sample
table(balanced_data$Profit)

## 
## Positive Negative 
##     3368     3314

Train the model using 5-fold crossvalidation.

glm_fits = train(
  form = Profit ~ Discount + Sales + Sub.Category + Region + Processing.Time,
  data = balanced_data,
  trControl = trainControl(method = "cv", number = 5),
  method = "glm",
  family = "binomial"
)

glm_fits

## Generalized Linear Model 
## 
## 6682 samples
##    5 predictor
##    2 classes: 'Positive', 'Negative' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 5346, 5346, 5345, 5345, 5346 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9313078  0.8626549

summary(glm_fits)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2821  -0.0394   0.0000   0.0419   2.3534  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -5.873e+00  3.919e-01 -14.986  < 2e-16 ***
## Discount                 3.131e+01  1.602e+00  19.544  < 2e-16 ***
## Sales                   -2.208e-04  6.687e-05  -3.302 0.000959 ***
## Sub.CategoryAppliances  -9.352e+00  9.397e+00  -0.995 0.319657    
## Sub.CategoryArt         -1.884e+01  4.906e+02  -0.038 0.969369    
## Sub.CategoryBinders     -8.130e+00  1.682e+00  -4.832 1.35e-06 ***
## Sub.CategoryBookcases    1.065e+00  3.027e-01   3.518 0.000435 ***
## Sub.CategoryChairs      -2.451e-01  2.008e-01  -1.220 0.222317    
## Sub.CategoryCopiers     -2.307e+01  1.334e+03  -0.017 0.986201    
## Sub.CategoryEnvelopes   -1.890e+01  8.524e+02  -0.022 0.982307    
## Sub.CategoryFasteners   -8.131e-01  3.111e-01  -2.614 0.008952 ** 
## Sub.CategoryFurnishings -1.136e+00  2.357e-01  -4.819 1.44e-06 ***
## Sub.CategoryLabels      -1.887e+01  7.421e+02  -0.025 0.979708    
## Sub.CategoryMachines    -3.661e+00  5.618e-01  -6.517 7.17e-11 ***
## Sub.CategoryPaper       -1.877e+01  3.826e+02  -0.049 0.960866    
## Sub.CategoryPhones      -1.744e+00  2.066e-01  -8.438  < 2e-16 ***
## Sub.CategoryStorage      1.111e+00  1.928e-01   5.764 8.21e-09 ***
## Sub.CategorySupplies     4.545e-01  3.105e-01   1.464 0.143225    
## Sub.CategoryTables      -4.067e-02  3.001e-01  -0.136 0.892200    
## RegionEast               3.674e-01  1.818e-01   2.021 0.043282 *  
## RegionSouth             -2.246e-01  1.889e-01  -1.189 0.234271    
## RegionWest               5.845e-01  1.772e-01   3.298 0.000972 ***
## Processing.Time          2.264e-02  3.320e-02   0.682 0.495262    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9262.8  on 6681  degrees of freedom
## Residual deviance: 1914.9  on 6659  degrees of freedom
## AIC: 1960.9
## 
## Number of Fisher Scoring iterations: 18

Check McFadden’s \(R^2\).

with(summary(glm_fits), 1 - deviance/null.deviance)

## [1] 0.7932703

Predict and display confusion matrix.

glm.probs <- predict(glm_fits , df_tst)
levels = c("Positive", "Negative")
cm <- as.matrix(table(glm.probs, Profit.2017)[levels,levels])
cm

##           Profit.2017
## glm.probs  Positive Negative
##   Positive     2472       33
##   Negative      220      587

Calculate evaluation metrics:

calc_metrics(cm)

## [1] "Accuracy 0.923611111111111"
## [1] "Precision: "
##  Positive  Negative 
## 0.9182764 0.9467742 
## [1] "Recall: "
##  Positive  Negative 
## 0.9868263 0.7273854 
## [1] "F1 score: "
##  Positive  Negative 
## 0.9513181 0.8227050

roc.curve(Profit.2017, glm.probs, plotit = F)

## Area under the curve (AUC): 0.933

Let’s do a quick comparison of the results for the unbalanced logistic regression and the two approaches followed to balance the data.

Method	Unbalanced	ROSE	Ovun.sample
AIC	1541.20	2856.60	1960.90
McFadden’s R^2	0.77	0.70	0.79
Accuracy	0.94	0.92	0.92
Precision	0.98	0.92	0.92
Recall	0.95	0.99	0.98
F1 score	0.97	0.95	0.95
AUC	0.87	0.94	0.93

We can see the unbalanced model performs actually really well as compared to the other models. It’s AIC is the lowest, while the evaluation metrics are close to the models trained with balanced data. However, the AUC score is higher for balanced data.

Both Ovun.sample and ROSE have a similar performance as well, but the lower AIC and higher McFadden’s \(R^2\) would make us choose the former over the latter.

Bayes classifier with Crossvalidation

Next we will try a naive Bayes classifier to compare to our logistic regression model. We use the unbalanced data since it proved to perform well with logistic regression.

nb_mod = train(
  form = Profit ~ Discount + Processing.Time + Sales + Region,
  data = df_trn,
  trControl = trainControl(method = "cv", number = 5),
  method = "nb"
)

nb_mod

## Naive Bayes 
## 
## 6682 samples
##    4 predictor
##    2 classes: 'Negative', 'Positive' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 5346, 5346, 5345, 5345, 5346 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.9319076  0.7527527
##    TRUE      0.9413348  0.7865320
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
##  = 1.

Unlike the previous models, here we have dropped the feature Sub.Category given an error in the training. We think the reason is the completeness of the data for this feature, in the sense that some Sub-Categories have one or very few points.

Predict and display confusion matrix:

nb.probs <- predict(nb_mod, df_tst)
levels = c("Positive", "Negative")
cm <- as.matrix(table(nb.probs , Profit.2017)[levels,levels])
cm

##           Profit.2017
## nb.probs   Positive Negative
##   Positive     2681      181
##   Negative       11      439

calc_metrics(cm)

## [1] "Accuracy 0.942028985507246"
## [1] "Precision: "
##  Positive  Negative 
## 0.9959138 0.7080645 
## [1] "Recall: "
##  Positive  Negative 
## 0.9367575 0.9755556 
## [1] "F1 score: "
##  Positive  Negative 
## 0.9654303 0.8205607

roc.curve(Profit.2017, nb.probs, plotit = F)

## Area under the curve (AUC): 0.852

The evaluation metrics are quite similar to those of logistic regression. Based on the slightly bigger AUC score of logistic regression we would choose that model over the naive Bayes classifier.

Linear discriminant analysis

Here we will use linear discriminant analysis to classify and model the unbalanced data. We also want to determine if it performs better than logistic regression and naive Bayes.

lda_mod = train(
  form = Profit ~ Discount + Sales + Region + Processing.Time + Sub.Category,
  data = df_trn,
  trControl = trainControl(method = "cv", number = 5),
  method = "lda"
)

lda_mod

## Linear Discriminant Analysis 
## 
## 6682 samples
##    5 predictor
##    2 classes: 'Negative', 'Positive' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 5344, 5346, 5346, 5346, 5346 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9429813  0.7962111

Predict and display confusion matrix.

lda.probs <- predict(lda_mod, df_tst)
levels = c("Positive", "Negative")
cm <- as.matrix(table(lda.probs , Profit.2017)[levels,levels])
cm

##           Profit.2017
## lda.probs  Positive Negative
##   Positive     2665      161
##   Negative       27      459

Calculate evaluation metrics:

calc_metrics(cm)

## [1] "Accuracy 0.943236714975845"
## [1] "Precision: "
##  Positive  Negative 
## 0.9899703 0.7403226 
## [1] "Recall: "
##  Positive  Negative 
## 0.9430290 0.9444444 
## [1] "F1 score: "
##  Positive  Negative 
## 0.9659297 0.8300181

roc.curve(Profit.2017, lda.probs, plotit = F)

## Area under the curve (AUC): 0.865

The evaluation metrics are as good as those for linear regression and Bayes classifier on unbalanced data.

Quadratic discriminant analysis

Finally,for the sake of completeness we use quadratic discriminant analysis to train the classification model.

qda_mod = train(
  form = Profit ~ Discount + Sales + Region + Processing.Time,
  data = df_trn,
  trControl = trainControl(method = "cv", number = 5),
  method = "qda"
)

qda_mod

## Quadratic Discriminant Analysis 
## 
## 6682 samples
##    4 predictor
##    2 classes: 'Negative', 'Positive' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 5345, 5346, 5346, 5345, 5346 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9346014  0.7655044

In the same way as with the Bayes classifier, we have dropped the feature Sub.Category given an error in the training.

Predict and display confusion matrix:

qda.probs <- predict(qda_mod, df_tst)
levels = c("Positive", "Negative")
cm <- as.matrix(table(qda.probs , Profit.2017)[levels,levels])
cm

##           Profit.2017
## qda.probs  Positive Negative
##   Positive     2656      179
##   Negative       36      441

Calculate evaluation metrics:

calc_metrics(cm)

## [1] "Accuracy 0.935084541062802"
## [1] "Precision: "
##  Positive  Negative 
## 0.9866270 0.7112903 
## [1] "Recall: "
##  Positive  Negative 
## 0.9368607 0.9245283 
## [1] "F1 score: "
##  Positive  Negative 
## 0.9611001 0.8040109

roc.curve(Profit.2017, qda.probs, plotit = F)

## Area under the curve (AUC): 0.849

Again, this model behaves similarly to the previously trained one on the unbalanced data. The AUC score is good enough to say the model achieves a good classification performance.

Analyzing a Superstore dataset

Laia Porcar, Luis Marcos López, Philippe Robert

7/9/2022

Installs and libraries

Importing Dataset

Data cleaning

Basic data cleaning

Outliers detection

Relevant questions

Exploratory Data Analysis

Analysis of discrete variables

Total orders by state

Total profit by state

Analysis of holiday and preholiday periods

Top 10 clients analysis

Analysis of continuous variables

Profit vs Sales

Profit vs Time

Profit vs Month

Profit vs Quantity

Quantity vs Discount

Correlation analysis

Linear Regression

Multiple linear regression

Lasso regression

Ridge regression

Interaction terms

Nonlinear terms

Final variable selection

Classification

Feature selection

Logistic regression

Unbalanced Dataset and Crossvalidation

Balanced Dataset and Crossvalidation

Bayes classifier with Crossvalidation

Linear discriminant analysis

Quadratic discriminant analysis