Question 1: Winter Dummy Variable
1. Patterns of the Data
df_soup = read.csv("https://raw.githubusercontent.com/jcbonilla/BusinessAnalytics/master/BAData/Progresso_Soup.csv")
str(df_soup)
df_soup$Month = as.factor(df_soup$Month)
#1.1 Data Patterns
#Figure out the pattern from 5 aspects:
#region sale, month sale, cumulative region sales, cumulative month sales, and competitive price percentage.
To get access to the patterns of data, 5 aspects could be analyzed based on region sales, month sales, cumulative region sales, cumulative month sales, and competitive price percentage.
a) Region Sale
df_soup_region = aggregate(df_soup$Sales.Progresso, by=list(df_soup$Region), sum)
summary(df_soup_region)
barplot(df_soup_region$x, names.arg = df_soup_region$Group.1, main='Sales by Region')
> summary(df_soup_region)
Group.1 x
Length:4 Min. :13193184
Class :character 1st Qu.:14280310
Mode :character Median :15510411
Mean :20738548
3rd Qu.:21968649
Max. :38740183
FOUND:
East region has the highest sales and south has the lowest sales. East region has much higher selling comparing to other three regions. NEXT MOVE:
It is recommend to dig into the east and south data to figure out why it sells the most and the least. Figure out why the difference between east and other region has such a large difference.
b) Month Sale
df_soup_month = aggregate(df_soup$Sales.Progresso , by=list(df_soup$Month), sum)
summary(df_soup_month)
> summary(df_soup_month)
Group.1 x
1 :1 Min. : 2103939
2 :1 1st Qu.: 3064259
3 :1 Median : 7506196
4 :1 Mean : 6912849
5 :1 3rd Qu.:10343104
6 :1 Max. :11921757
FOUND:
January has the highest sales and June has the lowest sales. Winter sales (Oct, Nov, Dec, Jan, and Feb) are higher than other seasonal sale. Summer sales’ performance (May, June, and July) is not good. NEXT MOVE:
Find out if the selling performance related to Weather condition since winter is cold and people are inclined to drink soup. Find out other reasons like discount or activities responding to sale change.
c) Cumulative Region Sale
df_soup_region_cul = aggregate(df_soup$Category_Sales, by=list(df_soup$Region), sum)
summary(df_soup_region_cul)
barplot(df_soup_region_cul$x, names.arg = df_soup_region_cul$Group.1, main='Cumulative Sales by Region')
> summary(df_soup_region_cul)
Group.1 x
Length:4 Min. :69744710
Class :character 1st Qu.:73754300
Mode :character Median :82707495
Mean :83664188
3rd Qu.:92617383
Max. :99497054
FOUND:
East region has the highest sales and west has the lowest sales. NEXT MOVE:
It is recommend to dig into the east and west data to figure out why it sells the most and the least.
d) Cumulative Month Sale
df_soup_month_cul = aggregate(df_soup$Category_Sales , by=list(df_soup$Month), sum)
summary(df_soup_month_cul)
barplot(df_soup_month_cul$x, names.arg = df_soup_month_cul$Group.1, main='Cumulative Sales by Month')
> summary(df_soup_month_cul)
Group.1 x
1 :1 Min. :14785276
2 :1 1st Qu.:18502298
3 :1 Median :29149972
4 :1 Mean :27888063
5 :1 3rd Qu.:38456885
6 :1 Max. :41466885
FOUND:
December has the highest sales and June has the lowest sales. Winter sales (Oct, Nov, Dec, Jan, and Feb) are still higher than other seasonal sale. Summer sales’ performance (May, June, and July) is still not good. NEXT MOVE:
Find out if the cumulative selling performance related to Weather condition since winter is cold and people are inclined to drink soup. Find out other reasons like discount or activities responding to sale change. e) Competitive Price Percentage
#Subset data to see the percentage of competitive price(low) comapring with competitors.
df_soup_com_price = subset(df_soup,
df_soup$Price.Progresso < df_soup$Price.Campbell &
df_soup$Price.Progresso < df_soup$Price.PL)
df_soup_com_price_1 = subset(df_soup,
df_soup$Price.Progresso > df_soup$Price.Campbell &
df_soup$Price.Progresso < df_soup$Price.PL)
df_soup_com_price_2 = subset(df_soup,
df_soup$Price.Progresso < df_soup$Price.Campbell &
df_soup$Price.Progresso > df_soup$Price.PL)
competitive_price_rate = nrow(df_soup_com_price)/nrow(df_soup)
competitive_price_rate
competitive_price_rate_1 = nrow(df_soup_com_price_1)/nrow(df_soup)
competitive_price_rate_2 = nrow(df_soup_com_price_2)/nrow(df_soup)
competitive_price_rate_mid = competitive_price_rate_1 + competitive_price_rate_2
competitive_price_rate_mid
FOUND:
The percentage of lowest price of Progresso among the two competitors is 6.68%. The percentage of mid price of Progresso among the two competitors is 19.67%. Comparing with other two competitors (Campbell and PL), Progresso usually holds a higher price. NEXT MOVE:
Dig into higher and lower price comparing with other competitors based on different region and month. List out the price change during different region and month for every competitors. 2. Patterns of Winter Data
#Create dummy variables for "Winter"
df_soup$winter_Oct = ifelse(df_soup$Month == "10",1, 0)
df_soup$winter_Nov = ifelse(df_soup$Month == "11",1, 0)
df_soup$winter_Dec = ifelse(df_soup$Month == "12",1, 0)
df_soup$winter_Jan = ifelse(df_soup$Month == "1",1, 0)
df_soup$winter_Feb = ifelse(df_soup$Month == "2",1, 0)
df_soup$winter = ifelse(df_soup$Month == "10" | df_soup$Month =="11" |
df_soup$Month == "12" | df_soup$Month == "1" |
df_soup$Month == "2",1, 0)
df_soup$non_winter = ifelse(df_soup$Month != "10" & df_soup$Month !="11" &
df_soup$Month != "12" & df_soup$Month != "1" &
df_soup$Month != "2",1, 0)
To get access to the patterns of winter data, 5 aspects could be analyzed based on region sales, month sales, cumulative region sales, cumulative month sales, and competitive price percentage.
a) Region Sale
df_soup_region_win = aggregate(df_winter$Sales.Progresso, by=list(df_winter$Region), sum)
summary(df_soup_region_win)
> summary(df_soup_region_win)
Group.1 x
Length:4 Min. : 8631430
Class :character 1st Qu.: 9201755
Mode :character Median : 9987724
Mean :13559340
3rd Qu.:14345310
Max. :25630483
FOUND:
East region has the highest sales and south has the lowest sales in winter. East region has much higher selling comparing to other three regions in winter. NEXT MOVE:
It is recommend to dig into the east and south data to figure out why it sells the most and the least in winter. Figure out why the difference between east and other region has such a large difference for winter. b) Month Sale
df_soup_month_win = aggregate(df_winter$Sales.Progresso , by=list(df_winter$Month), sum)
summary(df_soup_month_win)
> summary(df_soup_month_win)
Group.1 x
1 :1 Min. : 9805258
2 :1 1st Qu.:10316008
10 :1 Median :10424391
11 :1 Mean :10847472
12 :1 3rd Qu.:11769947
3 :0 Max. :11921757
(Other):0
FOUND:
January has the highest sales and February has the lowest sales. NEXT MOVE:
Find out if the selling performance related to Weather condition among winter month. c) Cumulative Region Sale
df_soup_region_cul_win = aggregate(df_winter$Category_Sales, by=list(df_winter$Region), sum)
summary(df_soup_region_cul_win)
> summary(df_soup_region_cul_win)
Group.1 x
Length:4 Min. :39069926
Class :character 1st Qu.:41758940
Mode :character Median :46555127
Mean :47717724
3rd Qu.:52513911
Max. :58690714
Cumulative Region Sale in Winter
0 FOUND:
East region has the highest sales and west has the lowest sales. NEXT MOVE:
It is recommend to dig into the east and west data to figure out why it sells the most and the least. d) Cumulative Month Sale
df_soup_month_cul_win = aggregate(df_winter$Category_Sales , by=list(df_winter$Month), sum)
summary(df_soup_month_cul_win)
> summary(df_soup_month_cul_win)
Group.1 x
1 :1 Min. :32237737
2 :1 1st Qu.:38143448
10 :1 Median :39397196
11 :1 Mean :38174179
12 :1 3rd Qu.:39625629
3 :0 Max. :41466885
(Other):0
Cumulative Month Sale in Winter
0 FOUND:
December has the highest sales and February has the lowest sales. NEXT MOVE:
Find out if the cumulative selling performance related to Weather condition among winter month. e) Competitive Price Percentage
#Subset Winter data to see the percentage of competitive price(low) comapring with competitors.
df_soup_com_price_win = subset(df_winter,
df_winter$Price.Progresso < df_winter$Price.Campbell &
df_winter$Price.Progresso < df_winter$Price.PL)