DA-HW4-Xinyu Wu
Share
Explore

# Problem 1

Progresso Soup Sales
Question 1: Winter Dummy Variable
1. Patterns of the Data
str(df_soup)
df_soup\$Month = as.factor(df_soup\$Month)

#1.1 Data Patterns
#Figure out the pattern from 5 aspects:
#region sale, month sale, cumulative region sales, cumulative month sales, and competitive price percentage.

To get access to the patterns of data, 5 aspects could be analyzed based on region sales, month sales, cumulative region sales, cumulative month sales, and competitive price percentage.

a) Region Sale
df_soup_region = aggregate(df_soup\$Sales.Progresso, by=list(df_soup\$Region), sum)
summary(df_soup_region)
barplot(df_soup_region\$x, names.arg = df_soup_region\$Group.1, main='Sales by Region')
> summary(df_soup_region)
Group.1 x
Length:4 Min. :13193184
Class :character 1st Qu.:14280310
Mode :character Median :15510411
Mean :20738548
3rd Qu.:21968649
Max. :38740183
Region Sale
0
FOUND:
East region has the highest sales and south has the lowest sales.
East region has much higher selling comparing to other three regions.
NEXT MOVE:
It is recommend to dig into the east and south data to figure out why it sells the most and the least.
Figure out why the difference between east and other region has such a large difference.

b) Month Sale
df_soup_month = aggregate(df_soup\$Sales.Progresso , by=list(df_soup\$Month), sum)
summary(df_soup_month)

> summary(df_soup_month)
Group.1 x
1 :1 Min. : 2103939
2 :1 1st Qu.: 3064259
3 :1 Median : 7506196
4 :1 Mean : 6912849
5 :1 3rd Qu.:10343104
6 :1 Max. :11921757
Month Sale
0
FOUND:
January has the highest sales and June has the lowest sales.
Winter sales (Oct, Nov, Dec, Jan, and Feb) are higher than other seasonal sale.
Summer sales’ performance (May, June, and July) is not good.
NEXT MOVE:
Find out if the selling performance related to Weather condition since winter is cold and people are inclined to drink soup.
Find out other reasons like discount or activities responding to sale change.

c) Cumulative Region Sale
df_soup_region_cul = aggregate(df_soup\$Category_Sales, by=list(df_soup\$Region), sum)
summary(df_soup_region_cul)
barplot(df_soup_region_cul\$x, names.arg = df_soup_region_cul\$Group.1, main='Cumulative Sales by Region')
> summary(df_soup_region_cul)
Group.1 x
Length:4 Min. :69744710
Class :character 1st Qu.:73754300
Mode :character Median :82707495
Mean :83664188
3rd Qu.:92617383
Max. :99497054
Cumulative Region Sale
0
FOUND:
East region has the highest sales and west has the lowest sales.
NEXT MOVE:
It is recommend to dig into the east and west data to figure out why it sells the most and the least.

d) Cumulative Month Sale
df_soup_month_cul = aggregate(df_soup\$Category_Sales , by=list(df_soup\$Month), sum)
summary(df_soup_month_cul)
barplot(df_soup_month_cul\$x, names.arg = df_soup_month_cul\$Group.1, main='Cumulative Sales by Month')
> summary(df_soup_month_cul)
Group.1 x
1 :1 Min. :14785276
2 :1 1st Qu.:18502298
3 :1 Median :29149972
4 :1 Mean :27888063
5 :1 3rd Qu.:38456885
6 :1 Max. :41466885
Cumulative Month Sale
0
FOUND:
December has the highest sales and June has the lowest sales.
Winter sales (Oct, Nov, Dec, Jan, and Feb) are still higher than other seasonal sale.
Summer sales’ performance (May, June, and July) is still not good.
NEXT MOVE:
Find out if the cumulative selling performance related to Weather condition since winter is cold and people are inclined to drink soup.
Find out other reasons like discount or activities responding to sale change.
e) Competitive Price Percentage
#Subset data to see the percentage of competitive price(low) comapring with competitors.
df_soup_com_price = subset(df_soup,
df_soup\$Price.Progresso < df_soup\$Price.Campbell &
df_soup\$Price.Progresso < df_soup\$Price.PL)
df_soup_com_price_1 = subset(df_soup,
df_soup\$Price.Progresso > df_soup\$Price.Campbell &
df_soup\$Price.Progresso < df_soup\$Price.PL)
df_soup_com_price_2 = subset(df_soup,
df_soup\$Price.Progresso < df_soup\$Price.Campbell &
df_soup\$Price.Progresso > df_soup\$Price.PL)
competitive_price_rate = nrow(df_soup_com_price)/nrow(df_soup)
competitive_price_rate
competitive_price_rate_1 = nrow(df_soup_com_price_1)/nrow(df_soup)
competitive_price_rate_2 = nrow(df_soup_com_price_2)/nrow(df_soup)
competitive_price_rate_mid = competitive_price_rate_1 + competitive_price_rate_2
competitive_price_rate_mid
FOUND:
The percentage of lowest price of Progresso among the two competitors is 6.68%.
The percentage of mid price of Progresso among the two competitors is 19.67%.
Comparing with other two competitors (Campbell and PL), Progresso usually holds a higher price.
NEXT MOVE:
Dig into higher and lower price comparing with other competitors based on different region and month.
List out the price change during different region and month for every competitors.
2. Patterns of Winter Data
#Create dummy variables for "Winter"
df_soup\$winter_Oct = ifelse(df_soup\$Month == "10",1, 0)
df_soup\$winter_Nov = ifelse(df_soup\$Month == "11",1, 0)
df_soup\$winter_Dec = ifelse(df_soup\$Month == "12",1, 0)
df_soup\$winter_Jan = ifelse(df_soup\$Month == "1",1, 0)
df_soup\$winter_Feb = ifelse(df_soup\$Month == "2",1, 0)
df_soup\$winter = ifelse(df_soup\$Month == "10" | df_soup\$Month =="11" |
df_soup\$Month == "12" | df_soup\$Month == "1" |
df_soup\$Month == "2",1, 0)
df_soup\$non_winter = ifelse(df_soup\$Month != "10" & df_soup\$Month !="11" &
df_soup\$Month != "12" & df_soup\$Month != "1" &
df_soup\$Month != "2",1, 0)

To get access to the patterns of winter data, 5 aspects could be analyzed based on region sales, month sales, cumulative region sales, cumulative month sales, and competitive price percentage.

a) Region Sale
df_soup_region_win = aggregate(df_winter\$Sales.Progresso, by=list(df_winter\$Region), sum)
summary(df_soup_region_win)
> summary(df_soup_region_win)
Group.1 x
Length:4 Min. : 8631430
Class :character 1st Qu.: 9201755
Mode :character Median : 9987724
Mean :13559340
3rd Qu.:14345310
Max. :25630483
Region Sale in Winter
0
FOUND:
East region has the highest sales and south has the lowest sales in winter.
East region has much higher selling comparing to other three regions in winter.
NEXT MOVE:
It is recommend to dig into the east and south data to figure out why it sells the most and the least in winter.
Figure out why the difference between east and other region has such a large difference for winter.
b) Month Sale
df_soup_month_win = aggregate(df_winter\$Sales.Progresso , by=list(df_winter\$Month), sum)
summary(df_soup_month_win)
> summary(df_soup_month_win)
Group.1 x
1 :1 Min. : 9805258
2 :1 1st Qu.:10316008
10 :1 Median :10424391
11 :1 Mean :10847472
12 :1 3rd Qu.:11769947
3 :0 Max. :11921757
(Other):0
Month Sale in Winter
0
FOUND:
January has the highest sales and February has the lowest sales.
NEXT MOVE:
Find out if the selling performance related to Weather condition among winter month.
c) Cumulative Region Sale
df_soup_region_cul_win = aggregate(df_winter\$Category_Sales, by=list(df_winter\$Region), sum)
summary(df_soup_region_cul_win)
> summary(df_soup_region_cul_win)
Group.1 x
Length:4 Min. :39069926
Class :character 1st Qu.:41758940
Mode :character Median :46555127
Mean :47717724
3rd Qu.:52513911
Max. :58690714
Cumulative Region Sale in Winter
0
FOUND:
East region has the highest sales and west has the lowest sales.
NEXT MOVE:
It is recommend to dig into the east and west data to figure out why it sells the most and the least.
d) Cumulative Month Sale
df_soup_month_cul_win = aggregate(df_winter\$Category_Sales , by=list(df_winter\$Month), sum)
summary(df_soup_month_cul_win)
> summary(df_soup_month_cul_win)
Group.1 x
1 :1 Min. :32237737
2 :1 1st Qu.:38143448
10 :1 Median :39397196
11 :1 Mean :38174179
12 :1 3rd Qu.:39625629
3 :0 Max. :41466885
(Other):0
Cumulative Month Sale in Winter
0
FOUND:
December has the highest sales and February has the lowest sales.
NEXT MOVE:
Find out if the cumulative selling performance related to Weather condition among winter month.
e) Competitive Price Percentage
#Subset Winter data to see the percentage of competitive price(low) comapring with competitors.
df_soup_com_price_win = subset(df_winter,
df_winter\$Price.Progresso < df_winter\$Price.Campbell &
df_winter\$Price.Progresso < df_winter\$Price.PL)
df_soup_com_price_win_1 = subset(df_winter,
df_winter\$Price.Progresso > df_winter\$Price.Campbell &
df_winter\$Price.Progresso < df_winter\$Price.PL)
df_soup_com_price_win_2 = subset(df_winter,
df_winter\$Price.Progresso < df_winter\$Price.Campbell &
df_winter\$Price.Progresso > df_winter\$Price.PL)
competitive_price_rate_win = nrow(df_soup_com_price_win)/nrow(df_soup)
competitive_price_rate_win
competitive_price_rate_win_mid = (nrow(df_soup_com_price_win_1) +nrow(df_soup_com_price_win_2))/nrow(df_soup)
competitive_price_rate_win_mid
FOUND:
The percentage of lowest price of Progresso among the two competitors is 4.68%.
The percentage of mid price of Progresso among the two competitors is 11.45%.
Comparing with other two competitors (Campbell and PL), Progresso usually holds a higher price in winter.
NEXT MOVE:
Dig into higher and lower price comparing with other competitors based on different region and winter month.
List out the price change during different region and winter month for every competitors.
3. Market Share Analysis
market_share_win = sum(df_winter\$Sales.Progresso)/sum(df_soup_month\$x)
market_share_non_win = sum(df_non_winter\$Sales.Progresso)/sum(df_soup_month\$x)
market_share_win # The market share of Winter for Progresso is about 41.89%.
market_share_non_win
FOUND:
The market share for winter months is 65.38%.
The market share for non winter months is 34.62%.
Question 2: Linear Regression Model
Linear Regression
model.soup = glm(df_soup\$Sales.Progresso ~ df_soup\$Low_Income + df_soup\$High_Income +
df_soup\$Price.Progresso + df_soup\$Price.Campbell + df_soup\$Price.PL +
df_soup\$Category_Sales + df_soup\$winter_Oct +df_soup\$winter_Nov +
df_soup\$winter_Dec + df_soup\$winter_Jan + df_soup\$winter_Feb,
data=df_soup,
family = "gaussian")
summary(model.soup)

Call:
glm(formula = df_soup\$Sales.Progresso ~ df_soup\$Low_Income +
df_soup\$High_Income + df_soup\$Price.Progresso + df_soup\$Price.Campbell +
df_soup\$Price.PL + df_soup\$Category_Sales + df_soup\$winter_Oct +
df_soup\$winter_Nov + df_soup\$winter_Dec + df_soup\$winter_Jan +
df_soup\$winter_Feb, family = "gaussian", data = df_soup)

Deviance Residuals:
Min 1Q Median 3Q Max
-10488.6 -466.3 -14.2 414.4 31287.7

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.635e+03 4.118e+01 -39.692 < 2e-16 ***
df_soup\$Low_Income 9.073e+01 1.221e+01 7.429 1.11e-13 ***
df_soup\$High_Income 2.543e+02 1.236e+01 20.581 < 2e-16 ***
df_soup\$Price.Progresso -1.906e+03 1.575e+01 -121.059 < 2e-16 ***
df_soup\$Price.Campbell 2.178e+03 2.500e+01 87.127 < 2e-16 ***
df_soup\$Price.PL 1.092e+03 2.636e+01 41.433 < 2e-16 ***
df_soup\$Category_Sales 3.114e-01 1.063e-03 292.886 < 2e-16 ***
df_soup\$winter_Oct 1.067e+02 1.751e+01 6.095 1.10e-09 ***
df_soup\$winter_Nov -1.569e+02 1.744e+01 -8.994 < 2e-16 ***
df_soup\$winter_Dec -1.572e+02 1.721e+01 -9.131 < 2e-16 ***
df_soup\$winter_Jan 2.633e+02 1.900e+01 13.861 < 2e-16 ***
df_soup\$winter_Feb 8.762e+01 1.865e+01 4.699 2.62e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 1294457)

Null deviance: 2.7017e+11 on 59099 degrees of freedom
Residual deviance: 7.6487e+10 on 59088 degrees of freedom
AIC: 999482

Number of Fisher Scoring iterations: 2
Model strength
AIC value for this model is about 999482. If we compare this result with data without winter months, the AIC is lower which better illustrates the higher strength of results.
Variable Importance
Income:
If the customer with low income, the sale would be increased by 90.73.
If the customer with high income, the sale would be increased by 254.3.
Progresso Price:
As one unit increase of Progresso price, the sale would be decreased by 1906.
Competitor Price:
As one unit increase of Campbell price, the sale would be increased by 2178.
As one unit increase of PL price, the sale would be increased by 1092.
Category Sales:
As one unit increase of category sale, the sale would be increased by 0.3114.
Winter Month:
October: As one unit increase, the sale would be increased by 106.7.
November: As one unit increase, the sale would be decreased by 156.9.
December: As one unit increase, the sale would be decreased by 157.2.
January: As one unit increase, the sale would be increased by 263.3.
February: As one unit increase, the sale would be increased by 87.62.
Relationship between predictor and dependent variables
Since all the factors are significant due to the analysis:
Relationship: Predictor vs. Sale
0
Search
Predictor
Influence
Value
1
Low Income
increased
90.73
2
High Income
increased
254.3
3
Progresso price
decreased
1906
4
Campbell price
increased
2178
5
PL price
increased
1092
6
category sale
increased
0.3114
7
October
increased
106.7
8
November
decreased
156.9
9
December
decreased
157.2
10
January
increased
263.3
11
February
increased
87.62
There are no rows in this table
Progresso sale would increase with the increase of income level, competitors’ price(Campbell and PL), category sale, October, January, and February sale.
Progresso sale would decrease with the increase of Progresso price, November, and December sale.