A. Background:
Dataset of an online electronic store in 2020 on Kaggle. Tool using: Python and Tableau. Understanding the dataset B. Objective
Strategic Goal: The primary aim is to decode the embedded data narratives to empower the company in sculpting informed business strategies. By asking poignant questions and applying rigorous analysis, the vision is to uncover actionable insights that bolster decision-making processes.
C. Analyzing Summary
D. Execute
What to do?
Uncovered data quality issues and executed cleaning process. Identified peak sales periods and products with the highest sales. Conducted an extensive analysis to determine customer purchasing patterns.
I. Understanding the dataset:
2,633,521 records with 8 variables Table 1: Sample of dataset
Table 2: Describe the dataset
Variables meaning and formating
II. Dealing with missing and error data:
1. Missing data
Figure 1: Missing data percentage
In the graph above, we can see that price and category_id columns have the same number of missing value. Look closely into those missing data rows, it shows some error data in the category_code and brand column: there are many numeric value on these two columns. This seems like an error in input data, those numeric values should be in the price and user_id column → So I move all error data into price and user_id respectively.
After that there are no pattern of other missing value in the dataset.
This is the dataset after modify. The price column is filled up.
Figure 2: Missing data percentage after modification
2. Error data
19,631 rows has the same event_time 1970-01-01 00:33:40 UTC, and do not have user_id. It looks like there might be an error in how the event time was converted from a UNIX format to a format that humans can understand. The suggestion is that instead of "1970," it should be "2020" in the UNIX format. This could indicate a problem with the user interface (UI) in the year 2020, which caused this incorrect date conversion.
In simpler terms, it seems like there's a mistake in the date of these rows, and it might be due to a problem in the computer system's user interface in the year 2020.
When we look closely at the event_time column, we can see that the data for the second, third, and fourth days of January, as well as part of November and the entire month of December, is missing. Table 2: Total revenue each day
III. Analyzing
Includes:
Sales Trends: Dissected sales data to pinpoint the best months, days, and hours for sales, revealing that August had the highest revenue despite June having more sales. Category Performance: Delved into category-specific analysis, demonstrating that electronics led revenue generation, accounting for nearly 49% of total sales. Purchasing Patterns: Employed association rule mining to identify frequently bought together items, providing a foundation for strategic product placement and bundled promotions. 1. Date and time best performance:
1.1. Best month to sale:
The graph above shows the best months to sales in the year:
June have the highest number in sales. But August have the highest in total revenue. → Average order value (AOV) of August is higher than June. The chart below shows AOV in each month, we can see that AOV in June is pretty low compare to others.
→ Let’s break down revenue in each category in the next part
Figure 3: No. of order and Average order value in each month
1.2. Best day in week
Figure 4: Best day in week base on Revenue
The graph above show the best day in the week to sale. We can see that customers tend to prefer shopping on weekend.
1.3. Best hour in day
Figure 5: Best hour in day base on Revenue
In this dataset, it shows that the most active time is in the morning, from 5 am to 12 pm and peak at 10 am.
→ Interpretations make us have some concern. If we have more necessary data, we can deep dive into did they surf in weekdays and decide in weekend or not; is forgein buyer make best hour in day in the morning?
But in this dataset I can deep dive into which category should be focus when in June and August have such a different in sale result.
2. Category analyze
Figure 6: Best selling category base on Revenue
Figure 7: Best selling category base on Revenue and Number of Sales in June
Figure 8: Best selling category base on Revenue and Number of Sales in August
As seen in the graph above, in June, Appliances significantly surpassed Electronics in the number of sales, yet the revenue from Appliances was only equivalent to that of Electronics. The Average Order Value (AOV) for the entire month of June also decreased significantly compared to other months, indicating that Appliances products have a lower AOV, which contributed to the overall decrease in June's AOV. This suggests the need for a campaign to boost the AOV for Appliances in general.
3. Frequently bought together
After running FPGrowth algorithm to determine which products are frequently bought together, it returns that among thousand products, only TV and Smartphone are statistical significance.
The association rule (TV) → (Smartphone) was found in 5.93% of transactions. The confidence level indicates that there is an 18.37% likelihood of smartphones being purchased when a TV is bought. Despite a lift score below 1, which suggests independence, the presence of TVs in transactions could influence smartphone sales through targeted marketing strategies. → almost 40% Category_code is missing, this can significant impact the result → ecommerce site should manage the data closely
→ can build machine learning to suggest upsell product in real time
E. Conclusion and Recommendations
Data Management: The importance of accurate and reliable database setup cannot be overstated. The analysis faced challenges due to missing and erroneous data, emphasizing the need for rigorous data integrity and quality control measures. Strategic Focus on High Revenue Months: Given that August has a higher AOV and total revenue, marketing efforts and promotions should be strategically planned around this period to maximize sales. Category-Specific Strategies: Electronics emerge as a key category for revenue generation. Tailored marketing campaigns, bundling offers, and promotions should focus on this category, along with efforts to increase the AOV of Appliances. Leveraging Purchasing Patterns: The identified association between TV and smartphone purchases, despite its statistical independence, suggests that marketing strategies such as bundled offers or discounts on smartphone purchases with TVs could stimulate additional sales. F. Further research
Due to the lack of data, this document serves as a descriptive report to provide information to users. To truly gain meaningful insights, it is necessary not only to have a more comprehensive dataset but also to integrate various other types of data to achieve a clearer overview. The following data should be considered for further examination after this report:
Analysis of purchase times throughout the day: It is essential to consider the geographical location of the users/sessions to check for appropriate time zones. Analysis of customer lifetime value to build a membership program. From the fixed cost, analyze profitability to develop promotional programs that attract users.