🗂️Link: GitHub Repository
📋Link: Project Report
1. Introduction
This project aims to segment customers into groups based on the provided data. A single CSV file of customer information, containing customer ID, gender, age, annual income, and spending score, is imported and analysed.
The dataset underwent processing stages including data loading, data cleaning, data exploration, data encoding, and data clustering using K-means and DBSCAN models with dimensionality reduction techniques.
Finally, model evaluation was performed. The insights from the data analysis and the results of the two clustering models will be presented and explained in the report.
2. Task 1: Data Loading and Data Exploration
2.1 Data loading
Firstly, the data file "customers.csv" was loaded into a Jupyter Notebook, and the data was then checked for missing values and duplicated rows.
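The loading and checks described above can be sketched as follows. A small inline sample stands in for the real "customers.csv" so the snippet is self-contained; the rows are illustrative, not from the actual dataset.

```python
import pandas as pd
from io import StringIO

# The report loads the full "customers.csv"; an inline sample with the
# same columns is used here so the sketch runs on its own (illustrative rows).
csv_data = StringIO(
    "CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)\n"
    "1,Male,19,15,39\n"
    "2,Male,21,15,81\n"
    "3,Female,20,16,6\n"
)
df = pd.read_csv(csv_data)

# Check for missing values and duplicated rows
n_missing = int(df.isnull().sum().sum())
n_duplicates = int(df.duplicated().sum())
print(f"Missing values: {n_missing}, duplicated rows: {n_duplicates}")
```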
2.2 Descriptive Statistics
The column exploration is summarised below:
Preliminary data description: The preliminary descriptive statistics reveal that the dataset contains 200 rows, with more female than male customers. Age ranges from 18 to 70 years (mean 38.85), annual income from 15 to 137 k$/year (mean 60.65), and spending score from 1 to 99 (mean 50.20). In addition, the standard deviations indicate noticeable variation within each column.
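The summary statistics above can be reproduced with pandas' `describe()`. The inline sample below is illustrative (three made-up rows, not the real 200), so the printed values differ from the reported ones; the method is the same.

```python
import pandas as pd
from io import StringIO

# Illustrative sample with the dataset's columns (the real file has 200 rows)
df = pd.read_csv(StringIO(
    "CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)\n"
    "1,Male,19,15,39\n"
    "2,Female,35,60,50\n"
    "3,Female,70,137,99\n"
))

# Per-column count, mean, std, min/max, and quartiles, as summarised above
stats = df.describe()
print(stats)

# Gender balance (the report notes more female than male customers)
print(df["Gender"].value_counts())
```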
Overall distribution of Age, Annual Income, and Spending Score:
Age: From Figure 1, the 30-40 age group has the highest density of customers, with more younger customers than older ones.
Annual Income: Most customers earn an annual income in the range of $50-90k per year; only a small portion earn more than $100k per year.
Spending Score: The histogram suggests that most customers have a spending score around 50-60, with similar numbers of customers at low scores (below 20) and high scores (80-100).
Figure 1 The distribution of Age, Annual Income, and Spending Score
The Distribution of Age, Annual Income, and Spending Score based on Gender: The insights are presented as follows:
Age: From Figures 2 and 3, the average ages of male and female customers are 39.8 and 38.1, respectively. Most male customers are between 30 and 40 years, while the most common age range for female customers is 30 to 35 years. Overall, younger customers are more likely to be female than male.
Figure 2 The distribution of Age, Annual Income, and Spending Score based on Gender
Figure 3 The distribution of Age based on Gender
Annual Income: From Figures 2 and 4, the average annual income of males is $62.2k/year, while females average $59.2k/year. The graphs show that both genders primarily have annual incomes in the range of $70-80k/year, with the second-highest tier at $60-70k/year. Additionally, few customers of either gender have incomes over $100k/year.
Figure 4 The distribution of Annual Income (k$) based on Gender
Spending Score: Despite lower average annual incomes, Figures 4 and 5 reveal that female customers spend more, based on their higher spending score distributions. The average spending score for females is 51.5, compared to 48.5 for males. The most common tier for male customers is 70-80, while most female customers have moderate scores between 40-60. In addition, both genders have similar numbers of very high spenders, with spending scores between 80 and 90.
Figure 5 The distribution of Spending Score (1-100) based on Gender
2.3 Correlation Analysis
Significant correlations between features are discussed as follows:
Pair Plots: The numerical columns are plotted as a scatter matrix, coloured by the gender column after encoding (female is 0, male is 1). Figure 6 shows the distributions across age, spending score, and annual income. It suggests no clear correlation between Age & Annual Income or Age & Spending Score when split by gender. However, Annual Income & Spending Score appears to show some pattern in the scatter plot.
Correlation Matrix: To explore the linear relationships between columns, Figure 7 illustrates weak dependencies among the data features, except for Age and Spending Score, which show some relationship with a correlation coefficient of -0.33.
Regression Plots: The Pearson correlation scores in Figure 8 suggest weak linear relationships between the columns. The strongest relationship occurs between Spending Score and Age for female customers, while male and female customers show a similar pattern when plotting Spending Score against Annual Income.
Figure 8 Regression plots
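The correlation matrix behind a heatmap like Figure 7 can be computed as below. The dataframe here is a synthetic stand-in (random values in the dataset's observed ranges), so its coefficients will not match the reported -0.33; only the method is shown.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the four numeric columns after gender encoding;
# the report computes the same matrix on the real customer dataframe.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Gender": rng.integers(0, 2, 200),
    "Age": rng.integers(18, 71, 200),
    "Annual Income (k$)": rng.integers(15, 138, 200),
    "Spending Score (1-100)": rng.integers(1, 100, 200),
})

# Pearson correlation matrix underlying a heatmap such as Figure 7
corr = df.corr()
print(corr.round(2))
```

A pair plot and heatmap like Figures 6 and 7 could then be drawn with, for example, seaborn's `pairplot(df, hue="Gender")` and `heatmap(corr, annot=True)`; the report does not name its plotting library, so this tooling is an assumption.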
2.4 Data Encoding
This step involves data encoding using the method LabelEncoder(), which is applied to the "Gender" column to convert categorical data into numerical data for further model processing. After this encoding step, female and male genders are represented by "0" and "1", respectively.
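The encoding step above might look like the following sketch (the four-row frame is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Gender": ["Female", "Male", "Female", "Male"]})

# LabelEncoder sorts classes alphabetically, so "Female" -> 0 and
# "Male" -> 1, matching the mapping described in the report
le = LabelEncoder()
df["Gender"] = le.fit_transform(df["Gender"])
print(list(le.classes_))       # ['Female', 'Male']
print(df["Gender"].tolist())   # [0, 1, 0, 1]
```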
2.5 Dimensionality Reduction
Principal Component Analysis (PCA) is a dimensionality reduction method that simplifies the dataset while retaining the most significant information. Normally, the data should be normalised before PCA to avoid biased results; however, since the min-max ranges of this dataset are quite similar across dimensions, PCA is applied to the dataframe without scaling. The PCA variance visualisation is illustrated in Figure 9, and Figure 10 shows that selecting two components captures more than 85% of the total variance. These two components are then used as the clustering features in subsequent steps. After applying PCA, the two new features range from -70 to 80, as shown in Figure 10.
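The two-component PCA step could be sketched as follows. The input frame is a synthetic stand-in with the dataset's columns, so its explained-variance ratios will differ from the ~85% obtained on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for the encoded customer dataframe (illustrative only;
# the real data yields the ~85% two-component variance noted above)
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "Gender": rng.integers(0, 2, 200),
    "Age": rng.integers(18, 71, 200),
    "Annual Income (k$)": rng.integers(15, 138, 200),
    "Spending Score (1-100)": rng.integers(1, 100, 200),
})

# PCA without prior scaling, as in the report, keeping two components
pca = PCA(n_components=2)
pca_df = pd.DataFrame(pca.fit_transform(X), columns=["PC1", "PC2"])
print(pca.explained_variance_ratio_)  # variance fraction per component
```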
3. Task 2: Analysis of Predictions for the Classifier
Before implementing the clustering models, there are some justifications behind this decision:
Why is clustering the right task for this goal? Clustering is an algorithm that groups data based on similarity. Therefore, customer segmentation based on shopping behaviour and other related information can use a clustering model as a suitable solution.
What clustering algorithm is appropriate for the task, and why? K-means and DBSCAN are chosen for this assignment as two popular clustering algorithms with different characteristics. For example, DBSCAN is a density-based clustering algorithm that can discover clusters of arbitrary shapes, whereas K-means is a centroid-based clustering algorithm that assumes clusters are spherical [1]. To determine the most suitable clustering model, both models will be applied and compared in terms of their performance.
How does Task 1 help to decide about the clustering algorithm? The preliminary data analysis in Task 1, such as data distributions, descriptive statistics, and correlations, fosters an understanding of data patterns and supports the justification of each algorithm's clustering performance.
The next part discusses the implementation steps, results, and performance comparison between K-means and DBSCAN.
3.1 K-means
Choosing the optimal K: A suitable K value for K-means is selected with the Elbow method, where the point of inflection on the curve indicates the best model fit [2]. Figure 11 indicates that the optimal K is 5 clusters.
Figure 11 Elbow method for K-means
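An elbow curve like Figure 11 comes from plotting K-means inertia against K. The sketch below uses five synthetic 2-D blobs in place of the PCA dataframe; the inertia values are illustrative, but the sharp drop up to K=5 mirrors the elbow the report finds.

```python
import numpy as np
from sklearn.cluster import KMeans

# Five well-separated synthetic blobs stand in for the PCA-reduced data
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [40, 40], [40, -40], [-40, 40], [-40, -40]])
X = np.vstack([c + rng.normal(0, 5, size=(40, 2)) for c in centers])

# Elbow method: inertia (within-cluster sum of squares) for each candidate K;
# the K where the curve flattens (the "elbow") is chosen
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 11)
]
print([round(i) for i in inertias])
```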
K-means Clustering: After identifying a suitable K, the K-means model is fitted to "pca_df," the dataframe from the step above. The clustering results and plots are shown in Figures 12-15. The K-means model produces 5 clusters with the following number of customers in each: 82 in cluster 0, 39 in cluster 1, 22 in cluster 2, 34 in cluster 3, and 23 in cluster 4. When mapping cluster labels onto the PCA dataframe (Figure 13), the data form clearly separated groups in this 2D space. For more interpretable results, the cluster labels are plotted against the original dataframe in Figures 14 and 15. These show that the clustering successfully segments the customers into 5 groups when plotting Annual Income against Spending Score. In contrast, there is no significant pattern when plotting the Age column, and Gender also reveals no clustering pattern, given its binary values.
From the result, the K-means clustering model provides insights that represent the clusters as follows:
Group 0: Customers with moderate Spending Score and moderate Annual Income.
Group 1: Customers with high Spending Score and high Annual Income.
Group 2: Customers with high Spending Score and low Annual Income.
Group 3: Customers with low Spending Score and high Annual Income.
Group 4: Customers with low Spending Score and low Annual Income.
Figure 15 K-means Clusters Based on Original Dataset
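The fitting and per-cluster counting described above might look like the following sketch. Synthetic blobs replace `pca_df`, so the sizes printed here differ from the report's 82/39/22/34/23.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the 200-row PCA dataframe
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [40, 40], [40, -40], [-40, 40], [-40, -40]])
X = np.vstack([c + rng.normal(0, 5, size=(40, 2)) for c in centers])

# Fit K-means with the K chosen by the elbow method, then count cluster sizes
km = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = km.fit_predict(X)
sizes = np.bincount(labels)
print(sizes)
```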
K-means clustering performance: To examine the performance of K-means clustering, the Silhouette Visualizer is implemented. This method produces a score between -1 and +1, where scores near +1 indicate well-separated clusters and scores near 0 indicate overlapping clusters [3]. From Figure 16, the K-means clustering model achieves a score of 0.5526, indicating a moderate level of separation between clusters in this dataset.
Figure 16 Silhouette Score for K-means
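The report uses Yellowbrick's SilhouetteVisualizer; the same overall metric can be obtained with scikit-learn's `silhouette_score`, as sketched below on synthetic blobs (so the value differs from the reported 0.5526).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the PCA-reduced data
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [40, 40], [40, -40], [-40, 40], [-40, -40]])
X = np.vstack([c + rng.normal(0, 5, size=(40, 2)) for c in centers])

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette coefficient over all samples, bounded in [-1, +1]
score = silhouette_score(X, labels)
print(round(score, 4))
```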
3.2 DBSCAN
Choosing the optimal Epsilon: First, the n_neighbors parameter is varied within the range of 5 to 10 to identify the minimum number of points in each cluster. To determine the optimum epsilon value (the radius around each point), the NearestNeighbors method is implemented [4,5]. Through trial and error, as demonstrated in Figure 17, an epsilon value of 10 with a minimum sample size of 5 is identified as the most suitable for the DBSCAN model, providing good results compared to other parameter values.
Figure 17 k-Distance Graph
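A k-distance graph like Figure 17 can be built with `NearestNeighbors` as sketched below; the Gaussian point cloud is a stand-in for the PCA-reduced data, so the knee will not land exactly at 10 here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic 2-D points stand in for the PCA-reduced data
rng = np.random.default_rng(0)
X = rng.normal(0, 20, size=(200, 2))

# k-distance graph [4,5]: each point's distance to its 5th neighbour
# (the point itself counts as the first), sorted ascending; the "knee"
# of this curve suggests an epsilon value for DBSCAN
nn = NearestNeighbors(n_neighbors=5).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])
print(k_distances[0], k_distances[-1])
```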
DBSCAN clustering: After identifying the parameters, the clustering model is executed, and the results are provided in Figures 18-21. The results reveal that the DBSCAN model produces only 4 clusters and identifies some outliers, labelled as the '-1' cluster. When mapping cluster labels onto the PCA dataframe in Figure 19, some groups are mixed between clusters. Also, when plotting against the original dataset, particularly Spending Score against Annual Income, DBSCAN appears unable to distinguish the data well, given the dispersion of cluster number 2. Similar to the K-means results, there are no significant patterns when plotting Age against Spending Score.
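The DBSCAN run with the chosen parameters could be sketched as follows; on the synthetic blobs the cluster and noise counts differ from the report's, but the -1 noise label works the same way.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for the PCA-reduced data
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [40, 40], [40, -40], [-40, 40], [-40, -40]])
X = np.vstack([c + rng.normal(0, 5, size=(40, 2)) for c in centers])

# The report's parameters: eps=10, min_samples=5; points that fall in
# no dense region are labelled -1 (noise/outliers)
db = DBSCAN(eps=10, min_samples=5).fit(X)
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)
```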
DBSCAN clustering performance: The performance of DBSCAN clustering is also evaluated with the Silhouette Visualizer. From Figure 22, the DBSCAN clustering score is 0.4507, indicating a moderate-to-low level of separation; DBSCAN therefore performs worse than K-means.
Figure 22 Silhouette Score for DBSCAN
Which algorithm works better, and why? To compare model performance, two metrics are considered: the Silhouette score and the cluster plots against the original dataset. From the results above, it is evident that K-means achieves better performance than DBSCAN. K-means yields a higher Silhouette score of 0.5526, while the score from DBSCAN is only 0.4507. The clustering plots align with the scores: the Annual Income & Spending Score plot shows 5 well-isolated K-means clusters, while DBSCAN produces only 4 groups with some dispersed data.
The differences in results and performance likely relate to the characteristics of the dataset and the suitability of each algorithm. K-means is more suitable for this case because the clusters are roughly spherical, while DBSCAN, which performs well on data containing noise and outliers, may underperform when point density varies greatly across different parts of the dataset [1,6].
4. Conclusions
This assignment illustrates customer segmentation with clustering models and analyses their performance using relevant metrics and visualisations. After loading and cleansing the raw data, an exploratory analysis was conducted to deepen understanding of the data. The dataset was then transformed by converting categorical data to numerical data, and dimensionality reduction (Principal Component Analysis, or PCA) was applied, retaining only the most important components while preserving the significance of the dataset. The transformed data was then ready for clustering model training with K-means and DBSCAN.
Before model training, the optimal number of clusters (K) for K-means and the ε value for DBSCAN were identified using techniques aimed at achieving the best performance for each algorithm. Each model was then trained to predict cluster labels, and its performance was examined through cluster visualisation and the Silhouette score. The results clearly demonstrate that K-means performs better than DBSCAN on this customer dataset.
5. References
[1] "DBSCAN vs KMeans: A Guide in Python." New Horizons. [Accessed 9 May 2024].
[2] "Elbow Method." Scikit-Yellowbrick. [Accessed 9 May 2024].
[3] "Silhouette Analysis." Scikit-Yellowbrick. [Accessed 9 May 2024].
[4] Sefidian (2022). "How to Determine Epsilon and MinPts Parameters of DBSCAN Clustering." [Accessed 9 May 2024].
[5] "Nearest Neighbours." Cave of Python. [Accessed 9 May 2024].
[6] "DBSCAN: Make Density-Based Clusters By Hand." Towards Data Science. [Accessed 9 May 2024].