💵 Anomaly Detection for Bank Fraud
My individual contributions to the group term project include the literature review, data pre-processing, feature selection through principal component analysis, exploratory data analysis, research and comparison of ML models and evaluation metrics, and business implications and recommendations.
Presentation Deck Excerpts
🗼 Model Explainability for Telco Churn
While complex ML models like deep learning and ensemble methods often achieve higher accuracy, they present challenges in transparency, making it difficult for stakeholders to understand how decisions are made. This lack of interpretability is especially critical in fields such as healthcare, finance, and the justice system, where trust, fairness, and societal impact are paramount.
Model explainability techniques, such as SHapley Additive exPlanations (SHAP), help address this issue by quantifying the contribution of each feature (e.g., income, age, credit score) to the model’s prediction, providing insights into the decision-making process. This individual project investigates churn rates for a telecommunications company using the Random Forest Classifier and SHAP. I performed the key stages of the ML process, including data acquisition, preprocessing, transformation, model development, SHAP implementation, and feature importance visualization, before wrapping up with business recommendations.
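To make the idea of a per-feature contribution concrete, here is a minimal sketch that computes exact Shapley values for a toy scoring function. Everything in it is an assumption for illustration: the feature names, the baseline and customer values, and the `churn_score` function are hypothetical and are not the project's model or data. It also uses brute-force enumeration of feature orderings, whereas the shap library relies on faster, model-specific algorithms.

```python
from itertools import permutations

# Hypothetical baseline (reference) customer and the customer being explained.
# Feature names and values are illustrative assumptions, not project data.
BASELINE = {"tenure": 30.0, "monthly_charges": 65.0, "fiber_optic": 0.0}
CUSTOMER = {"tenure": 5.0, "monthly_charges": 95.0, "fiber_optic": 1.0}


def churn_score(x):
    """Toy stand-in for a trained model: returns a churn score."""
    score = 0.30
    score -= 0.004 * x["tenure"]
    score += 0.002 * x["monthly_charges"]
    # Interaction: fiber optic matters more for short-tenure customers.
    score += 0.15 * x["fiber_optic"] * (1.0 if x["tenure"] < 20 else 0.5)
    return score


def shapley_values(model, instance, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    features = list(instance)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)          # start from the baseline customer
        previous = model(current)
        for f in order:
            current[f] = instance[f]      # switch one feature to the customer's value
            updated = model(current)
            totals[f] += updated - previous
            previous = updated
    return {f: totals[f] / len(orderings) for f in features}


phi = shapley_values(churn_score, CUSTOMER, BASELINE)
for name, value in phi.items():
    print(f"{name}: {value:+.4f}")

# Efficiency property: the contributions add up to the gap between the
# customer's score and the baseline score.
gap = churn_score(CUSTOMER) - churn_score(BASELINE)
print(f"sum of contributions: {sum(phi.values()):+.4f}, score gap: {gap:+.4f}")
```

The final check illustrates why these values are convenient for stakeholders: the per-feature contributions always sum to the difference between the explained prediction and the baseline prediction.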
Code Excerpts
Results and Discussion
Dependence Plots. Customers are more likely to churn when:
Their internet service is fiber optic, as opposed to DSL or no internet.
Their tenure is under 20 years, after which the churn rate stabilizes and slowly decreases over time.
They pay by electronic check rather than mailed check, bank transfer (automatic), or credit card (automatic).
They do not have a two-year contract.
Their monthly charges are 30-40 dollars, more so at 70-100 dollars, and especially at 100-110 dollars. Customers paying 110-120 dollars, however, have lower SHAP values.
Their total charges are at the lower end (< 500). SHAP values are much lower and evenly distributed when total charges are 500-6000, and lower still above that.
Recommendations
Domain-Related
The top two features account for 45-48% of the model's feature importance, so the business can prioritize these with its time and budget.
The next set of recommendations is less important, but together they can still impact churn by 75%.
Model-Related
The model performed best with 6 trees in the forest (n_estimators searched over 5-25 in increments of 5), a tree depth of 5 (max_depth searched over 5-12 in increments of 2), and a minimum of 5 samples per leaf node (min_samples_leaf searched over 5-20 in increments of 5).
However, this configuration still yielded a low recall of 48.79%, meaning the model misses roughly half of the customers who actually churn. As mentioned earlier, false negatives are a huge blow for the business, not just in user count but also financially, since each missed churner reduces customer lifetime value (CLV) and increases customer acquisition cost (CAC). The comparatively high accuracy of 81% may point to overfitting, or, if non-churners are the majority class, to accuracy being inflated by correct predictions on that class.
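The gap between accuracy and recall is easier to see with a worked confusion matrix. The counts below are hypothetical: they were picked only so that the metrics land near the figures quoted above, and the class split (about a quarter of customers churning) is an assumption, not the project's actual test set.

```python
# Hypothetical confusion-matrix counts for 1,000 customers, chosen only so
# that the metrics land near the reported accuracy and recall.
tp, fn = 129, 136   # churners caught / churners missed (false negatives)
tn, fp = 681, 54    # non-churners correctly kept / falsely flagged

total = tp + fn + tn + fp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)          # share of actual churners the model catches
precision = tp / (tp + fp)       # share of churn predictions that are right

# Accuracy of a trivial model that always predicts "no churn".
majority_baseline = (tn + fp) / total

print(f"accuracy:          {accuracy:.2%}")
print(f"recall:            {recall:.2%}")
print(f"precision:         {precision:.2%}")
print(f"majority baseline: {majority_baseline:.2%}")
```

Under these assumed counts, a model that never predicts churn would already reach 73.5% accuracy, which is why recall is the more informative metric when missed churners are the costly error.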
I also observed that the features generated are slightly different from those of my groupmates, possibly because I 1) added min_samples_leaf as a hyperparameter, and 2) encoded binary values as 0 and 1 while they used -1 and 1.
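The hyperparameter search space described above can be written out as follows. This is a sketch that assumes both ends of each stated range are inclusive; the variable names `param_grid` and `candidates` are illustrative and not taken from the project code.

```python
from itertools import product

# Search space as described in the text, assuming inclusive range endpoints.
param_grid = {
    "n_estimators": list(range(5, 26, 5)),      # 5, 10, 15, 20, 25
    "max_depth": list(range(5, 13, 2)),         # 5, 7, 9, 11
    "min_samples_leaf": list(range(5, 21, 5)),  # 5, 10, 15, 20
}

# Every combination that an exhaustive grid search would have to fit.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

print(f"{len(candidates)} candidate settings")
print("first:", candidates[0])
print("last: ", candidates[-1])
```

A dictionary in this shape can be passed as the `param_grid` argument of scikit-learn's `GridSearchCV`, and setting `scoring="recall"` there would select the configuration by recall rather than accuracy.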
I recommend the following future experiments:
Reduce the number of features to the top 5, excluding Total Charges (first check its correlation with tenure).
Prioritize improving the Random Forest model by tuning additional hyperparameters.
Compare SHAP with LIME and PDP, using the same data preprocessing and Random Forest settings.
🛒 Association Rule Learning for Market Basket Analysis
Association rule learning is an unsupervised learning technique used to discover interesting relationships or patterns between variables in large datasets, specifically looking at how the presence of one item is dependent on the presence of another item. Its applications include web usage mining, social network analysis and sentiment analysis.
For this project, the specific use case is market basket analysis, helping grocery store owners uncover co-occurrences of items in customers’ shopping carts and better understand purchasing patterns.
In this group project, my main contributions were co-writing and streamlining code, generating insights, and developing recommendations.
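To illustrate how co-occurrence is measured in market basket analysis, the sketch below computes the three standard rule metrics (support, confidence, and lift) for pairs of items. The baskets are invented purely for illustration and are not the project's data; the helper names `support` and `rule_metrics` are likewise hypothetical.

```python
from itertools import permutations

# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def rule_metrics(antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    joint = support(antecedent | consequent)
    confidence = joint / support(antecedent)   # P(consequent | antecedent)
    lift = confidence / support(consequent)    # > 1 suggests positive association
    return joint, confidence, lift


items = sorted(set().union(*baskets))
for a, c in permutations(items, 2):
    s, conf, lift = rule_metrics({a}, {c})
    if s > 0:
        print(f"{a} -> {c}: support={s:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```

A lift above 1 means the two items appear together more often than they would if purchases were independent, which is the kind of signal a store owner might use for shelf placement or bundling.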
Key Insights
Recommendations