Solve the following business problem to be considered for an interview for the 10-week AI & ML internship program.
Eligible Applicants:
Graduate or Post-graduate students in Data Science, Machine Learning, or related fields.
Business Context:
BigFin, a leading financial institution, offers various products like checking and savings accounts, mortgages, auto loans, and credit cards. To maintain financial health and customer trust, BigFin relies on accurate credit risk assessments before extending credit to its customers.
Business Problem:
BigFin currently faces challenges in accurately predicting customer creditworthiness. Traditional methods relying solely on credit scores or financial statements may not capture the full picture. This can lead to:
Increased loan defaults: Loans issued to customers with high delinquency risk are more likely to default, resulting in financial losses for BigFin.
Synapse Labs is seeking Data Science and Machine Learning (ML) interns to develop a sophisticated credit risk assessment model for BigFin. This model will utilize a rich dataset containing transactional data and historical customer activity records. The goal is to:
Develop a credit risk scoring model: Build an ML model that analyzes various data points to predict the likelihood of a customer defaulting on a loan or credit line. This credit risk score will be a numerical representation of a customer's creditworthiness.
Improve credit risk assessment accuracy: By leveraging complex patterns in customer data, the model should significantly improve the precision of credit risk evaluations compared to traditional methods.
Challenge Steps:
Please use this synthetic dataset, which simulates real customer data from BigFin. You must solve this problem using this Google Colab template:
Step 1: Data Understanding and Preparation (Get Familiar with the Data!)
Explore the data structure:
Identify the different features (variables) present in the dataset.
Understand the data types associated with each feature (e.g., numerical, categorical).
Examine the distribution of values for each feature (e.g., histograms, boxplots).
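A minimal exploration sketch with pandas, assuming the dataset loads into a DataFrame (the file name and column names below are placeholders for illustration):

```python
import pandas as pd

# Placeholder path -- substitute the actual dataset file from the challenge.
# df = pd.read_csv("bigfin_customers.csv")

# Small stand-in frame so the sketch runs end to end.
df = pd.DataFrame({
    "age": [34, 51, 29, 46],
    "income": [52000.0, 87000.0, None, 61000.0],
    "product": ["auto_loan", "mortgage", "credit_card", "auto_loan"],
})

print(df.dtypes)                      # data type of each feature
print(df.describe())                  # distribution summary for numeric features
print(df["product"].value_counts())   # category frequencies
```

For visual inspection, `df.hist()` and `df.boxplot()` (or seaborn equivalents) give quick histograms and boxplots of the numeric features.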
Identify potential data issues:
Check for missing values and analyze their patterns (random, concentrated).
Look for outliers that might significantly skew the data.
Identify inconsistencies or errors in formatting (e.g., typos, date formats).
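The issue checks above can be sketched as follows; the columns and values are invented for illustration, and the 1.5 × IQR rule is one common outlier heuristic, not the only valid choice:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0, 950000.0, 58000.0],
    "signup_date": ["2021-01-05", "05/01/2021", "2021-02-11",
                    "2021-03-02", "2021-03-09"],  # note the mixed formats
})

# Count missing values per column
missing = df.isna().sum()
print(missing)

# Flag outliers with the IQR rule: beyond 1.5 * IQR outside the quartiles
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers)
```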
Step 2: Data Cleaning (Transforming Raw Data into Gold!)
Address missing values:
Decide on an appropriate imputation strategy (e.g., mean/median imputation, category-specific imputation).
Consider removing rows with excessive missing values if imputation isn't feasible.
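A short sketch of two common imputation strategies, on invented columns: median imputation for a numeric feature (robust to skew) and mode imputation for a categorical one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0, np.nan, 58000.0],
    "region": ["north", "south", "north", "south", np.nan],
})

# Median imputation for a (possibly skewed) numeric feature
df["income"] = df["income"].fillna(df["income"].median())

# Mode (most frequent category) imputation for a categorical feature
df["region"] = df["region"].fillna(df["region"].mode()[0])

print(df)
```

Rows with excessive missingness can be dropped instead, e.g. `df.dropna(thresh=k)` to keep only rows with at least `k` non-null values.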
Handle outliers:
Decide on a strategy for handling outliers (e.g., capping values, winsorization).
Document your rationale for chosen methods.
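One way to cap outliers is percentile-based winsorization, here via pandas `clip`; the 5th/95th percentile cutoffs are an illustrative choice, not a prescribed one:

```python
import pandas as pd

incomes = pd.Series([52000.0, 58000.0, 61000.0, 64000.0, 950000.0])

# Winsorize: clamp values to the 5th and 95th percentiles
lower, upper = incomes.quantile([0.05, 0.95])
capped = incomes.clip(lower=lower, upper=upper)
print(capped)
```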
Clean and standardize data:
Address formatting inconsistencies (e.g., standardize date formats, convert text to numeric codes).
Consider feature scaling if necessary (e.g., standardization, normalization).
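A sketch of both cleaning steps on invented data: parsing mixed date formats element-wise (so each string is interpreted independently), then standardizing a numeric feature to zero mean and unit variance with scikit-learn's `StandardScaler`:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "signup_date": ["2021-01-05", "2021/02/11", "11 Mar 2021"],  # mixed formats
    "income": [52000.0, 87000.0, 61000.0],
})

# Parse each date string individually so mixed formats all resolve
df["signup_date"] = df["signup_date"].apply(pd.to_datetime)

# Standardize: zero mean, unit variance
scaler = StandardScaler()
df[["income"]] = scaler.fit_transform(df[["income"]])
print(df)
```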
Use Python as the programming language for the solution.
Use Google Colab Notebook to create and submit your solution.
You may use existing machine learning or deep learning libraries and frameworks.
Ensure compliance with data privacy regulations and handle sensitive information appropriately.
Validation:
Validate the performance of your model on the withheld "validation" data.
Assessment Criteria
Data Exploration/Visualization: Assessment of the data exploration and visualization techniques used to understand the data, identify patterns, and gain insights that inform your modeling approach.
Model Overfitting/Underfitting Assessment: Compare RMSE between training and testing datasets. A large disparity suggests overfitting, while uniformly poor performance (high RMSE on both) indicates underfitting.
Model Assessment Metric: Measure the overall performance of the model using RMSE to quantify how well the model makes predictions. Use the training dataset to build the model and the test dataset to evaluate it on your side. The validation dataset will be used by Synapse Labs to assess how well your model performs. Please ensure that your code prints RMSE for training and test data.
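A minimal end-to-end sketch of the required RMSE reporting on synthetic data (the model choice and data here are placeholders); the same train-vs-test comparison supports the overfitting/underfitting check above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with the BigFin features and target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] * 3 + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# RMSE = sqrt(mean squared error); computed version-agnostically
rmse_train = mean_squared_error(y_train, model.predict(X_train)) ** 0.5
rmse_test = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"Train RMSE: {rmse_train:.3f}")
print(f"Test RMSE:  {rmse_test:.3f}")
```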
Model Explanations: Use techniques such as feature importance scores in tree-based models or coefficients in linear models to quantify the importance of each feature. Visualize these importances to provide a clear explanation.
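For a tree-based model, `feature_importances_` gives the scores directly; the sketch below prints them sorted (feature names are hypothetical), and a bar chart, e.g. `matplotlib`'s `barh`, over the same values would visualize them:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data where the third feature drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 5 * X[:, 2] + rng.normal(scale=0.1, size=400)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

names = ["age", "income", "utilization", "tenure"]  # hypothetical feature names
for name, score in sorted(zip(names, model.feature_importances_),
                          key=lambda t: -t[1]):
    print(f"{name:12s} {score:.3f}")
```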
Efficiency: Record the processing time taken on train, test and validation files.
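Processing time can be recorded with `time.perf_counter` around each stage; a sketch on placeholder data (the same pattern applies when reading and scoring the train, test, and validation files):

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data standing in for the train/test/validation files
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 10))
y = X @ rng.normal(size=10)

start = time.perf_counter()
model = LinearRegression().fit(X, y)
fit_time = time.perf_counter() - start

start = time.perf_counter()
_ = model.predict(X)
predict_time = time.perf_counter() - start

print(f"Fit time: {fit_time:.4f}s, predict time: {predict_time:.4f}s")
```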
Code Quality and Organization: Clarity, structure, and readability of the code.
Code Documentation: Detailed and clear documentation for each section of the code.