Solve the following business problem to be considered for an interview for the 10-week AI & ML internship program.
Eligible Applicants:
Graduate or Post-graduate students in Data Science, Machine Learning, or related fields.
Business Context:
BigFin, a leading financial institution, offers various products like checking and savings accounts, mortgages, auto loans, and credit cards. To maintain financial health and customer trust, BigFin relies on accurate credit risk assessments before extending credit to its customers.
Business Problem:
BigFin currently faces challenges in accurately predicting customer creditworthiness. Traditional methods relying solely on credit scores or financial statements may not capture the full picture. This can lead to:
Increased loan defaults: Loans issued to customers with high delinquency risk are more likely to default, resulting in financial losses for BigFin.
Synapse Labs is seeking Data Science and Machine Learning (ML) interns to develop a sophisticated credit risk assessment model for BigFin. This model will utilize a rich dataset containing transactional data and historical customer activity records. The goal is to:
Develop a credit risk scoring model: Build an ML model that analyzes various data points to predict the likelihood of a customer defaulting on a loan or credit line. This credit risk score will be a numerical representation of a customer's creditworthiness.
Improve credit risk assessment accuracy: By leveraging complex patterns in customer data, the model should significantly improve the precision of credit risk evaluations compared to traditional methods.
Challenge Steps:
Please use this synthetic dataset, which simulates real customer data from BigFin. You must solve this problem using this Google Colab template:
Step 1: Data Understanding and Preparation (Get Familiar with the Data!)
Explore the data structure:
Identify the different features (variables) present in the dataset.
Understand the data types associated with each feature (e.g., numerical, categorical).
Examine the distribution of values for each feature (e.g., histograms, boxplots).
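A minimal exploration sketch with pandas, assuming the dataset loads into a DataFrame (the file name and column names below are placeholders for illustration):

```python
import pandas as pd

# Placeholder path -- substitute the actual dataset file from the challenge.
# df = pd.read_csv("bigfin_customers.csv")

# Small stand-in frame so the sketch runs end to end.
df = pd.DataFrame({
    "age": [34, 51, 29, 46],
    "income": [52000.0, 87000.0, None, 61000.0],
    "product": ["auto_loan", "mortgage", "credit_card", "auto_loan"],
})

print(df.dtypes)                      # data type of each feature
print(df.describe())                  # distribution summary for numeric features
print(df["product"].value_counts())   # category frequencies
```

For visual inspection, `df.hist()` and `df.boxplot()` (or seaborn equivalents) give quick histograms and boxplots of the numeric features.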
Identify potential data issues:
Check for missing values and analyze their patterns (random, concentrated).
Look for outliers that might significantly skew the data.
Identify inconsistencies or errors in formatting (e.g., typos, date formats).
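The issue checks above can be sketched as follows; the columns and values are invented for illustration, and the 1.5 × IQR rule is one common outlier heuristic, not the only valid choice:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0, 950000.0, 58000.0],
    "signup_date": ["2021-01-05", "05/01/2021", "2021-02-11",
                    "2021-03-02", "2021-03-09"],  # note the mixed formats
})

# Count missing values per column
missing = df.isna().sum()
print(missing)

# Flag outliers with the IQR rule: beyond 1.5 * IQR outside the quartiles
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers)
```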
Step 2: Data Cleaning (Transforming Raw Data into Gold!)
Address missing values:
Decide on an appropriate imputation strategy (e.g., mean/median imputation, category-specific imputation).
Consider removing rows with excessive missing values if imputation isn't feasible.
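A short sketch of two common imputation strategies, on invented columns: median imputation for a numeric feature (robust to skew) and mode imputation for a categorical one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0, np.nan, 58000.0],
    "region": ["north", "south", "north", "south", np.nan],
})

# Median imputation for a (possibly skewed) numeric feature
df["income"] = df["income"].fillna(df["income"].median())

# Mode (most frequent category) imputation for a categorical feature
df["region"] = df["region"].fillna(df["region"].mode()[0])

print(df)
```

Rows with excessive missingness can be dropped instead, e.g. `df.dropna(thresh=k)` to keep only rows with at least `k` non-null values.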
Handle outliers:
Decide on a strategy for handling outliers (e.g., capping values, winsorization).
Document your rationale for chosen methods.
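One way to cap outliers is percentile-based winsorization, here via pandas `clip`; the 5th/95th percentile cutoffs are an illustrative choice, not a prescribed one:

```python
import pandas as pd

incomes = pd.Series([52000.0, 58000.0, 61000.0, 64000.0, 950000.0])

# Winsorize: clamp values to the 5th and 95th percentiles
lower, upper = incomes.quantile([0.05, 0.95])
capped = incomes.clip(lower=lower, upper=upper)
print(capped)
```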
Clean and standardize data:
Address formatting inconsistencies (e.g., standardize date formats, convert text to numeric codes).
Consider feature scaling if necessary (e.g., standardization, normalization).
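A sketch of both cleaning steps on invented data: parsing mixed date formats element-wise (so each string is interpreted independently), then standardizing a numeric feature to zero mean and unit variance with scikit-learn's `StandardScaler`:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "signup_date": ["2021-01-05", "2021/02/11", "11 Mar 2021"],  # mixed formats
    "income": [52000.0, 87000.0, 61000.0],
})

# Parse each date string individually so mixed formats all resolve
df["signup_date"] = df["signup_date"].apply(pd.to_datetime)

# Standardize: zero mean, unit variance
scaler = StandardScaler()
df[["income"]] = scaler.fit_transform(df[["income"]])
print(df)
```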
Use Python as the programming language for the solution.
Use Google Colab Notebook to create and submit your solution.
You may use existing machine learning or deep learning libraries and frameworks.
Ensure compliance with data privacy regulations and handle sensitive information appropriately.
Validation:
Validate the performance of your model on the withheld "validation" data.
Assessment Criteria
Data Exploration/Visualization: Assessment of the data exploration and visualization techniques used to understand the data, identify patterns, and gain insights that inform your modeling approach.
Model Overfitting/Underfitting Assessment: Compare RMSE between training and testing datasets. A large disparity suggests overfitting, while uniformly poor performance (high RMSE on both) indicates underfitting.
Model Assessment Metric: Measure the overall performance of the model using RMSE to quantify how well the model makes predictions. Use the training dataset to build the model and the test dataset to evaluate it on your side. The validation dataset will be used by Synapse Labs to assess how well your model performs. Please ensure that your code prints RMSE for training and test data.
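A minimal end-to-end sketch of the required RMSE reporting on synthetic data (the model choice and data here are placeholders); the same train-vs-test comparison supports the overfitting/underfitting check above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with the BigFin features and target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] * 3 + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# RMSE = sqrt(mean squared error); computed version-agnostically
rmse_train = mean_squared_error(y_train, model.predict(X_train)) ** 0.5
rmse_test = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"Train RMSE: {rmse_train:.3f}")
print(f"Test RMSE:  {rmse_test:.3f}")
```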
Model Explanations: Use techniques such as feature importance scores in tree-based models or coefficients in linear models to quantify the importance of each feature. Visualize these importances to provide a clear explanation.
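For a tree-based model, `feature_importances_` gives the scores directly; the sketch below prints them sorted (feature names are hypothetical), and a bar chart, e.g. `matplotlib`'s `barh`, over the same values would visualize them:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data where the third feature drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 5 * X[:, 2] + rng.normal(scale=0.1, size=400)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

names = ["age", "income", "utilization", "tenure"]  # hypothetical feature names
for name, score in sorted(zip(names, model.feature_importances_),
                          key=lambda t: -t[1]):
    print(f"{name:12s} {score:.3f}")
```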
Efficiency: Record the processing time taken on train, test and validation files.
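Processing time can be recorded with `time.perf_counter` around each stage; a sketch on placeholder data (the same pattern applies when reading and scoring the train, test, and validation files):

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data standing in for the train/test/validation files
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 10))
y = X @ rng.normal(size=10)

start = time.perf_counter()
model = LinearRegression().fit(X, y)
fit_time = time.perf_counter() - start

start = time.perf_counter()
_ = model.predict(X)
predict_time = time.perf_counter() - start

print(f"Fit time: {fit_time:.4f}s, predict time: {predict_time:.4f}s")
```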
Code Quality and Organization: Clarity, structure, and readability of the code.
Code Documentation: Detailed and clear documentation for each section of the code.