This project simulates a real-world Machine Learning Engineer workflow.
It covers end-to-end development: model training, cloud storage, batch scoring on Databricks, MLflow experiment tracking, and saving scored results back to Azure Blob Storage.
Workflow Summary
1. Local Development – Simulated Data + Model Training + MLflow Tracking
Created synthetic transaction data for fraud detection (features such as amount, hour, country, device type, merchant category, and an international flag)
- Built a complete sklearn Pipeline with a ColumnTransformer (OneHotEncoder + StandardScaler)
- Tracked training with MLflow, logging metrics such as ROC AUC and accuracy
- Exported the final model as fraud_model.pkl using joblib → this file is later used for batch scoring in Databricks
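The training step above can be sketched as follows. This is a minimal illustration, not the project's exact code: the feature distributions, label rule, and LogisticRegression estimator are assumptions chosen to keep the example self-contained, and MLflow logging is skipped gracefully if MLflow is not installed.

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic transaction data (distributions are illustrative assumptions)
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "amount": rng.gamma(2.0, 120.0, n),
    "hour": rng.integers(0, 24, n),
    "country": rng.choice(["US", "GB", "DE", "NG"], n),
    "device_type": rng.choice(["mobile", "desktop", "tablet"], n),
    "merchant_category": rng.choice(["retail", "travel", "gaming"], n),
    "is_international": rng.integers(0, 2, n),
})
# Toy label rule: large international transactions are flagged as fraud
df["is_fraud"] = ((df["amount"] > 400) & (df["is_international"] == 1)).astype(int)

numeric = ["amount", "hour", "is_international"]
categorical = ["country", "device_type", "merchant_category"]

# ColumnTransformer (OneHotEncoder + StandardScaler) inside a single Pipeline
pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="is_fraud"), df["is_fraud"],
    test_size=0.25, random_state=0, stratify=df["is_fraud"],
)
pipeline.fit(X_train, y_train)

proba = pipeline.predict_proba(X_test)[:, 1]
metrics = {
    "roc_auc": roc_auc_score(y_test, proba),
    "accuracy": accuracy_score(y_test, pipeline.predict(X_test)),
}
print(metrics)

# MLflow tracking (skipped here if mlflow is not available locally)
try:
    import mlflow
    with mlflow.start_run(run_name="fraud-training"):
        mlflow.log_metrics(metrics)
except ImportError:
    pass

# Export the fitted pipeline for batch scoring in Databricks
joblib.dump(pipeline, "fraud_model.pkl")
```

Bundling preprocessing and the model in one Pipeline means the exported fraud_model.pkl carries its own feature engineering, so the Databricks scoring job only needs raw columns.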
2. Azure Blob Storage – Raw & Processed Data
Created a dedicated Azure Storage Account with two containers:
- one holding the incoming transaction data (fraud_new.csv)
- one storing the final scored results after Databricks processing
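The storage setup above could be provisioned with the Azure CLI along these lines. The resource group, account, and container names are placeholders, not the project's actual values.

```shell
# Placeholder names -- substitute your own resource group and account
az storage account create \
  --name fraudmlstorage \
  --resource-group fraud-ml-rg \
  --sku Standard_LRS

# Two containers: one for incoming data, one for scored output
az storage container create --name raw --account-name fraudmlstorage
az storage container create --name scored --account-name fraudmlstorage

# Upload the incoming transactions file
az storage blob upload \
  --account-name fraudmlstorage \
  --container-name raw \
  --file fraud_new.csv \
  --name fraud_new.csv
```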
3. Azure Databricks – Workspace & Model Setup
- Deployed a new Azure Databricks Workspace
- Created a compute cluster (fraud-cluster) using Databricks Runtime ML
- Uploaded the trained model (fraud_model.pkl) into the workspace
4. Databricks Notebook – End-to-End Batch Scoring Pipeline
- Connect Databricks to Azure Blob Storage
- Load new data from Blob Storage and convert the Spark DataFrame to pandas
- Apply feature engineering inside Databricks and generate predictions
- Log each scoring run to MLflow
- Save scored results back to Azure Blob Storage; the output can be downloaded and explored as CSV files
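The notebook steps above can be sketched as below. This fragment only runs inside a Databricks notebook (it relies on the `spark` and `dbutils` globals), and the storage account, container, secret scope, and model path are assumed placeholder names.

```python
# Databricks notebook sketch -- names below are placeholders
import joblib
import mlflow
import pandas as pd

# 1. Connect Databricks to Azure Blob Storage (key stored in a secret scope)
spark.conf.set(
    "fs.azure.account.key.fraudmlstorage.blob.core.windows.net",
    dbutils.secrets.get(scope="fraud-ml", key="storage-account-key"),
)

# 2. Load new data from Blob Storage and convert Spark -> pandas
sdf = spark.read.csv(
    "wasbs://raw@fraudmlstorage.blob.core.windows.net/fraud_new.csv",
    header=True, inferSchema=True,
)
pdf = sdf.toPandas()

# 3. Apply the exported pipeline (feature engineering + model in one object)
model = joblib.load("/dbfs/FileStore/models/fraud_model.pkl")
pdf["fraud_score"] = model.predict_proba(pdf)[:, 1]

# 4. Log the scoring run to MLflow
with mlflow.start_run(run_name="batch-scoring"):
    mlflow.log_metric("rows_scored", len(pdf))
    mlflow.log_metric("mean_fraud_score", float(pdf["fraud_score"].mean()))

# 5. Save scored results back to Blob Storage as CSV
(spark.createDataFrame(pdf)
      .coalesce(1)
      .write.mode("overwrite")
      .option("header", True)
      .csv("wasbs://scored@fraudmlstorage.blob.core.windows.net/fraud_scored"))
```

Because the exported Pipeline includes its preprocessing, step 3 needs no separate feature-engineering code in the notebook beyond what the pickle already contains.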