This project simulates a real-world Machine Learning Engineer workflow.
It covers end-to-end development: model training, cloud storage, batch scoring on Databricks, MLflow experiment tracking, and saving scored results back to Azure Blob Storage.
Workflow Summary
1. Local Development – Simulated Data + Model Training + MLflow Tracking
Created synthetic transaction data for fraud detection (features such as amount, hour, country, device type, merchant category, and an international flag)
- Built a complete sklearn Pipeline with a ColumnTransformer (OneHotEncoder + StandardScaler)
- Tracked training with MLflow, logging metrics such as ROC AUC and accuracy
- Exported the final model as fraud_model.pkl using joblib → this file is later used for batch scoring in Databricks
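The training step above can be sketched as follows. This is a minimal illustration, not the project's exact code: the feature distributions, label rule, and LogisticRegression estimator are assumptions chosen to keep the example self-contained, and MLflow logging is skipped gracefully if MLflow is not installed.

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic transaction data (distributions are illustrative assumptions)
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "amount": rng.gamma(2.0, 120.0, n),
    "hour": rng.integers(0, 24, n),
    "country": rng.choice(["US", "GB", "DE", "NG"], n),
    "device_type": rng.choice(["mobile", "desktop", "tablet"], n),
    "merchant_category": rng.choice(["retail", "travel", "gaming"], n),
    "is_international": rng.integers(0, 2, n),
})
# Toy label rule: large international transactions are flagged as fraud
df["is_fraud"] = ((df["amount"] > 400) & (df["is_international"] == 1)).astype(int)

numeric = ["amount", "hour", "is_international"]
categorical = ["country", "device_type", "merchant_category"]

# ColumnTransformer (OneHotEncoder + StandardScaler) inside a single Pipeline
pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="is_fraud"), df["is_fraud"],
    test_size=0.25, random_state=0, stratify=df["is_fraud"],
)
pipeline.fit(X_train, y_train)

proba = pipeline.predict_proba(X_test)[:, 1]
metrics = {
    "roc_auc": roc_auc_score(y_test, proba),
    "accuracy": accuracy_score(y_test, pipeline.predict(X_test)),
}
print(metrics)

# MLflow tracking (skipped here if mlflow is not available locally)
try:
    import mlflow
    with mlflow.start_run(run_name="fraud-training"):
        mlflow.log_metrics(metrics)
except ImportError:
    pass

# Export the fitted pipeline for batch scoring in Databricks
joblib.dump(pipeline, "fraud_model.pkl")
```

Bundling preprocessing and the model in one Pipeline means the exported fraud_model.pkl carries its own feature engineering, so the Databricks scoring job only needs raw columns.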
2. Azure Blob Storage – Raw & Processed Data
Created a dedicated Azure Storage Account with two containers:
- one holding the incoming transaction data (fraud_new.csv)
- one storing the final scored results after Databricks processing
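The storage setup above could be provisioned with the Azure CLI along these lines. The resource group, account, and container names are placeholders, not the project's actual values.

```shell
# Placeholder names -- substitute your own resource group and account
az storage account create \
  --name fraudmlstorage \
  --resource-group fraud-ml-rg \
  --sku Standard_LRS

# Two containers: one for incoming data, one for scored output
az storage container create --name raw --account-name fraudmlstorage
az storage container create --name scored --account-name fraudmlstorage

# Upload the incoming transactions file
az storage blob upload \
  --account-name fraudmlstorage \
  --container-name raw \
  --file fraud_new.csv \
  --name fraud_new.csv
```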
3. Azure Databricks – Workspace & Model Setup
- Deployed a new Azure Databricks Workspace
- Created a compute cluster (fraud-cluster) using Databricks Runtime ML
- Uploaded the trained model (fraud_model.pkl) into the workspace
4. Databricks Notebook – End-to-End Batch Scoring Pipeline
- Connect Databricks to Azure Blob Storage
- Load new data from Blob Storage and convert the Spark DataFrame to pandas
- Apply feature engineering inside Databricks and generate predictions
- Log each scoring run to MLflow
- Save scored results back to Azure Blob Storage; the output can be downloaded and explored as CSV files
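The notebook steps above can be sketched as below. This fragment only runs inside a Databricks notebook (it relies on the `spark` and `dbutils` globals), and the storage account, container, secret scope, and model path are assumed placeholder names.

```python
# Databricks notebook sketch -- names below are placeholders
import joblib
import mlflow
import pandas as pd

# 1. Connect Databricks to Azure Blob Storage (key stored in a secret scope)
spark.conf.set(
    "fs.azure.account.key.fraudmlstorage.blob.core.windows.net",
    dbutils.secrets.get(scope="fraud-ml", key="storage-account-key"),
)

# 2. Load new data from Blob Storage and convert Spark -> pandas
sdf = spark.read.csv(
    "wasbs://raw@fraudmlstorage.blob.core.windows.net/fraud_new.csv",
    header=True, inferSchema=True,
)
pdf = sdf.toPandas()

# 3. Apply the exported pipeline (feature engineering + model in one object)
model = joblib.load("/dbfs/FileStore/models/fraud_model.pkl")
pdf["fraud_score"] = model.predict_proba(pdf)[:, 1]

# 4. Log the scoring run to MLflow
with mlflow.start_run(run_name="batch-scoring"):
    mlflow.log_metric("rows_scored", len(pdf))
    mlflow.log_metric("mean_fraud_score", float(pdf["fraud_score"].mean()))

# 5. Save scored results back to Blob Storage as CSV
(spark.createDataFrame(pdf)
      .coalesce(1)
      .write.mode("overwrite")
      .option("header", True)
      .csv("wasbs://scored@fraudmlstorage.blob.core.windows.net/fraud_scored"))
```

Because the exported Pipeline includes its preprocessing, step 3 needs no separate feature-engineering code in the notebook beyond what the pickle already contains.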