Design and Development Document

In this document I describe the pipeline code as well as the MLOps practices proposed. My tone is conversational and direct, as my main objective is to be clear.
We want to build a system for AI/ML experimentation which allows for observability and reproducibility of experiments. Such a system allows an ML practitioner to:
run multiple experiments in parallel without fear of conflicts, or tying up the resources of their local workspace/workstation
share the results of an experiment with colleagues and teammates. As a colleague you have the means to reproduce the results of a specific run of an experiment
store all artifacts of a run for future purposes: reproducibility, writing a research paper, etc.
System Architecture
This system, in its final form, will be made up of a few main components:
A distributed cloud infrastructure, or serverless service, for running ML pipelines
A source code management (SCM) repository to manage versions of:
input parameters
timestamp’d query to fetch the data
infrastructure and environment configuration
data pre-processing code
training code
script to automate and orchestrate main components of a run (data loading, training, evaluation and persistence of artifacts)
A model store, responsible for managing and versioning models
Keeping track of versions of all of the above is crucial for end-to-end reproducibility. Preserving the timestamp’d query, combined with the design of an append-only datalake, ensures that we have an immutable query mechanism: replaying the same query at a later date returns exactly the same data.
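As an illustrative sketch of what "pinning all of the above" could look like in practice (the field names and values here are assumptions, not part of the design), a run manifest might bundle the code version, the timestamp’d query, and the input parameters, plus a content hash that gives the run a stable identity:

```python
import hashlib
import json

def build_run_manifest(git_commit, query, query_timestamp, input_params):
    """Pin everything needed to reproduce a run: code version, the
    timestamp'd data query, and the input parameters."""
    manifest = {
        "git_commit": git_commit,
        "query": query,
        # Replayable against the append-only datalake at any later date.
        "query_timestamp": query_timestamp,
        "input_params": input_params,
    }
    # A deterministic content hash identifies the run's full input state.
    manifest["manifest_hash"] = hashlib.sha256(
        json.dumps(manifest, sort_keys=True).encode()
    ).hexdigest()
    return manifest

manifest = build_run_manifest(
    git_commit="abc1234",
    query="SELECT * FROM events WHERE ts <= :as_of",
    query_timestamp="2023-01-15T00:00:00Z",
    input_params={"learning_rate": 0.01, "epochs": 20},
)
```

Two runs built from identical inputs produce the same hash, so the manifest doubles as a cheap equality check between "reproduced" and "original" runs.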

Roles interfacing with this system:
Platform developer: Responsible for developing the workflow and the process. Responsible for evangelizing the process with the ML team
ML team: Uses this system to run ML pipelines and to tweak code and hyper-parameters
Observability team: Responsible for detecting issues with models/infra
Business stakeholders: Monitor the progress of projects and coordinate with customers and product owners regarding milestones, etc.
MLOps practices

We want our ML practitioners to focus on developing algorithms and selecting the correct input to their modeling work. All other tasks are automated for them.
All code and configuration assets are checked into our SCM (git) system and are versioned. When an ML practitioner makes changes to their ML code or to the values in their input.json file, they do so on a branch in their local repo. They then commit and push these changes to the remote repo (after running a test with a few epochs on their local system). The act of merging this branch into the main branch (with or without a PR) kicks off the automated processes below:

Triggers the pipeline associated with the code repo. Before running the ML code the pipeline will:
build a docker container with the latest version of the ML code
update the python dependency requirements, if the contents of the requirements.txt file have changed
The remote cluster running the pipeline fetches updated input parameters and runs.
The output metrics and resulting model are stored in our model store
Pipeline performance is observable using Tensorboard
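As a sketch of the "fetch updated input parameters" step (the file layout, default values, and parameter names are assumptions for illustration), the pipeline might load and validate input.json like this:

```python
import json
from pathlib import Path

# Hypothetical defaults; the real parameter set lives in input.json
# under version control alongside the ML code.
DEFAULTS = {"learning_rate": 0.01, "epochs": 20, "batch_size": 32}

def load_input_params(path="input.json"):
    """Merge the versioned input.json over defaults, and reject unknown
    keys so a typo in a parameter name fails loudly instead of silently."""
    params = dict(DEFAULTS)
    p = Path(path)
    if p.exists():
        params.update(json.loads(p.read_text()))
    unknown = set(params) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unrecognized parameters: {sorted(unknown)}")
    return params
```

Failing fast on unknown keys keeps a mistyped hyper-parameter from quietly running the pipeline with defaults.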
We can set thresholds for performance and accuracy metrics to determine whether a job was successful, in which case all artifacts are preserved, versus flagging the artifacts for deletion. We can furthermore have a process to “promote” certain jobs, meaning they have produced acceptable results. Only models produced by such runs can be used to serve predictions to downstream systems in production, and the reproducibility SLA applies only to promoted jobs.
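A minimal sketch of such a promotion gate, assuming placeholder thresholds and metric names (in practice these would be configured per project):

```python
# Placeholder thresholds; real values would be project-specific configuration.
THRESHOLDS = {"accuracy": 0.90, "val_loss": 0.35}

def evaluate_run(metrics):
    """Decide the fate of a run's artifacts: flag for deletion,
    preserve, or promote (eligible to serve predictions downstream)."""
    if metrics.get("accuracy", 0.0) < THRESHOLDS["accuracy"]:
        return "flag_for_deletion"
    if metrics.get("val_loss", float("inf")) > THRESHOLDS["val_loss"]:
        return "preserve"  # successful run, but not good enough to promote
    return "promote"
```

Keeping the gate as a pure function of the run's metrics makes the promotion decision itself reproducible and easy to audit.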
Next phases and opportunity for enhancement

Building out the remote cluster and the CI/CD for building docker images and running pipeline code
Observability for models in production (to observe performance and detect model drift). For the purposes of the demo we are plotting the losses (training and validation) using python standard libraries. For production we will move this to Tensorboard or a similar tool. Tensorboard can visualize performance metrics of both classical machine learning models (such as linear or logistic regression) and (deep) neural network models, and it can log scalar values as well as histograms and other forms of data. We will log loss, error, accuracy, precision, etc.
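To make the logging contract concrete, here is a standard-library stand-in for the kind of scalar logging Tensorboard's SummaryWriter provides (the class and file format are illustrative assumptions, not what Tensorboard does internally):

```python
import json

class ScalarLogger:
    """Minimal stand-in for TensorBoard-style add_scalar logging:
    records (tag, step, value) so loss/accuracy curves can be
    replotted or shipped to a real observability tool later."""
    def __init__(self):
        self.records = []

    def add_scalar(self, tag, value, step):
        self.records.append({"tag": tag, "step": step, "value": value})

    def dump(self, path):
        # One JSON record per line; easy to ingest downstream.
        with open(path, "w") as f:
            for rec in self.records:
                f.write(json.dumps(rec) + "\n")

logger = ScalarLogger()
for epoch in range(3):
    logger.add_scalar("loss/train", 1.0 / (epoch + 1), epoch)
```

Swapping this for Tensorboard later only means replacing the logger object; the training loop's `add_scalar` calls stay the same.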

Addressing some (possible) questions:

Q: How do we ensure consistency between an ML practitioner’s local environment and the remote training infra?
A: We will use venv in local environments for a deterministic picture of the python library dependencies of a project. venv environments install packages with pip. We use the same pip requirements.txt file for building docker images. Our remote orchestration engine (Kubernetes, Ray, etc.) will orchestrate these containers.
When an ML practitioner introduces a new dependency to their project, they commit the change to the requirements.txt file along with the rest of their code to the remote repo.
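A minimal Dockerfile sketch showing how the same requirements.txt drives the remote image build (the base image, paths, and entry-point script name are assumptions for illustration):

```dockerfile
# Same requirements.txt the ML practitioner installs locally into venv.
FROM python:3.8-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# ML pipeline code, checked out at the commit that triggered the build.
COPY . .

# Placeholder entry point; the real script name comes from the repo.
CMD ["python", "main.py"]
```

Because both environments resolve dependencies from the identical pinned file, "works on my machine" and "works on the cluster" stay in sync.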
Run the demo

This demo showcases the practice of running training code and capturing all the important inputs and outputs (on local storage, for demo purposes) using a templatized pipeline that works independently of ML libraries and frameworks.
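The shape of such a framework-agnostic template can be sketched as a small orchestrator over plain callables (the stage names and toy stand-ins below are assumptions, not the demo's actual code):

```python
def run_pipeline(load_data, train, evaluate, persist, params):
    """Orchestrate one run from framework-agnostic callables: any ML
    library can plug in, as long as each stage is a plain function."""
    data = load_data(params)
    model = train(data, params)
    metrics = evaluate(model, data)
    persist(model, metrics, params)
    return metrics

# Toy stand-ins that show the contract; a real run would pass actual ML code.
metrics = run_pipeline(
    load_data=lambda p: list(range(p["n"])),
    train=lambda d, p: sum(d) / len(d),  # "model" here is just the mean
    evaluate=lambda m, d: {"mse": sum((x - m) ** 2 for x in d) / len(d)},
    persist=lambda m, met, p: None,      # demo would write to local storage
    params={"n": 5},
)
```

The orchestrator never imports an ML framework, which is what makes the same template reusable across projects.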
To run the demo, please unzip the code archive and run the file. But before doing so, please set up your virtual python (venv) environment following the instructions below (I developed this script on an M1 mac with python v. 3.8.16. On my local machine I manage my python versions per project using pyenv, and I highly recommend it):

Open a terminal or command prompt on your system.
Navigate to the directory where your requirements.txt file is located.
Run the following command to create a new virtual environment:
python -m venv venv
This will create a new virtual environment named venv.
Activate the virtual environment by running the appropriate command for your operating system:
For Windows:
venv\Scripts\activate
For Unix or Linux:
source venv/bin/activate
Once your virtual environment is activated, install the packages listed in the requirements.txt file by running the following command:
pip install -r requirements.txt
This will install all the required packages and their dependencies in your new virtual environment. Now you are ready to run the code: simply run python in your terminal. You can change parameter values by modifying the input.json file.
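The exact contents of input.json are project-specific; a hypothetical example of the kind of parameter values it might hold:

```json
{
  "learning_rate": 0.01,
  "epochs": 20,
  "batch_size": 32,
  "random_seed": 42
}
```

Because this file is versioned alongside the code, every tweak to these values is captured in the run's history.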
