AI Model Lifecycle Checklist

This resource provides a framework for procurement officers to evaluate and monitor AI vendor tools at each stage of the product design pipeline once a vendor has been selected. It can be used to solicit answers to critical questions from AI vendors throughout the model design lifecycle.
For AI applications, this infographic first breaks down the procurement process into an “AI Model Lifecycle,” which aligns procurement officials and AI vendors around a common design process. Next, the framework provides key questions for procurement officials to ask at each design stage when evaluating and monitoring AI products. Finally, it recommends sharable artifacts and documentation that vendors should produce at each stage of the machine learning lifecycle. These artifacts can be used during the procurement process to promote transparency between the procuring organization and the vendor.
Each section below follows a similar format for each stage of the AI Model Lifecycle:
First, there is a summary of the model lifecycle stage.
Next is a list of questions for procurement officials to ask vendors that are specific to each stage.
Then comes a list of the procurement clauses in the Responsible Language Generator tool that are relevant to each stage.
Finally, there is a list of additional resources where procurement officials can learn more about advanced concepts or industry standards.



The AI Design Cheat Sheet
An all-in-one infographic for questions and tools that procurement officers can use to guide contract vendors in responsible development of AI tools.
[Infographic: The AI Model Lifecycle]

Project Scoping

Summary
The project scoping stage is critical for defining product requirements; mapping out anticipated users, impacted communities, and key stakeholders; and establishing a framework for defining what is in scope, what is beyond the capabilities of the system, and how the tool improves upon the current status quo.
There should be a clear mission statement around the purpose of the tool, how it would be used, and what metrics are needed to evaluate its performance. A risk analysis assessment is also useful at this point to flag potential design risks for patient health and vulnerable populations that might be negatively impacted by the tool.
Potential Questions for Procurement Officers to Ask Vendors
What problem is this algorithm solving?
What communities will be impacted by this algorithm and how should risk be measured?
What alternatives were considered for solving this problem? What would be the result of doing nothing?

This phase of the project lifecycle focuses on establishing the purpose and goals of the project to anchor the technical development work. Investing time and resources into the scoping phase provides clarity to model developers when making decisions around risk tolerance, privacy and security considerations, potential sources of bias, and requirements for availability, performance speed, and precision levels for predictions. These questions ask AI vendors to offer documentation and plans that verify that they understand the problem the tool is solving, that the tool will offer improvements over the status quo, and that they are anticipating how and where the tool may run into performance issues.
Relevant Procurement Template Clauses
Provision 5 (Transparency): “Bids incorporating frameworks for soliciting user design feedback from healthcare delivery professionals like doctors and nurses, as well as patients, will be favored.”
Helpful Tools for This Stage
Impact assessments have been used in other industries to assess privacy, environmental, and human rights risks. Algorithmic impact assessments are high-level frameworks for establishing project purpose, success metrics, and potential risks at the beginning of the project.
Product requirements documents are traditionally used by stakeholders to communicate the product requirements to software developers, establishing the full set of features needed for the project release to be considered complete. This can be an effective resource for procurement officers as they establish a contract with AI vendors on the overall product deliverable.
A stakeholder map can be an effective way to map out communities that are involved or impacted by the development and usage of the AI tool. These can be helpful to procurement officers when assessing risks and sources of bias in the project scoping stage.

Additional Resources

Data Collection, Metadata, and Quality Resources

Summary
When data used in an algorithmic model is incorrect, downstream data that relies on that source will also be incorrect. The quality of features derived from that data also suffers, which means the performance of machine learning models will deteriorate during training, decreasing their predictive power. Any downstream consumers of that model will be negatively impacted as well.
Metadata is meant to help contextualize and model the data, and seeks to answer questions regarding the “who, what, when, why, and how” for the dataset.
A dataset’s metadata can describe the following (a minimal example record follows this list):
The intended purpose of the dataset and what it measures;
The population described by the dataset;
The range of column variables used to structure information in the dataset (known as “schema”);
The schedule on which the dataset is refreshed with new data;
The lineage of the dataset that describes from where it was sourced (or whether it is the best primary source describing a person or event);
Who owns the creation, maintenance, quality, and issue resolution responsibilities for the dataset;
The preferred ways of accessing the dataset (e.g., API, database query, CSV file request, streaming data feed, etc.); and
The location of the dataset within the larger data ecosystem (including whether higher-level permissions and restrictions from the system apply to the dataset).
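
These attributes can be captured in a structured record that the vendor shares alongside the dataset. Below is a minimal sketch in Python; the field names and example values are illustrative assumptions, not a formal metadata standard.

```python
# Minimal sketch of a shareable dataset metadata record.
# Field names and example values are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class DatasetMetadata:
    name: str                  # human-readable dataset name
    purpose: str               # what the dataset is intended to measure
    population: str            # who or what the records describe
    schema: list[str]          # column variables that structure the data
    refresh_schedule: str      # how often the dataset is refreshed
    lineage: str               # where the data was sourced from
    owner: str                 # team responsible for quality and issue resolution
    access_methods: list[str]  # e.g., API, database query, CSV export
    restrictions: str          # permissions inherited from the larger data ecosystem

example = DatasetMetadata(
    name="claims_2016_2020",
    purpose="Historical claims used to train the cost-prediction model",
    population="Adult members enrolled between 2016 and 2020",
    schema=["member_id", "age", "diagnosis_code", "claim_amount"],
    refresh_schedule="Monthly",
    lineage="Extracted from the vendor's claims warehouse (primary source)",
    owner="Vendor data engineering team",
    access_methods=["REST API", "CSV export"],
    restrictions="Contains PHI; access limited to HIPAA-trained staff",
)
```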

Sharing metadata details can help establish valuable high-level knowledge about the intended purpose of the dataset, key assumptions used in generating the data that impact its usage, and technical support processes for reporting product quality issues or requesting recovery efforts.
Knowing the metadata information necessary to answer these questions is crucial for the operation of consistent, timely, and resilient AI products.
Potential Questions for Procurement Officers to Ask
What datasets were used during training, and what kind of details are available about them?
How often is the dataset refreshed?
Is there a data retention or expiration policy?
Does the tool have the ability to delete an individual’s data?
What privacy and data protection regulations must be followed?

This phase of the model lifecycle focuses on how the input data for the model is collected, and how vendors prioritize privacy and security within their data management practices. Many regulatory compliance standards have requirements regarding the storage of data, including regular processes for deleting data, encrypting the data as it gets transported between databases, and “right to be forgotten” capabilities for deleting data about individuals. These questions for procurement officers aim to ensure that AI vendors are in compliance with the correct regulatory standards for the intended user group and have adequate documentation for describing the lineage of datasets involved in creating a model.
Relevant Procurement Template Clauses
Provision 4 (Data Quality): “Bidding contractors should provide a set of standards that they will use to define data quality, algorithmic system performance, and evaluation metrics. These standards should be used throughout the duration of the contract to guarantee a minimum level of quality and performance from the system.”
Helpful Tools for This Stage
Materials that describe data quality and metadata can be powerful, portable resources for promoting trust and transparency around the datasets and pipelines that drive AI model creation. A few emerging formats for consumer-friendly metadata reporting include:
Standardized labels for summarizing dataset contents, similar to food nutrition labels.
A tool for building “report cards” assessing datasets/machine learning models, oriented for different types of data users.
An interface for reporting machine learning models’ design, metadata, and limitations that is widely used in industry.
An AI vendor might maintain their own data catalog that records the physical, operational, and team metadata attributes of their data assets. These catalogs can be valuable reports to share with procurement officers for risk assessment, regulatory compliance, and operational maintenance purposes.

Additional Resources

Data Transformation

Summary
As data is prepared for training the AI model, software developers will make decisions about how to interpret or clean the dataset. This can range from deciding on a default value to fill in gaps in the data, to removing unusual rows of data that may have resulted from human error. These decisions can introduce implicit bias through the developer’s judgments about the data and their understanding of the problem space.
As a recent example of how data transformation can introduce bias, an algorithm used to diagnose kidney disease was found to apply a race-based correction factor to adjust for differing baseline levels of creatinine between racial groups. One study found that “removing [the race-based correction factor] would shift the status of 29% of Black patients from having early-stage to advanced disease.” Before this error was corrected, a significant proportion of Black patients experienced delayed care for kidney disease because their condition appeared less severe than it was.
The risk of introducing bias in this way is dependent on the specific context and is best assessed by a multidisciplinary team that can consider perspectives from engineering, social sciences, and legal precedents.
Potential Questions for Procurement Officers to Ask
What kinds of data imputation and transformation logics are being applied to the data?
Are changes in the transformation logic saved between versions for auditing purposes?
How often is the model re-trained?
How do you check for data quality issues like missing, incomplete, or duplicate data?
Are there demographic groups in the intended user population that are missing from the dataset?

This phase of the model lifecycle focuses on transforming the input data into a cohesive, clean form that is ready for model training and prediction usage. The cleaning and transformation process can bake assumptions and biases from the model developers into the dataset. By making decisions about what kinds of default values are used to fill in missing data, or how to handle duplicate data, developers inject their own ideas of “good data” into the model inputs. These questions ask AI vendors to have data-cleaning protocols that try to anticipate sources of bias that may impact the output tool. When vendors document their transformation logic and assumptions, auditors are able to flag potential sources of bias before they cause larger downstream problems.
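
As a minimal illustration of what documented transformation logic might look like, the sketch below assumes a pandas DataFrame with hypothetical patient-record columns; each cleaning step states the assumption it embeds so an auditor can review it.

```python
# Minimal sketch of documented data cleaning, assuming hypothetical column names.
import pandas as pd

def clean_patient_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Assumption: missing blood pressure readings are filled with the median.
    # This pulls imputed patients toward the "typical" case and can mask
    # measurements that are missing in systematic ways.
    df["systolic_bp"] = df["systolic_bp"].fillna(df["systolic_bp"].median())

    # Assumption: duplicate visit rows are data-entry errors; keep the first.
    df = df.drop_duplicates(subset=["patient_id", "visit_date"], keep="first")

    # Assumption: ages outside a plausible range are recording errors.
    # Note that this silently drops any real patients outside the range.
    df = df[df["age"].between(0, 110)]

    return df
```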
Relevant Procurement Template Clauses
Provision 4 (Data Quality): “Bidding contractors should provide a report to [insert organization] on any bias mitigation techniques, ethics checklists, or product requirements documents that have been created during the development of the algorithmic system.
Bids that can demonstrate multi-disciplinary approaches and teams for algorithmic system design, using expertise from engineering, social sciences, and legal professions, will be given a higher rating.”
Helpful Tools
A summary report that includes a data quality assessment, a list of imputation techniques used to fill in missing data, and plain-text explanations regarding the design process.
AI vendors can provide a high-level report that details the scope of missing data and overall data quality in the input datasets. The list of techniques used for filling in missing data (both at a row-level and a column-level), along with assumptions used in setting default values for the missing data, can also be provided. This report should also include plain-text explanations of why the vendor elected to use these transformation and imputation techniques.
Extract, transform, and load (ETL) source code responsible for applying data transformations.
Obtaining access to the technical source code for the tool may not be possible due to intellectual property concerns, but having this level of transparency can enable fine-grained audits of any assumptions and biases embedded during the data transformation stage.

Additional Resources

Model Training

Summary
As the AI model is trained on the input datasets, it develops its own internal logical rules for building predictions. The model takes inputs and calculates the probabilities of the different potential classification options to arrive at the highest-likelihood result. These internal rules can be examined for logical validity in the relevant problem areas (e.g., does a model predicting hypertension risk in patients learn clinically significant rules around blood pressure, height and weight, and age?).
This kind of inspection is known as explainable AI, a subset of algorithm design that seeks to explain the performance and decisions of algorithms to a human in accessible, easy-to-understand language. The ability to explain why an algorithm is making a recommendation, trace the data that inspired the recommendation, and highlight the learned relationships between variables that informed a recommendation are all crucial parts of building trust and accountability with users.
Model explainability is important because it:
Allows models to be understood by users with less technical knowledge;
Provides accountability for regulatory or legal purposes;
Helps identify emerging bias or quality issues in the model; and
Improves trust in the model’s decisions.

Explainable AI is commonly broken down into three levels (a short code sketch of the global and cohort checks follows this list):
Local model explainability is the ability to isolate the factors that have the greatest influence on an individual model decision. This allows a model to expose the pieces of source data that influenced the model’s internal logic to arrive at a prediction when explaining an algorithmic decision. For example, local explainability would be used to explain to a customer why exactly their insurance appeal was rejected by an algorithm.
Cohort model explainability relates to subsets of data that can point to generalizability issues in the model. Using cohort model explainability can help with fairness metric testing, where testing on different cohort groups can verify a model’s ability to generalize across protected demographic attributes without impacting performance.
Global model explainability is the ability to isolate the factors that have the greatest influence across all of the model’s predictions. Global explainability is a powerful step in checking for innate bias in a model. For example, global explainability measures can help isolate the most important features in the model. This can be used to flag features that depend on demographic attributes like race, or proxy variables for demographic attributes, like zip code.
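
As a rough sketch of the global and cohort checks described above, the example below assumes a trained scikit-learn classifier, a held-out test set, and a demographic attribute supplied separately from the model features; the names and values are illustrative.

```python
# Rough sketch of global and cohort explainability checks.
# Assumes a trained scikit-learn classifier and aligned test data.
import pandas as pd
from sklearn.base import ClassifierMixin
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score

def explainability_report(model: ClassifierMixin, X_test: pd.DataFrame,
                          y_test: pd.Series, cohorts: pd.Series) -> None:
    # Global explainability: which features most influence predictions overall?
    result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
    ranked = sorted(zip(X_test.columns, result.importances_mean),
                    key=lambda pair: pair[1], reverse=True)
    for feature, score in ranked:
        print(f"global importance of {feature}: {score:.3f}")

    # Cohort explainability: does accuracy hold up across demographic subsets?
    predictions = model.predict(X_test)
    for group in cohorts.unique():
        mask = (cohorts == group).to_numpy()
        print(f"accuracy for cohort {group}: "
              f"{accuracy_score(y_test[mask], predictions[mask]):.3f}")
```

Local explainability typically requires additional per-prediction attribution tooling and is not shown in this sketch.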

Certain types of algorithmic models are more explainable than others because their internal logical rules are easier for humans to understand in plain language. An algorithm built with decision trees learns internal rules like “if the patient is older than 45 years and male, recommend a colonoscopy.” Other algorithms, such as the neural networks used in deep learning, may have complex internal logical rules that are not easily interpretable by humans. AI developers can make different decisions about the types of models they use based on the explainability requirements of the project.
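
To make the decision-tree example concrete, the sketch below trains a small tree on randomly generated data and prints its internal rules in near-plain language; the feature names are hypothetical and used only for illustration.

```python
# Minimal sketch: printing a decision tree's rules in readable form.
# The data is synthetic and the feature names are hypothetical.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = ["age", "is_male", "family_history"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Prints nested rules such as "|--- age <= 0.42" that a reviewer can read directly.
print(export_text(tree, feature_names=feature_names))
```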
Potential Questions for Procurement Officers to Ask
Does the vendor conduct vulnerability analysis for its underlying open source software packages?
Is the AI model using algorithms that allow for explainable predictions?

This phase of the model lifecycle examines how the selection of algorithm types can impact the transparency and security of the end model. Model training is often an automated process in which a wide variety of training configurations are explored in search of the best performance. Managing and storing these training configurations safely is crucial for being able to inspect the internal logic of the models and ensure consistency of the configurations over time. The questions above aim to ensure that AI vendors are considering the transparency requirements needed to audit, trust, and utilize the model. They also ask vendors to follow safe software practices by performing regular vulnerability analysis of the software packages used in building their tools and by having patching processes for safely upgrading to new versions for security and performance improvements.
Relevant Procurement Template Clauses
Provision 5 (Transparency): “[insert organization] should be able to provide a public-facing explanation of the algorithmic system's purpose and operations in plain language. The Contractor should enable the algorithmic system to be able to accommodate explainability of the entire system, as well as individual predictions.
This may include providing the highest-impact variables influencing a decision. The ability to view this explanation should be provided as an option along with the corresponding prediction.”
Helpful Tools for This Stage
A list of the most influential variables used in training the model (global explainability) can help verify that the model is making predictions using sensible logic. The ability to give an accounting of the most influential variables tied to a prediction (local explainability) will help users understand and trust the model recommendations.
A Software Bill of Materials (SBOM) report lists the open-source and commercial software components used by the AI tool. Identifying the underlying packages will give procuring organizations the ability to perform vulnerability analysis on their purchased tools and monitor for potential cybersecurity issues (a minimal inventory sketch follows this list).
Source code for the model training pipeline.
Obtaining access to the technical source code for the tool may not be possible due to intellectual property concerns, but having this level of transparency can allow fine-grained audits of the tool’s functionality.
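
A full SBOM would normally follow a standard format such as SPDX or CycloneDX, but as a minimal sketch, a rudimentary inventory of the Python components installed in the tool's environment could be generated as below and cross-referenced against vulnerability databases.

```python
# Minimal sketch of a rudimentary component inventory (not a formal SBOM standard).
from importlib.metadata import distributions

inventory = sorted({(dist.metadata["Name"], dist.version) for dist in distributions()})
for name, version in inventory:
    print(f"{name}=={version}")
```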

Additional Resources

Model Validation

Summary
A key component of algorithmic accountability is the question of how to measure the model’s predictive performance and fairness. How do you know when a model has crossed the threshold from acceptable to unacceptable performance, and when its use should be discontinued?
Model validation is a crucial phase of development that serves as a quality assurance check on the performance of the model along different dimensions of accuracy and consistency. Accuracy metrics can be customized to fit the problem being solved, whether that means prioritizing the ability to correctly predict positive cases, to correctly predict negative cases, or to strike a balance between the two. A model developer can choose to make tradeoffs between different priorities in order to maximize the performance that best suits the problem. In the same way, consistency metrics focus on ensuring that predictions stay consistent over time across similar input data. Predictions should not vary wildly for cases that should produce the same (or similar) results.
Performance metrics for the model can also be extended into fairness and equity monitoring. The idea of prioritizing tradeoffs in performance according to user priorities also applies to fairness; a user might prefer slightly lower accuracy rates in the model if it ensures that the model has equal performance across legally protected attributes like race, gender, housing status, and religion. This is crucial for ensuring that the model does not violate anti-discrimination laws.
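
As one deliberately simplified example of this kind of check, the sketch below compares true positive rates across groups defined by a hypothetical protected attribute; equalized true positive rates are only one of many possible fairness criteria, and the gap threshold shown in the usage comment is an illustrative assumption.

```python
# Simplified sketch of a group fairness check: compare true positive rates
# across a protected attribute. Column names and thresholds are illustrative.
import pandas as pd

def true_positive_rate_by_group(y_true: pd.Series, y_pred: pd.Series,
                                groups: pd.Series) -> pd.Series:
    rates = {}
    for group in groups.unique():
        positives = (groups == group) & (y_true == 1)  # actual positives in this group
        rates[group] = (y_pred[positives] == 1).mean()
    return pd.Series(rates, name="true_positive_rate")

# Usage: flag the model if the gap between the best- and worst-served groups
# exceeds a threshold agreed upon in the procurement contract, e.g.:
# tpr = true_positive_rate_by_group(y_true, y_pred, df["race"])
# assert tpr.max() - tpr.min() < 0.05
```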
Potential Questions for Procurement Officers to Ask
What fairness metrics monitoring does your product support?
What are the emergency stop conditions for retiring a model that is failing monitoring and validation standards?

This phase of the model lifecycle is meant to be a quality assurance check that ensures the performance of the model is robust and equitable. This is conducted through defining the right validation metrics, as well as the metric thresholds for determining when a model is in a usable or unusable state. Being able to measure the right results has to be followed by an action plan for resolving poor performance in a tool, and there should be a defined threshold for triggering a stop clause for the vendor model.
Relevant Procurement Template Clauses
Provision 2 (Regulatory Compliance): “The algorithmic system should be built to comply with U.S. anti-discrimination laws governing protected attributes like race, gender, religion, and sexuality. The selected Contractor should be able to provide evidence of satisfactory compliance with anti-discrimination standards before system deployment and upon request by [organization].”
Provision 6 (Monitoring & Evaluation): “Bids that have technical architecture proposals incorporating fairness metrics monitoring and algorithmic bias prevention measures will be favored.”
Helpful Tools for This Stage
AI vendors may have their own internal methodologies for testing the resiliency, security, and fairness of their products. Procurement officers can ask vendors to share “red team” reports that describe their own attempts to hack the software or manipulate the predictions to unfairly favor certain groups. This can consist of a high-level report that describes the “red team” exercise methodology, as well as the successes, failures, and recommendations from the exercise results. Recurring “red team” exercises may also be a stipulation that procuring officers request of AI vendors for due diligence purposes.
Procurement officers can ask AI vendors to share the validation results performed internally for product quality assurance. This can consist of a confusion matrix that measures performance across protected demographic attributes (e.g., race, gender, sexuality, or age), or some combination of F-scores, accuracy, or precision and recall metrics.
An API for making test predictions against sensitive demographic attributes.
AI vendors can expose an API to their tool that allows the procuring organization to test their own sample data and have the ability to independently verify the performance metrics of the tool, rather than having to rely on metrics supplied by the vendor.
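
A minimal sketch of such an independent check appears below; the endpoint URL, authentication scheme, and response field are hypothetical assumptions, since the real interface would be defined by the vendor and the contract.

```python
# Minimal sketch of independently verifying a vendor model's accuracy.
# The endpoint, auth scheme, and "prediction" response field are hypothetical.
import requests
from sklearn.metrics import accuracy_score

VENDOR_API_URL = "https://vendor.example.com/v1/predict"  # hypothetical endpoint

def verify_vendor_accuracy(records: list[dict], labels: list[int], api_key: str) -> float:
    predictions = []
    for record in records:
        response = requests.post(
            VENDOR_API_URL,
            json=record,
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        response.raise_for_status()
        predictions.append(response.json()["prediction"])  # assumed response field
    return accuracy_score(labels, predictions)
```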

Additional Resources

Model Deployment

Summary
Once a model has been validated and approved for deployment, there should be a plan for ensuring it stays in compliance with contract standards for performance and fairness.
A monitoring system for tracking the data, model predictions, and real-world user impact should be in place so that both the procuring organization and the vendor can evaluate the system in real time. The moment a model is deployed to the real world, its performance begins to slowly degrade, similar to how a car’s value depreciates as it is used over time. This degradation is often called “model drift,” and it occurs because the data initially used to train the model grows stale and increasingly diverges from the new data being used for predictions. Changes in patient behavior, population rates, and other social trends can all cause present-day data to slowly drift away from training data that was collected in the past.
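
As a simplified illustration of how drift can be monitored, the sketch below compares the distribution of each numeric feature in recent production data against the original training data using a two-sample Kolmogorov-Smirnov test; the p-value cutoff is an illustrative choice, not a prescribed standard.

```python
# Simplified sketch of drift detection between training and production data.
import pandas as pd
from scipy.stats import ks_2samp

def drifted_features(train: pd.DataFrame, recent: pd.DataFrame,
                     p_value_cutoff: float = 0.05) -> list[str]:
    flagged = []
    for column in train.select_dtypes("number").columns:
        statistic, p_value = ks_2samp(train[column].dropna(), recent[column].dropna())
        if p_value < p_value_cutoff:  # distributions differ more than chance suggests
            flagged.append(column)
    return flagged
```

Features flagged this way can feed the alert thresholds and rollback triggers described in the deployment tools below.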
The process of defining monitoring metrics is also a critical component of tracking the signals that can flag potential issues in a model. The best metric is one that directly monitors the desired result of implementing an AI recommendation in real time, rather than a proxy metric that merely approximates the outcome the model is meant to measure.
As a real-world example with deep patient care impacts, a widely used healthcare algorithm that predicted patients’ healthcare needs was trained on a faulty proxy metric. The algorithm caused Black patients to not receive the same level of care as white patients with the same disease burdens. This occurred because the algorithm used patients’ cost utilization as a proxy metric for healthcare needs, rather than a metric directly measuring the severity and complexity of a patient’s illness. The algorithm ended up ignoring the obstacles that prevent patients from accessing or purchasing healthcare treatment for their needs, skewing its predictions against patients who were already struggling. A proxy metric introduces new assumptions into the evaluation of a model’s performance and can be a critical vulnerability in building trust and transparency for a deployed AI tool.
Potential Questions for Procurement Officers to Ask
Would the vendor be willing to undergo a regular third-party audit conducted under an NDA?
Is there a monitoring plan for measuring changes in the input data and model performance metrics?
What kinds of user training will be supplied by the vendor?
How will changes in the software be communicated to users by the vendor, and how much advance notice will be given for that disclosure?

This phase prioritizes safe and continuous monitoring of the tool while it is being used. The goal is to ensure that the model stays in compliance with the user’s needs and standards over time, beyond the initial quality assurance validation. The questions above allow a procurement officer to check for a deployment plan that incorporates regular audits and metrics monitoring for identifying worsening performance issues; user disclosure plans for the safe application of patches and upgrades to the tool; and training workshops that ensure users are working with the tool correctly.
Relevant Procurement Template Clauses
Provision 5 (Transparency): “The selected Contractor will notify [insert organization] 30 days in advance of new changes being made to a deployed algorithmic system. This notification should include a change log of new, deprecated, and updated functionalities. For major changes, a report on the testing process for assuring business continuity of the algorithmic system should be provided by the bidding Contractors.
Bids that can incorporate an algorithm change protocol (ACP) document (as recommended by the Food & Drug Administration for software as medical devices) will be given preference. These ACPs should detail the anticipated changes that will be made to the algorithmic system during the contract period.”
Provision 6 (Monitoring & Evaluation): “The algorithmic system should involve human-in-the-loop decision-making based on the risk to patients in the contract use case. If the system is involved in medical decision-making on behalf of a patient, a healthcare professional should be involved in evaluating whether individual recommendations made by the system are used or not.
A third-party auditing organization should conduct regularly scheduled assessments on the provisions listed in this document in order to ensure compliance. These audits will be paid for by [insert organization] in order to minimize conflicts of interest.”
Helpful Tools for This Stage
A monitoring plan with alert thresholds for key metrics, and a rollback plan in case model performance degrades to unacceptable levels.
AI vendors can report the average degradation rate in the model's accuracy performance. This allows the procuring organization to make estimates about the anticipated timeline for re-training AI models and model deployments.
AI vendors can share a list of metrics used in their monitoring systems and give procuring organizations direct visibility into the health of their tool to monitor performance levels. A procuring organization can design a dashboard that organizes these metrics and sets alerts based on the agreed-upon performance metric thresholds.
AI vendors can share their designated rollback plan in case of tool failures or security risks and provide a designated technical contact for support.
Third-party audits for verifying continued compliance of the model with initial contract conditions.
Including contractual terms that allow a third-party auditor to regularly assess an AI vendor’s tool for performance quality, underlying system stability, equity and bias flags, and regulatory compliance can help assess whether a vendor is meeting their contracted performance standards.
Design history files (DHFs) were originally used by the FDA to track changes made to a device across its entire design history. The AI model designer can maintain a similar file, appending every change made to the tool onto the file’s history.
The algorithm change protocol (ACP) is the most recent extension of this concept, as it asks AI model designers to also include anticipated future changes that will be made to a device. The ACP allows for a device to evolve over time and was adapted to govern algorithms as medical devices more effectively. Procuring organizations can ask AI vendors to submit DHFs and ACPs in order for users to be able to track and anticipate changes to the AI tool.

Additional Resources
