# Training Algorithms

This section explains how to identify areas of racial bias in the way you train your algorithm. Before you proceed, make sure you have reviewed the previous section of this toolkit. After you have investigated and mitigated or disclosed any potential areas of bias in your dataset, you’re ready to train your algorithm! This section also references a technical module that demonstrates the various statistical fairness metrics mentioned below.
Remember that with supervised machine learning, you give your algorithm both a dataset (which you cleaned in our previous section) and a task. Separate from biases in the dataset itself, the way you teach your algorithm to get better at this task can introduce various types of bias, both from the existing education context and from your team’s own biases or blind spots. This is why the blind spot activities are so important. In this section, we’ll highlight items to consider as you select key aspects of your machine learning algorithms for use in your education products.
After this section, you'll take a look at your .

# 🎯 Goals

- Consider the values of your school customers, Black and Brown students, and their families
- Train your algorithms with equity in mind
- Design thresholds to monitor for racial inequity
- Make a plan for how and how often you’ll update your models

# 🚧 Caution!

- Make sure your optimization functions for your algorithms align with the values of students and teachers who will use your products
- Make sure schools and families are comfortable with the features most important to your algorithm, as well as your approach to statistical "fairness"
- Constantly revisit these practices as your algorithm "improves" with new data over time
- Follow the industry best practice of allocating 25% of engineering time to maintenance, and explicitly prioritize equity during maintenance and evaluation

# ✅ Activities for Training Algorithms

## Activity 1: Further Refine Your Selected Features

See the previous section on how to remove biased features and features correlated with them. While training your model, you may find that some features are weighted heavily by the model in undesirable ways. Consider using a feature-importance technique to see which features your ML model considers most important in making decisions, and check whether these line up with your values. We encourage you to disclose to schools the list of considered features and their importance in the model for full transparency. If schools have concerns about the features you used, consider the unintended consequences your product might create. This might feel frustrating, but it’s imperative that your users have an active voice in the logic behind your algorithm.
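One common way to inspect which features a model leans on is permutation importance: shuffle one feature at a time and measure how much the model's accuracy drops. The sketch below is a minimal, standard-library illustration of that idea. It is not necessarily the technique this toolkit originally linked to, and `toy_model`, the feature names, and the data are all hypothetical.

```python
import random

def accuracy(model, rows, labels):
    """Share of rows the model labels correctly."""
    correct = sum(1 for row, label in zip(rows, labels) if model(row) == label)
    return correct / len(labels)

def permutation_importance(model, rows, labels, feature_names, n_repeats=20, seed=0):
    """Average drop in accuracy when each feature's values are shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = {}
    for name in feature_names:
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[name] for row in rows]
            rng.shuffle(shuffled)
            permuted = [{**row, name: value} for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances[name] = sum(drops) / n_repeats
    return importances

# Hypothetical toy model: flags a student based only on prior suspensions.
def toy_model(row):
    return 1 if row["suspensions"] >= 2 else 0

rows = [
    {"suspensions": 0, "quiz_avg": 0.9},
    {"suspensions": 3, "quiz_avg": 0.8},
    {"suspensions": 2, "quiz_avg": 0.4},
    {"suspensions": 0, "quiz_avg": 0.3},
    {"suspensions": 1, "quiz_avg": 0.7},
    {"suspensions": 4, "quiz_avg": 0.6},
]
labels = [toy_model(row) for row in rows]  # toy labels the model fits perfectly

importances = permutation_importance(toy_model, rows, labels, ["suspensions", "quiz_avg"])
for name, drop in sorted(importances.items(), key=lambda item: -item[1]):
    print(f"{name}: average accuracy drop {drop:.2f}")
```

In this toy case the model depends entirely on `suspensions`, a feature that is itself shaped by biased discipline practices. A result like that is exactly the kind of finding to raise with schools before shipping.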

Consider the ways in which your top features may be subject to bias. How was the data behind this feature recorded? What structural or systemic bias influences it? Very few data points are actually objective, especially in education data. For example, the way questions are worded or a student’s familiarity with technology can often affect the data that gets recorded. Studies show that the views white teachers (who make up 80% of the teaching force) often hold of Black students are reflected in grades data, which is commonly thought to be objective. Even more subjective, discipline records reflect the reality that Black students are suspended at significantly higher rates than white students for the same behavior. Rather than thinking of data as “subjective” or “objective”, think of it as more or less susceptible to bias. If any of your features are derived from human interaction within your product (for example, the way a teacher determines mastery levels for a student after completing a module), these are easy places for bias to enter your algorithm. Over time, this bias can become very hard to distinguish from the “logic” your algorithm determines.
Talk with schools, students, and families to get a complete understanding of what factors may influence the data points that play an influential role in your algorithms. While you may not be able to address the underlying bias in your algorithm, this understanding should influence the way you design your product and the way your product is implemented in schools, including the way educators are trained to use your product.
## Activity 2: Minimize Potential Bias

You need a model that not only optimizes for your outcome but also minimizes what could go wrong. This evaluation will be different for every model in every scenario. As with all machine learning, evaluate your precision and recall scores, but with racial bias in mind.
Precision is the share of flagged people who actually have the characteristic you are looking for, so it tells you how often a flag is a false positive. For example, how often do you flag a student as a dropout risk who isn’t actually a dropout risk? Or how often do you push a student up to the next reading level when they aren’t actually ready?

Recall is the share of people with the characteristic who are actually flagged, so it tells you how often you produce false negatives. For example, how often do you overlook a student who is a dropout risk? Or how often do you keep a student at a current reading level who is ready for a more advanced book? Depending on your scenario, you may find recall more important than precision or vice versa.
Low precision means your algorithm is more likely to have a high rate of false positives (for example, incorrectly identifying a student as at-risk of bringing a gun to school), whereas low recall would lead to a high rate of false negatives (for example, incorrectly labeling a student as NOT at-risk of bringing a gun to school, when they are actually at high risk). The scenario behind your algorithm influences whether false negatives or false positives are more dangerous, and how to trade off between precision and recall when optimizing your algorithm. In this example, you might say false negatives (not identifying a student as at-risk when they are) are significantly more dangerous than accidentally flagging a student for investigation who is not actually at-risk. Better safe than sorry, right? This would lead you to maximize recall and accept some loss of precision.
However, it's not that simple. Precision and recall can be different for specific subsets of your population. For example, if precision is lower for Black students than for others due to numerous blind spots and data impacted by racist discipline policies, Black students would be wrongly identified as at-risk more often than white students, even though they are not more at-risk of bringing a gun to school. This can reinforce existing racial bias and its long-term impacts on Black children. You should investigate fairness metrics for your population as a whole and for subsets of the population based on sensitive data like race and income.
Even if you maximize both precision and recall, the distribution of errors still matters. That is, even if the percentage of errors in your model is low, the errors that do exist may disproportionately impact a specific group of students. For example, if you only misidentified 10 students as at risk out of 10,000 flagged students, this is a low error rate. But if all 10 students were Black (and the total population of students is not all Black), your model is still problematic.
This can help measure these characteristics of your model.
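To make the per-group comparison described above concrete, here is a minimal sketch that computes precision, recall, and the raw count of false positives separately for each group. The record layout `(group, actual, predicted)` and the group names are assumptions made for illustration. A real pipeline would read these from your own evaluation data.

```python
from collections import defaultdict

def group_precision_recall(records):
    """records: iterable of (group, actual, predicted) tuples with 0/1 labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for group, actual, predicted in records:
        c = counts[group]  # ensures every group appears, even with no errors
        if predicted == 1 and actual == 1:
            c["tp"] += 1
        elif predicted == 1 and actual == 0:
            c["fp"] += 1
        elif predicted == 0 and actual == 1:
            c["fn"] += 1
    report = {}
    for group, c in counts.items():
        flagged = c["tp"] + c["fp"]
        positives = c["tp"] + c["fn"]
        report[group] = {
            "precision": c["tp"] / flagged if flagged else None,
            "recall": c["tp"] / positives if positives else None,
            "false_positives": c["fp"],  # where the errors land matters too
        }
    return report

# Hypothetical evaluation records: (group, actually at risk, flagged by model)
records = [
    ("group_a", 1, 1), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_a", 0, 0), ("group_a", 1, 0),
    ("group_b", 1, 1), ("group_b", 0, 1), ("group_b", 0, 1),
    ("group_b", 0, 0), ("group_b", 1, 1),
]

report = group_precision_recall(records)
for group, metrics in report.items():
    print(group, metrics)
```

Here every false positive falls on `group_b`, whose precision is half that of `group_a`. Pooling the two groups into a single overall score would hide that pattern.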

## Activity 2: Use Statistical Fairness Metrics to Evaluate Equity

Statistical fairness metrics aim to quantify different kinds of bias in ML models. This is a very active field with which your team should become familiar. Various definitions of statistical fairness subscribe to different ethical and philosophical definitions about what it means to be “fair”, a complicated socio-ethical question that is exacerbated by algorithm-based decision making. For example, statistical fairness methods may or may not minimize disparate treatment versus disparate impact. Some people may feel a product is equitable so long as there isn’t “disparate treatment”, meaning that a group of white students and a group of Black students are treated the same by the product. However, other definitions of fairness such as individual fairness (fairness through awareness) would also evaluate “disparate impact”, meaning that the product should have the same impact on (or outcomes for) white students and Black students. You can use the results from our statistical fairness calculators when evaluating your model or to guide your modifications. It’s important to note that many approaches to statistical fairness require collection of sensitive data in order to compare results across different demographics.

The right equality metric depends on the scenario at hand and the relevant action that will result. If an algorithm is being used simply to communicate a statistical likelihood of something, that can be taken at face value. But if the algorithm is used to determine the intervention or support a student receives, the schools’ values should be considered. Your team should all be on the same page, along with the schools you serve, about the statistical fairness metrics to which you subscribe. For example, although you should ideally score well across all types of statistical fairness, each “kind” of fairness will have different implications for your specific educational context.
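As a rough illustration of how two common fairness definitions can be computed, the sketch below calculates a demographic parity difference (the gap in how often each group receives a positive prediction) and an equal opportunity difference (the gap in true positive rates). These two metrics are chosen here as an assumption; the toolkit's own calculators may use others. The record layout and group names are hypothetical.

```python
def selection_rate(records, group):
    """Share of a group's members who receive a positive prediction."""
    predictions = [pred for g, _, pred in records if g == group]
    return sum(predictions) / len(predictions)

def true_positive_rate(records, group):
    """Share of a group's actual positives who receive a positive prediction."""
    predictions = [pred for g, actual, pred in records if g == group and actual == 1]
    return sum(predictions) / len(predictions)

def demographic_parity_difference(records, group_a, group_b):
    return selection_rate(records, group_a) - selection_rate(records, group_b)

def equal_opportunity_difference(records, group_a, group_b):
    return true_positive_rate(records, group_a) - true_positive_rate(records, group_b)

# Hypothetical records: (group, actually ready for advanced content, recommended by model)
records = [
    ("group_a", 1, 1), ("group_a", 1, 1), ("group_a", 0, 1), ("group_a", 0, 0),
    ("group_b", 1, 1), ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 0, 0),
]

dp_gap = demographic_parity_difference(records, "group_a", "group_b")
eo_gap = equal_opportunity_difference(records, "group_a", "group_b")
print(f"demographic parity difference: {dp_gap:.2f}")
print(f"equal opportunity difference: {eo_gap:.2f}")
```

A value of zero means the two groups are treated identically under that definition. The two metrics can disagree with each other, which is why your team and the schools you serve need to agree on which one matters for a given decision.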

## Activity 3: Determine the Size of Your Model: One Global Model vs. Many Small Models per School, District, or State

It depends! Training one global model requires access to a comprehensive and representative dataset, which can be very difficult to obtain in education and has additional costs associated with it. However, when you train multiple smaller models, you still need to assess for differences in outcomes across the models, as these differences can lead to unintended discrimination where students are treated differently at different schools or in different population types. That being said, if you end up with significantly different models across schools or districts, having more than one model will ensure you don’t lose predictive power for the nuanced student needs of each population. Research shows that education models developed on certain student demographics do not apply well to a different demographic population. We recommend trying both to see how they perform. Discuss concerns with your team and ideally also with researchers, and make sure the perspectives of all team members and experts are not only heard but also addressed.
Use the comparison below for this activity. It weighs each approach in terms of scope and management.

Model Approach: Pros and Cons

One Global Model

- Less likely to over-fit (must optimize for many different types of students, schools, etc.)
- More likely to under-fit (must optimize for many different types of students, schools, etc.)
- More easily managed by a single team
- Requires a large, representative dataset
- Requires more work to account for biases and assumptions in data
- More difficult to identify when model updates are required and determine whether they are justified across a broad, diverse population

Multiple Small Models

- Less likely to under-fit (model must optimize for individual, unique subsets of the data)
- More likely to over-fit to specific populations (due to its focus on smaller, specific populations)
- Gives team the opportunity to better customize model to individual student and school needs
- Requires additional engineering resources to maintain and monitor models for each school, during initial development as well as ongoing support

Other Considerations

- Allows schools without ample historical data to still benefit from your product
- Can update individual models without worrying about impact to other schools using the same product
- Limits reach only to schools or populations with ample data to train a relevant model

- How many engineers will we dedicate to monitor and maintain our model(s)? Only commit to as many models as you can adequately maintain for the next two school years with known engineering resources. If you aren’t sure, be conservative.
- To how many schools are we planning to roll out these model(s)? Consider your expectations for the next two school years. If you plan to greatly expand the number of schools you work with, you will need to revisit not only your models but also these decisions as you grow.
- How confident are we that ML is the right tool for our problem? If uncertain, you might want to try one model and assess its validity and value with a small school population before trying to customize models across many schools.
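The recommendation to try both approaches and compare them can be sketched as follows. This is a deliberately tiny illustration, assuming a one-feature "model" that is just a score cutoff. `fit_threshold`, the school names, and the data are hypothetical stand-ins for your real training and evaluation code.

```python
def accuracy(cutoff, examples):
    """Share of (score, label) examples classified correctly by `score >= cutoff`."""
    return sum((score >= cutoff) == bool(label) for score, label in examples) / len(examples)

def fit_threshold(examples):
    """Pick the cutoff (among observed scores) with the best training accuracy."""
    candidates = sorted({score for score, _ in examples})
    return max(candidates, key=lambda cutoff: accuracy(cutoff, examples))

def compare_approaches(train_by_school, test_by_school):
    """Score one global model and one model per school on each school's held-out data."""
    all_train = [ex for examples in train_by_school.values() for ex in examples]
    global_cutoff = fit_threshold(all_train)
    results = {}
    for school, test in test_by_school.items():
        local_cutoff = fit_threshold(train_by_school[school])
        results[school] = {
            "global": accuracy(global_cutoff, test),
            "local": accuracy(local_cutoff, test),
        }
    return results

# Hypothetical data: the same score means different things at the two schools.
train_by_school = {
    "school_x": [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1)],
    "school_y": [(0.5, 0), (0.7, 0), (0.85, 1), (0.9, 1)],
}
test_by_school = {
    "school_x": [(0.3, 0), (0.7, 1)],
    "school_y": [(0.65, 0), (0.95, 1)],
}

results = compare_approaches(train_by_school, test_by_school)
for school, scores in results.items():
    print(school, scores)
```

Reporting results per school, rather than one pooled number, is the point of the exercise: a global model can look fine on average while underperforming for one population.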

# ✅ Activities for Improving Algorithms

Even if you’ve designed the perfect model, you will always need to continuously update and retrain it. Whenever you receive new data, update your models, ideally through an automated process. You should also track your models’ performance to ensure that there are no performance regressions. It is up to you to define what your fairness metrics look like; however, at a basic level, precision and recall across protected classes like race and income should not degrade over time. Alerts (discussed below) will also indicate a need to change your model before it negatively impacts a student. Refer back to the Activities for Training Algorithms above for debugging tips and ways to identify and address bias in your model.
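A regression check of this kind can be sketched as a comparison between the per-group metrics of the previous model and those of a retrained candidate. The metric values, group names, and the `tolerance` of 0.02 are illustrative assumptions rather than recommended settings.

```python
def find_regressions(previous, current, tolerance=0.02):
    """Return (group, metric, old, new) for each metric that dropped by more than `tolerance`."""
    regressions = []
    for group, old_metrics in previous.items():
        for metric, old_value in old_metrics.items():
            new_value = current[group][metric]
            if old_value - new_value > tolerance:
                regressions.append((group, metric, old_value, new_value))
    return regressions

# Hypothetical per-group metrics for the deployed model and a retrained candidate.
previous = {
    "group_a": {"precision": 0.90, "recall": 0.85},
    "group_b": {"precision": 0.88, "recall": 0.84},
}
current = {
    "group_a": {"precision": 0.91, "recall": 0.85},
    "group_b": {"precision": 0.80, "recall": 0.83},
}

regressions = find_regressions(previous, current)
for group, metric, old, new in regressions:
    print(f"{metric} for {group} fell from {old:.2f} to {new:.2f}")
```

In this example the candidate improves slightly for one group while precision drops noticeably for the other, so a check on overall metrics alone could let the update through.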
## Activity 1: Check for Bias in the Way You Improve Your Algorithm Over Time

You can use the chart below to map out possible problems with the data points you plan to use to improve your algorithms.

Feedback

| Data Point | What does this data tell us? | How should it influence our algorithm? | Would others agree? |
| --- | --- | --- | --- |
| Teacher assigns a student content at a lower grade level than the student has been assigned in your app. You might use this information to improve your algorithm over time. | We interpret this to mean that the teacher believes the student did not understand the prior material. | Our algorithm should incorporate this data to learn over time that students with this history and profile progress slower than we previously expected. Note that algorithms could learn this for a specific student population (for example, for all Black students), which would eventually amplify biases that already exist in the classroom. | In some cases, this might be helpful and truly improve the algorithm. However, when discussing with teachers, you might learn that there were other explanations for why the teacher assigned a lower level of content. For example, the teacher might have thought the newly assigned content was simply more engaging or relevant for a given student, or the teacher might not have known that the new content was of a lower reading level at all. |

## Activity 2: Update Your Models Equitably

If you choose to use machine learning in your product, you’ll need to update models when things go wrong. Being able to update your model at least once a week is critical, but students need WiFi access at the same frequency for patch delivery systems to work. If you’ll need to continuously update a model in your app, make sure schools are aware of the need for reliable WiFi and the impact on the experience when students are not regularly connected to the internet.
School administrators prefer not to update apps during the school year, in order to minimize disruptions to the learning experience. We recommend you cache the relevant machine learning models on the device so that the app can update them whenever it connects to the internet without requiring a full app update.
Track which versions your users are on. If more than 10% of your users are regularly on older versions, a large chunk of students are using less optimal versions of your product. It is likely that the students not using the newest versions are those with fewer resources. Warn users within the app if it has not connected to the internet after a set number of days.
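The version tracking and offline warning described above might look something like this sketch. The device identifiers, version strings, and the seven-day default are hypothetical. Only the 10% figure comes from the text.

```python
from datetime import date

STALE_SHARE_LIMIT = 0.10  # the 10% figure from the guidance above

def stale_version_share(device_versions, latest_version):
    """Share of devices running anything other than the latest model version."""
    stale = sum(1 for version in device_versions.values() if version != latest_version)
    return stale / len(device_versions)

def needs_offline_warning(last_sync, today, max_days_offline=7):
    """True when a device has gone longer than `max_days_offline` without syncing."""
    return (today - last_sync).days > max_days_offline

# Hypothetical fleet: eight devices on the latest model, two on an older one.
device_versions = {f"device_{i}": "v3" for i in range(8)}
device_versions.update({"device_8": "v2", "device_9": "v2"})

share = stale_version_share(device_versions, latest_version="v3")
if share > STALE_SHARE_LIMIT:
    print(f"{share:.0%} of devices are on an older model version")

print(needs_offline_warning(date(2024, 3, 1), today=date(2024, 3, 12)))
```

Breaking the stale share down by school would also show whether the students left behind on older versions are concentrated in schools with fewer resources.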
Depending on the algorithm you’re working with, the timing of updates can impact student outcomes. You should always evaluate student performance on the same activity with the same ML model. For example, if you update a model before some students have completed their work, they may get penalized by a harsher model or evaluated by a less stringent one. When this happens, it becomes difficult to compare students with one another. This gets complicated when students progress at different rates, so make sure your system is capable of delivering updates in a way that evaluates all students by the same model for a given evaluation.
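One possible way to keep every student on the same model for a given evaluation is to pin the model version when the evaluation is first opened and reuse that version for everyone who takes it later. The class and identifiers below are hypothetical, and a real system would persist the pins server-side rather than in memory.

```python
class EvaluationVersionPins:
    """Remembers which model version each evaluation was first opened with."""

    def __init__(self):
        self._pins = {}

    def version_for(self, evaluation_id, latest_version):
        # The first request for an evaluation fixes its model version;
        # later model releases do not change it for that evaluation.
        return self._pins.setdefault(evaluation_id, latest_version)

pins = EvaluationVersionPins()

early_student = pins.version_for("unit_3_quiz", latest_version="v1")
# A new model ships while some students are still working on the quiz.
late_student = pins.version_for("unit_3_quiz", latest_version="v2")
next_quiz = pins.version_for("unit_4_quiz", latest_version="v2")

print(early_student, late_student, next_quiz)
```

Students who finish the quiz after the release are still scored by the version their classmates used, while new evaluations pick up the updated model.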

## Activity 3: Monitor Continuously for Inequities

You can use internal metrics to track whether bias is entering your product’s automated decision processes. Set up metric thresholds that would indicate something went wrong. For example, you could set a predefined threshold alert if the percentage of Black or Brown students recorded as having behavioral problems is a standard deviation away from the percentage of Black or Brown students in the total population. When these metrics fall out of predefined bounds, alert your team to fix the problem immediately. This should be a continuous process to ensure your product remains equitable over time. If you don’t have access to racial data, use proxies. This is an important part of knowing what impact your product actually has.
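The threshold alert in the example above could be sketched as follows. The text does not say which standard deviation to use, so this sketch assumes the binomial standard error of a proportion, given the group's population share and the number of flagged students. The function name, the counts, and the default `max_z` of one standard error are illustrative.

```python
import math

def disparity_alert(flagged_in_group, flagged_total, population_share, max_z=1.0):
    """Compare a group's share of flags with its share of the population.

    Returns (alert, z), where z is the gap measured in binomial standard errors.
    Assumes 0 < population_share < 1 and flagged_total > 0.
    """
    flagged_share = flagged_in_group / flagged_total
    standard_error = math.sqrt(population_share * (1 - population_share) / flagged_total)
    z = (flagged_share - population_share) / standard_error
    return abs(z) > max_z, z

# Hypothetical counts: the group is 30% of students but 45% of behavior flags.
alert, z = disparity_alert(flagged_in_group=90, flagged_total=200, population_share=0.30)
if alert:
    print(f"Disparity alert: flagged share is {z:.1f} standard errors above population share")

# A flagged share close to the population share stays within bounds.
calm, _ = disparity_alert(flagged_in_group=62, flagged_total=200, population_share=0.30)
print(calm)
```

The threshold itself is a values decision: a tighter `max_z` raises more alerts for your team to investigate, while a looser one risks missing real inequities.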
Talk to your users and ask them what concerns they have, and monitor for these. What biases do they see in their existing systems? Set and monitor thresholds accordingly. When implemented in partnership with students and teachers, this can be a way for technology to help schools correct existing racial bias.

## 🎯 Conclusion: Teach Your Algorithms Well

As you know, algorithms are not unbiased! In addition to the data you use to train your algorithm, the metrics you choose to optimize reflect the set of values to which your team subscribes. Make sure the schools and families you work with would subscribe to these same values, and put checks in place to test for unintended racial or socioeconomic bias in your models and their outputs.