section of this toolkit. After you have investigated and mitigated or disclosed any potential areas of bias in your dataset, you’re ready to train your algorithm! This section also references a technical module that demonstrates the various statistical fairness metrics mentioned below.
Remember that with supervised machine learning, you give your algorithm both a dataset (which you cleaned in our previous section) and a task. Separate from biases in the dataset itself, the way you teach your algorithm to get better at this task can introduce various types of bias, both from the existing education context and from your team’s own biases or blind spots. This is why the blind spot activities during
, on how to remove biased features and features correlated with them. While training your model, you may find that some features are weighted heavily by the model in undesirable ways. Consider using a technique called
to see which features your ML model considers most important in making decisions, and see if these line up with your values. We encourage you to disclose to schools the list of considered features and their importance in the model for full transparency. If schools have concerns about the features you used, consider the unintended consequences your product might create. This might feel frustrating, but it’s imperative that your users have an active voice in the logic behind your algorithm.
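The specific technique referenced above is not named in this text. As one possible illustration, the sketch below uses permutation importance: shuffle one feature at a time and measure how much accuracy drops. The model (`toy_model`), the feature names (`absences`, `device_type`), and the data are all hypothetical; a real project would more likely rely on the feature-importance tooling of its ML library.

```python
import random


def accuracy(model, rows, labels):
    """Share of rows where the model's prediction matches the label."""
    return sum(1 for row, y in zip(rows, labels) if model(row) == y) / len(rows)


def permutation_importance(model, rows, labels, feature_names, seed=0):
    """Accuracy drop observed when each feature's values are shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = {}
    for name in feature_names:
        shuffled = [row[name] for row in rows]
        rng.shuffle(shuffled)
        permuted = [{**row, name: value} for row, value in zip(rows, shuffled)]
        importances[name] = baseline - accuracy(model, permuted, labels)
    return importances


# Hypothetical stand-in for a trained model: it only looks at absences.
def toy_model(row):
    return 1 if row["absences"] > 10 else 0


rows = [
    {"absences": a, "device_type": d}
    for a, d in [(2, "tablet"), (15, "laptop"), (4, "tablet"),
                 (20, "tablet"), (12, "laptop"), (1, "laptop")]
]
labels = [toy_model(row) for row in rows]  # labels the toy model fits perfectly

importances = permutation_importance(toy_model, rows, labels, ["absences", "device_type"])
for name, drop in importances.items():
    print(f"{name}: accuracy drop {drop:.2f}")
```

A feature whose shuffling causes no accuracy drop (here, `device_type`) is one the model is not relying on; a large drop marks a feature worth discussing with schools.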
Investigate Your Features
Consider the ways in which your top features may be subject to bias. How was the data behind this feature recorded? What structural or systemic bias influences it? Very few data points are actually objective, especially in education data. For example, the way questions are worded or familiarity with technology can often impact
, reflected in grades data, commonly thought to be objective. Even more subjective, discipline records reflect the reality that Black students are suspended at significantly higher rates than white students for the same behavior. Rather than thinking of data as “subjective” or “objective”, think of it as more or less susceptible to bias. If any of your features are derived from human interaction within your product (for example, the way a teacher determines mastery levels for a student after the student completes a module), these are easy places for bias to enter your algorithm. Over time, this bias can become very hard to distinguish from the “logic” your algorithm learns.
Talk with schools, students, and families to get a complete understanding of what factors may influence the data points that play an influential role in your algorithms. While you may not be able to address the underlying bias in your algorithm, this understanding should influence the way you design your product (
) , including the way educators are trained to use your product.
Activity 2: Minimize Potential Bias
You need a model that not only optimizes for your outcome but also minimizes what could go wrong. This evaluation will be different for every model in every scenario. As with all machine learning, evaluate your
Precision is the share of flagged people who actually have a certain characteristic, so it effectively measures how often you produce false positives (people flagged who do not have that characteristic). For example, how often do you flag a student as a dropout risk who isn’t actually a dropout risk? Or how often do you push a student up to the next reading level when they aren’t actually ready?
Recall is the share of people with a certain characteristic who are flagged, so it effectively measures how often you produce false negatives (people with that characteristic who are not flagged). For example, how often do you overlook a student who is a dropout risk? Or how often do you keep a student at a current reading level who is ready for a more advanced book? Depending on your scenario, you may find recall more important than precision or vice versa.
Low precision means your algorithm is more likely to have a high rate of false positives (for example, incorrectly identifying a student as at-risk of bringing a gun to school), whereas low recall would lead to a high rate of false negatives (for example, incorrectly labeling a student as NOT at-risk of bringing a gun to school when they are actually at high risk). The scenario behind your algorithm influences whether false negatives or false positives are more dangerous, and how to trade off between precision and recall when optimizing your algorithm. In this example, you might say a false negative (not identifying a student as at-risk when they are) is significantly more dangerous than accidentally flagging a student for investigation who is not actually at-risk. Better safe than sorry, right? This would lead you to maximize recall and be willing to sacrifice some precision.
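To make the two definitions concrete, here is a minimal sketch that computes precision and recall from true labels and model flags. The label lists are invented for illustration (1 means "at-risk" or "flagged", 0 means not).

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 marks the flagged class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # flagged, but not at-risk
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # at-risk, but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: which students were truly at-risk vs. which the model flagged.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy data the model catches three of the four at-risk students (recall 0.75) but two of its five flags are wrong (precision 0.60).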
However, it's not that simple. Precision and recall can be different for specific subsets of your population. For example, if precision is actually lower for Black students than for others due to numerous blind spots and data impacted by racist discipline policies, Black students would be identified as at-risk more often than white students, even though they are not more at-risk of bringing a gun to school. This can reinforce the racial bias that contributes to
on Black children. You should investigate fairness metrics as a whole and for subsets of populations based on sensitive data like race and income.
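One way to run that investigation is to compute the same metrics separately for each subgroup. The sketch below assumes each record is a hypothetical `(group, actually_at_risk, flagged)` tuple; the group names and values are invented.

```python
from collections import defaultdict


def precision_recall(pairs):
    """Return (precision, recall) for a list of (actual, flagged) binary pairs."""
    tp = sum(1 for actual, flagged in pairs if actual == 1 and flagged == 1)
    fp = sum(1 for actual, flagged in pairs if actual == 0 and flagged == 1)
    fn = sum(1 for actual, flagged in pairs if actual == 1 and flagged == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def metrics_by_group(records):
    """Group (group, actual, flagged) records and compute metrics per group."""
    grouped = defaultdict(list)
    for group, actual, flagged in records:
        grouped[group].append((actual, flagged))
    return {group: precision_recall(pairs) for group, pairs in grouped.items()}


# Hypothetical records for two student groups.
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 1),
    ("group_b", 1, 1), ("group_b", 0, 1), ("group_b", 0, 1), ("group_b", 1, 0),
]

by_group = metrics_by_group(records)
for group, (precision, recall) in by_group.items():
    print(f"{group}: precision={precision:.2f} recall={recall:.2f}")
```

Here `group_b` has both lower precision and lower recall than `group_a`, which is the kind of gap that an overall average would hide.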
Even if you maximize both precision and recall, the distribution of errors still matters. That is, even if the percentage of errors in your model is low, the errors that do exist may disproportionately impact a specific group of students. For example, if you only misidentified 10 students as at risk out of 10,000 flagged students, this is a low error rate. But if all 10 students were Black (and the total population of students is not all Black), your model is still problematic.
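The example above can be checked with a simple comparison of each group's share of the errors against its share of the flagged population. This is a sketch under assumed numbers: the 2,000 / 8,000 group split is hypothetical, and only the 10 errors out of 10,000 flagged students come from the text.

```python
from collections import Counter


def error_share_ratio(groups, is_error):
    """For each group: (share of all errors) / (share of the population).

    A ratio near 1.0 means errors fall on the group in proportion to its size;
    a ratio well above 1.0 means the group carries a disproportionate share.
    """
    population = Counter(groups)
    errors = Counter(g for g, err in zip(groups, is_error) if err)
    total_errors = sum(errors.values())
    ratios = {}
    for group, count in population.items():
        population_share = count / len(groups)
        error_share = errors[group] / total_errors if total_errors else 0.0
        ratios[group] = error_share / population_share
    return ratios


# Hypothetical split: 10,000 flagged students, 10 errors, all in group_a.
groups = ["group_a"] * 2000 + ["group_b"] * 8000
is_error = [True] * 10 + [False] * 9990

ratios = error_share_ratio(groups, is_error)
print(ratios)
```

With these assumed numbers, `group_a` is 20% of the flagged students but carries 100% of the errors, a ratio of 5, even though the overall error rate is only 0.1%.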
is often contested in education spaces, and it is important that you honor the perspectives of the schools and families you work with to guide your team’s definition of fairness. There are many fairness philosophies that could inform the approach your product takes. Personalized learning, at its core, is based on the idea that different students need different learning experiences to reach the same educational outcome. Most schools and families would agree that students who have fallen behind should receive additional support to catch up. This is equity, not equality. Your algorithm should account for these types of scenarios in alignment with the definition of equity that your schools and their communities adopt. These discussions may lead to difficult conversations within your team and with your users, but this is critical work. The edtech industry needs leaders who consider schools and families as partners in development and will be transparent about the tough questions that educators and now we, as technologists, must confront.
aim to quantify different kinds of bias in ML models. This is a very active field with which your team should become familiar. Various definitions of statistical fairness subscribe to different ethical and philosophical definitions of what it means to be “fair”, a complicated socio-ethical question that is exacerbated by algorithm-based decision making. For example, statistical fairness methods may or may not minimize disparate treatment versus disparate impact. Some people may feel a product is equitable so long as there isn’t “disparate treatment”, meaning that a group of white students and a group of Black students are treated the same by the product. However, other definitions of fairness, such as individual fairness (fairness through awareness), would also evaluate “disparate impact”, meaning that the product should have the same impact on (or outcomes for) white students and Black students. You can use the results from the statistical fairness calculators on our
explains that the right equality metric depends on the scenario at hand and the relevant action that will result. If an algorithm is being used simply to communicate a statistical likelihood of something, that can be taken at face value. But if the algorithm is used to determine the intervention or support a student receives, the schools’ values should be considered. Your team should all be on the same page, along with the schools you serve, about the statistical fairness metrics to which you subscribe. For example, although you should ideally score well across all the types of statistical fairness in the
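As a rough illustration of one statistical reading of disparate impact discussed in this section, the sketch below compares how often each group is flagged (its selection rate) and reports the ratio of the lowest rate to the highest. This is only one of many fairness metrics and is not necessarily what the toolkit's calculators compute; the group names and data are hypothetical.

```python
def selection_rates(groups, flagged):
    """Share of each group that the model flags."""
    totals, selected = {}, {}
    for group, flag in zip(groups, flagged):
        totals[group] = totals.get(group, 0) + 1
        selected[group] = selected.get(group, 0) + (1 if flag else 0)
    return {group: selected[group] / totals[group] for group in totals}


def selection_rate_ratio(rates):
    """Lowest selection rate divided by the highest; 1.0 means equal rates."""
    lowest, highest = min(rates.values()), max(rates.values())
    return lowest / highest if highest else 1.0


# Hypothetical data: 6 of 10 students flagged in group_a, 3 of 10 in group_b.
groups = ["group_a"] * 10 + ["group_b"] * 10
flagged = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 7

rates = selection_rates(groups, flagged)
ratio = selection_rate_ratio(rates)
print(rates, ratio)
```

Whether a ratio this far from 1.0 is a problem depends on what being flagged leads to (extra support versus a penalty), which is why the choice of metric should be made together with the schools you serve.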
Activity 3: Determine the Size of Your Model: One Global Model vs. Many Small Models per School, District, or State
It depends! Training one global model requires access to a comprehensive and representative dataset, which can be very difficult in education and has additional costs associated with it. However, when you train multiple smaller models, you still need to assess for differences in outcomes across different models, as
can lead to unintended discrimination where students are treated differently at different schools or population types. That being said, if you end up with significantly different models across schools or districts, having more than one model will ensure you don’t
and lose predictive power for the nuanced student needs of each population. Research shows that education models developed on certain student demographics do not apply well to a different demographic population –
. We recommend trying both to see how they perform. Discuss concerns with your team and ideally also with researchers, and make sure the perspectives of all team members and experts are not only heard but also addressed.
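A minimal sketch of "trying both" might look like the following. A toy one-feature cutoff classifier stands in for a real model, and the school names, scores, and labels are all hypothetical; the point is only the shape of the comparison, with one model fit on pooled data and one model fit per school, each scored on that school's held-out data.

```python
def accuracy(cutoff, examples):
    """Share of (score, label) examples where 'score >= cutoff' matches the label."""
    return sum(1 for score, label in examples if (score >= cutoff) == bool(label)) / len(examples)


def fit_cutoff(examples):
    """Toy 'training': pick the observed score with the best training accuracy."""
    candidates = sorted({score for score, _ in examples})
    return max(candidates, key=lambda cutoff: accuracy(cutoff, examples))


# Hypothetical (score, label) data; the two schools use different score ranges.
train = {
    "school_x": [(1, 0), (2, 0), (3, 1), (4, 1)],
    "school_y": [(5, 0), (6, 0), (7, 1), (8, 1)],
}
test = {
    "school_x": [(2, 0), (3, 1)],
    "school_y": [(6, 0), (7, 1)],
}

# One global model trained on all schools' data pooled together.
global_cutoff = fit_cutoff([ex for examples in train.values() for ex in examples])

comparison = {
    school: {
        "global": accuracy(global_cutoff, test[school]),
        "per_school": accuracy(fit_cutoff(train[school]), test[school]),
    }
    for school in train
}
for school, scores in comparison.items():
    print(school, scores)
```

In this contrived data the global model fits `school_x` well but under-fits `school_y`. With real data the per-school models could just as easily over-fit, so the comparison should use held-out data and the per-group metrics discussed earlier.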
Use the table below for this activity.
Model Approach: Pros and Cons
One Global Model
Pros:
Less likely to over-fit (must optimize for many different types of students, schools, etc.)
More easily managed by a single team
Cons:
More likely to under-fit (must optimize for many different types of students, schools, etc.)
Requires a large, representative dataset
Requires more work to account for biases and assumptions in data
More difficult to identify when model updates are required and determine whether they are justified across a broad, diverse population
Multiple Small Models
Pros:
Less likely to under-fit (model must optimize for individual, unique subsets of the data)
Gives team the opportunity to better customize model to individual student and school needs
Allows schools without ample historical data to still benefit from your product
Can update individual models without worrying about impact to other schools using the same product
Cons:
More likely to over-fit to specific populations (due to its focus on smaller, specific populations)
Requires additional engineering resources to maintain and monitor models for each school, during initial development as well as ongoing support
Limits reach only to schools or populations with ample data to train a relevant model
Questions to Ask Your Team
How many engineers will we dedicate to monitor and maintain our model(s)? Only commit to as many models as you can adequately maintain for the next two school years with known engineering resources. If you aren’t sure, be conservative.
To how many schools are we planning to roll out these model(s)? Consider your expectations for the next two school years. If you plan to greatly expand the number of schools you work with, you will need to revisit not only your models but also these decisions as you grow.
How confident are we that ML is the right tool for our problem? If uncertain, you might want to try one model and assess its validity and value with a small school population before trying to customize models across many schools.
✅ Activities for Improving Algorithms
Even if you’ve designed the perfect model, you will need to update and retrain it continuously. You should update your models whenever you receive new data, through an automated process if possible. You should also track your models’ performance to ensure that there are no performance regressions. It is up to you to define what your fairness metrics look like; however, at a basic level, precision and recall across protected classes like race and income should not deteriorate over time. Alerts (discussed below) will also indicate a need to change your model before it negatively impacts a student. Refer back to the Activities for Training Algorithms above for debugging tips and ways to identify and address bias in your model.
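A regression check of this kind can be sketched as a comparison of per-group metrics between the previous and the retrained model. The metric values, group names, and the `tolerance` of 0.02 below are all hypothetical; your team would choose its own metrics and bounds.

```python
def find_regressions(previous, current, tolerance=0.02):
    """List (group, metric, old, new) entries where a metric dropped by more than tolerance."""
    regressions = []
    for group, old_metrics in previous.items():
        for metric, old_value in old_metrics.items():
            new_value = current.get(group, {}).get(metric)
            if new_value is not None and old_value - new_value > tolerance:
                regressions.append((group, metric, old_value, new_value))
    return regressions


# Hypothetical per-group metrics for the previous and the retrained model.
previous = {
    "group_a": {"precision": 0.80, "recall": 0.75},
    "group_b": {"precision": 0.78, "recall": 0.74},
}
current = {
    "group_a": {"precision": 0.81, "recall": 0.76},
    "group_b": {"precision": 0.70, "recall": 0.74},
}

regressions = find_regressions(previous, current)
for group, metric, old, new in regressions:
    print(f"{metric} for {group} dropped from {old:.2f} to {new:.2f}")
```

Note that the retrained model improves slightly for `group_a` while precision for `group_b` falls; an overall average could mask this, so the check is run per group.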
Activity 1: Check for Bias in the Way You Improve Your Algorithm Over Time
You will need to constantly update your algorithm over time. The data points you use to correct or validate your algorithm can be opportunities to eliminate and introduce bias at the same time. Allowing teachers or students to disagree with your algorithm can help you catch problems in your model (See
), but incorporating user data at scale can also amplify bias. For example, if a teacher typically assigns content on basketball to her Black students (for legitimate or biased reasons unknown to your product), you might learn over time that Black students (or students who fit a certain profile) should be assigned content at lower reading levels, because the basketball content happens to be at a lower reading level. In a similar situation, students of a lower socioeconomic class may not comprehend math problems about yacht speeds, which might teach your algorithm that these students have lower math proficiency. Make sure you are as rigorous with feedback data as you were with your training dataset at the start of development. It is important that you investigate, in human language, the qualitative and quantitative data points you use to retrain or modify your algorithm. Do this with someone on your team or, ideally, with users who understand the greater educational context.
You can use the chart below to map out possible problems with the data points you plan to use to improve your algorithms.
Data point: Teacher assigns a student content at a lower grade level than the student has been assigned in your app. You might use this information to improve your algorithm over time.
What does this data tell us? We interpret this to mean that the teacher believes the student did not understand the prior material.
How should it influence our algorithm? Our algorithm should incorporate this data to learn over time that students with this history and profile progress more slowly than we previously expected. Note that algorithms could learn this for a specific student population, for example for all Black students, which would eventually amplify biases that already exist in the classroom.
Would others agree? In some cases, this might be helpful and truly improve the algorithm. However, when discussing with teachers, you might learn that there were other explanations for why the teacher assigned a lower level of content. For example, the teacher might have thought the newly assigned content was simply more engaging or relevant for a given student, or the teacher might not have known that the new content was of a lower reading level at all.
Activity 2: Update Your Models Equitably
If you choose to use machine learning in your product, you’ll need to update models when things go wrong. Ensuring that you can update your model at least once a week is critical, but students need WiFi at this same frequency in order for patch delivery systems to work. If you’ll need to continuously update a model in your app, make sure schools are aware of the need for reliable WiFi and the impact to the experience when students are not regularly connected to the internet.
School administrators prefer not to update apps during the school year, in order to minimize disruptions to the learning experience. We recommend you cache the relevant machine learning models on the device so that the app can update them whenever it connects to the internet without requiring a full app update.
Track which versions your users are on. If more than 10% of your users are regularly on older versions, a large chunk of students are using less optimal versions of your product. It is likely that the students not using the newest versions are those with fewer resources. Warn users within the app if it has not connected to the internet after a set number of days.
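The two checks described above could be sketched as follows. The device IDs, version strings, and the 14-day offline limit are hypothetical; only the 10% threshold comes from the text.

```python
from datetime import date

OUTDATED_SHARE_THRESHOLD = 0.10  # from the guidance above
MAX_DAYS_OFFLINE = 14            # hypothetical "set number of days"


def share_on_old_versions(device_versions, latest_version):
    """Share of devices whose cached model version is not the latest one."""
    if not device_versions:
        return 0.0
    outdated = sum(1 for version in device_versions.values() if version != latest_version)
    return outdated / len(device_versions)


def needs_connection_warning(last_sync, today, max_days_offline=MAX_DAYS_OFFLINE):
    """True when the device has gone longer than the allowed number of days without syncing."""
    return (today - last_sync).days > max_days_offline


# Hypothetical fleet: model version reported by each device at its last sync.
device_versions = {"d1": "v3", "d2": "v3", "d3": "v2", "d4": "v3", "d5": "v1"}

outdated_share = share_on_old_versions(device_versions, latest_version="v3")
too_many_outdated = outdated_share > OUTDATED_SHARE_THRESHOLD
warn_user = needs_connection_warning(date(2024, 1, 1), today=date(2024, 1, 20))
print(outdated_share, too_many_outdated, warn_user)
```

Breaking the outdated share down by school or by student group would show whether the students on older versions are concentrated among those with fewer resources.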
Depending on the algorithm you’re working with, the timing of updates can impact student outcomes. You should always evaluate student performance on the same activity with the same ML model. For example, if you update a model before some students have completed their work, they may get penalized by a harsher model or evaluated by a less stringent one. When this happens, it becomes difficult to compare students. This gets complicated when students progress at different rates, so make sure your system is capable of delivering updates in a way that evaluates all students by the same model for a given evaluation.
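One way to get this behavior is to pin a model version to each evaluation the first time any student starts it, and keep using that version until the evaluation is closed. The class and identifiers below (`EvaluationRegistry`, `unit-3-quiz`, the version strings) are hypothetical names for illustration, not part of any existing system.

```python
class EvaluationRegistry:
    """Remembers which model version each open evaluation was started with."""

    def __init__(self):
        self._pinned = {}

    def model_version_for(self, evaluation_id, latest_version):
        """Return the pinned version, pinning the latest one on first use."""
        return self._pinned.setdefault(evaluation_id, latest_version)

    def close(self, evaluation_id):
        """Release the pin once every student has been evaluated."""
        self._pinned.pop(evaluation_id, None)


registry = EvaluationRegistry()

# The first student starts the quiz while v1 is the latest model.
first = registry.model_version_for("unit-3-quiz", latest_version="v1")

# A model update ships; a slower student finishes the same quiz afterwards.
second = registry.model_version_for("unit-3-quiz", latest_version="v2")

# A new evaluation started after the update picks up the new model.
new_eval = registry.model_version_for("unit-4-quiz", latest_version="v2")
print(first, second, new_eval)
```

The trade-off is that devices must keep the older model cached until the evaluation closes, which matters for the update and storage considerations above.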
Activity 3: Monitor Continuously for Inequities
You can use internal metrics to track whether bias is entering your product’s automated decision processes. Set up metric thresholds that would indicate something went wrong. For example, you could set a predefined threshold alert if the % of Black or Brown students recorded as having behavioral problems is a standard deviation away from the % of Black or Brown students in the total population. When these metrics fall out of predefined bounds, alert your team to fix the problem immediately. This should be a continuous process to ensure your product remains equitable over time. If you don’t have access to racial data, use proxies. This is an important part of knowing what impact your product actually has.
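The threshold alert in the example above could be sketched like this. The text does not say how the standard deviation should be computed, so the sketch makes an assumption: it treats the flagged students as a random sample and uses the binomial standard deviation of the group's expected share. The counts are hypothetical.

```python
import math


def representation_alert(group_flagged, total_flagged, group_population,
                         total_population, max_std=1.0):
    """Compare a group's share among flagged students with its share of the population.

    Returns (alert, z), where z is the gap measured in standard deviations.
    Assumption: the standard deviation is the binomial sampling deviation of the
    expected share for a sample the size of the flagged set.
    """
    if total_flagged == 0:
        return False, 0.0
    expected = group_population / total_population
    observed = group_flagged / total_flagged
    std = math.sqrt(expected * (1 - expected) / total_flagged)
    z = (observed - expected) / std if std else 0.0
    return abs(z) > max_std, z


# Hypothetical counts: the group is 30% of students but 45% of those flagged.
alert_high, z_high = representation_alert(45, 100, 300, 1000)

# Hypothetical counts: the group is 30% of students and 31% of those flagged.
alert_ok, z_ok = representation_alert(31, 100, 300, 1000)
print(alert_high, round(z_high, 2), alert_ok, round(z_ok, 2))
```

A one-standard-deviation bound will fire often on small samples; the bound and the metric itself are choices to make with the schools and families you work with.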
Talk to your users and ask them what concerns they have, and monitor for these. What biases do they see in their existing systems? Set and monitor thresholds accordingly. When implemented in partnership with students and teachers, this can be a way for technology to help schools correct existing racial bias.
🎯 Conclusion: Teach Your Algorithms Well
As you know, algorithms are not unbiased! In addition to the data you use to train your algorithm, the metrics you choose to optimize reflect the set of values to which your team subscribes. Make sure the schools and families you work with would subscribe to these same values, and put checks in place to test for unintended racial or socioeconomic bias in your models and their outputs.
Next, let's explore the way your algorithms are employed and how your product is