This section explains how to explore the racial bias that may currently or historically contribute to the data you've collected. All algorithms, however basic or complex, require a dataset. Although data is often considered “objective,” it is far from it. Data reflects the dynamics, including bias, of a given context. Racial bias exists in many forms in today’s school systems as a result of decades of racial and socioeconomic segregation, as well as new and old systemic structures that disadvantage Black, Brown, and low-income students. This section also includes a
with some basic code to demonstrate how you might find blind spots in your dataset. As technologists in education, we must think critically about the datasets we use and the biases they perpetuate, even when they are “accurate.” This section covers common types of biases that can exist in your dataset and ways to address these issues.
After this section, you'll investigate the ways you use this data for
Identify missing or biased data and modify your dataset
Identify forms of bias that must be addressed outside of your algorithm
Future-proof your dataset with an update plan
Make sure you understand the context of your dataset. Take responsibility for understanding the social contexts in which the data was collected, and the history behind this data. Even if you get your data from a school, you should be able to thoroughly explain the context behind each of the data points yourself. If you’ve collected the data yourself from within your product, make sure the whole team understands what this data represents and the social context in which students and teachers use your product. Ask yourself questions such as:
What data do we need? Where and how did/will we get the data?
What is correlated with race? What racial bias might exist in the data?
Are the impacted populations comfortable with what the data suggests?
Are these aligned with the outcomes they want?
Does the data represent our expected user population? Is there enough data in all quadrants?
✅ Activities for Datasets
Section 1: Finding Problems
Activity 1: Who Should Be Present in Your Dataset
This activity emphasizes basic good data practice, but it's still important to call out before you check for blind spots. Identify which students make up your current and target populations. How does their profile differ from that of an “average” hypothetical student, and what cultural challenges and perspectives are important to represent? If you hope to serve mostly large urban school districts across America, your dataset should represent a more diverse population than a subset of smaller, suburban districts. It is fine to begin with a small dataset while partnering with a few customers, but before you set your sights on a larger customer base, make sure you can access a larger, representative dataset that will match the population of schools you hope will deploy your product. Research shows that you need to validate that your model works well for students of different races, of English Language Learners (ELL) vs. non-ELL status, and of
Exercise: Identify your target population and research the demographics of students and teachers at these schools. Compare this to the demographic breakdown of your dataset, and modify where needed. Contrary to what you might think, you should make sure race is included! Even if you can’t collect race for individual student data points, you can use the overall demographic breakdown of schools whose data you use.
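One way to run the comparison in this exercise is sketched below. It is a minimal illustration, not a prescribed method: the function name `representation_gaps`, the `race` field, the group names, and all of the numbers are hypothetical stand-ins for your own records and for the demographic shares you researched for your target schools.

```python
from collections import Counter


def representation_gaps(records, target_shares, group_key="race"):
    """Return {group: dataset_share - target_share} for every group seen
    in either the dataset or the target population."""
    if not records:
        raise ValueError("records is empty")
    counts = Counter(r[group_key] for r in records)
    total = len(records)
    groups = set(counts) | set(target_shares)
    # A Counter returns 0 for groups that never appear in the dataset.
    return {g: counts[g] / total - target_shares.get(g, 0.0) for g in groups}


# Hypothetical toy data: 10 student records and made-up target shares.
records = (
    [{"race": "White"}] * 6
    + [{"race": "Black"}] * 2
    + [{"race": "Hispanic"}] * 2
)
target_shares = {"White": 0.4, "Black": 0.3, "Hispanic": 0.3}

gaps = representation_gaps(records, target_shares)
# Negative values mark groups that are under-represented in the dataset.
for group, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{group}: {gap:+.2f}")
```

If you only have school-level demographic breakdowns rather than per-student race, the same comparison can be made by building `records` from each school's published shares, weighted by how many data points that school contributed.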
Activity 2: Check for (Data) Blind Spots
Now that you know who should be represented in your dataset, identify what scenarios should be present in your dataset too. Machine learning algorithms can only learn based on the data you provide. This places enormous responsibility in the hands of the provider of data (you!). If there are blind spots in your data, your algorithm will struggle to handle students who fall in those blind spots effectively. For example, even if you have enough non-white students represented in your data, if your dataset contains few examples of non-white students who “succeed,” your algorithm will struggle to advance non-white students.
to demonstrate ways to find blind spots in your datasets.
This is an often-skipped yet critical step. You should understand not only your target population of students but also the type of students that should be represented in your dataset. You will need to address these blind spots by improving your dataset. For example, if you find that your data has only a few examples of Black students who receive all A’s throughout high school, you’ll need to modify your dataset to widen your sample. During development, we encourage you to share your blind spots with the schools you work with, and especially with entities that provided data to you. They might recognize certain blind spots as problems that you may not recognize on your own. Some of these gaps you may not be able to fix, but you will need to consider them in the design and implementation of your product.
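A simple way to look for the kind of blind spot described above is to count how many examples fall into each combination of student group and outcome, and flag the combinations that are nearly empty. The sketch below assumes hypothetical field names (`race`, `grade_band`) and an arbitrary threshold of 30 examples; pick a threshold that makes sense for your model and data.

```python
from collections import Counter
from itertools import product


def find_blind_spots(records, group_key, outcome_key, min_count=30):
    """Return {(group, outcome): count} for every combination that has
    fewer than min_count examples, including combinations with none."""
    counts = Counter((r[group_key], r[outcome_key]) for r in records)
    groups = sorted({r[group_key] for r in records})
    outcomes = sorted({r[outcome_key] for r in records})
    return {
        (g, o): counts[(g, o)]
        for g, o in product(groups, outcomes)
        if counts[(g, o)] < min_count
    }


# Hypothetical toy data: plenty of Black students overall,
# but very few recorded with an "A" grade band.
records = (
    [{"race": "White", "grade_band": "A"}] * 40
    + [{"race": "White", "grade_band": "C"}] * 35
    + [{"race": "Black", "grade_band": "A"}] * 3
    + [{"race": "Black", "grade_band": "C"}] * 32
)

blind_spots = find_blind_spots(records, "race", "grade_band", min_count=30)
for (group, outcome), count in blind_spots.items():
    print(f"Only {count} examples of {group} students with outcome {outcome}")
```

Note that a check like this only finds gaps along the dimensions you think to test. Sharing the resulting list with the schools that provided the data can surface combinations you did not consider.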
“But race is a sensitive data point, so we don’t collect it”, said every company ever. Fairness through unawareness
, and treating all students “the same” can lead to gaps in awareness about how technologies and algorithms disadvantage particular populations. Tracking race and other sensitive data points will help you identify implicit bias, representation or sample bias, and less obvious blind spots in your data. Even if you aren’t able to collect individual students’ race, you can always use schools’ demographic breakdowns, which are public information.
Activity 3: Check for Proxies
"We don't collect race data, so we're good, right?" Race shouldn't be an input feature for a machine learning algorithm in education (and this is illegal in many cases). However, your data might implicitly encode race in other features. Many features, or combinations of features, in your dataset, might correlate strongly with race, such as free and reduced lunch (FRL) status, discipline records, and zip code. You may also find that combinations of features correlate with race that you didn’t expect. These findings require you to think deeply about why this might be the case and to share them with schools you work with. For example, certain demographics may be
, reflected in the number of hints requested. Talk with schools to learn what this might mean and how your model could perpetuate or account for this behavior. Furthermore, even if you don’t collect race or any proxies for it, you are still responsible for testing whether your algorithms and products
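One rough way to test whether a feature acts as a proxy for race is to ask how much better you could guess a student's race if you knew that feature. The sketch below is only one of several possible checks, and everything in it is illustrative: the function name `proxy_strength`, the `zip_code` field, and the toy counts are hypothetical, and race is used here purely for auditing, not as a model input.

```python
from collections import Counter, defaultdict


def proxy_strength(records, feature_key, group_key="race"):
    """Return (baseline_accuracy, accuracy_with_feature) for guessing
    group_key, first with no information and then from feature_key alone."""
    n = len(records)
    if n == 0:
        raise ValueError("records is empty")

    # Baseline: always guess the most common group.
    overall = Counter(r[group_key] for r in records)
    baseline = overall.most_common(1)[0][1] / n

    # With the feature: guess the most common group within each feature value.
    by_value = defaultdict(Counter)
    for r in records:
        by_value[r[feature_key]][r[group_key]] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return baseline, correct / n


# Hypothetical toy data: zip code "A" is mostly White, zip code "B" mostly Black.
records = (
    [{"zip_code": "A", "race": "White"}] * 45
    + [{"zip_code": "A", "race": "Black"}] * 5
    + [{"zip_code": "B", "race": "White"}] * 10
    + [{"zip_code": "B", "race": "Black"}] * 40
)

baseline, with_zip = proxy_strength(records, "zip_code")
print(f"baseline: {baseline:.2f}, knowing zip_code: {with_zip:.2f}")
```

A large jump from the baseline suggests the feature carries racial information even though race itself is not an input. The same function can be run over each candidate feature, or over a combined key built from several features, to look for unexpected combinations.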
to categorize data in your dataset (e.g., you might assign labels in the form of test scores, academic grades, dropout events, etc. to each student record). It's easy to think of labels as the "truth," the same way we often think of school test results or true/false events as “objective” measures of learning or success. In reality, however, many of these labels are frequently influenced by
If you don't explicitly investigate for potential bias, your ML model will encode this bias into your product. This is especially true if you choose to use unsupervised ML, which uses algorithms to predict labels or cluster data according to patterns the algorithm finds, rather than labels your team provided. Algorithms can introduce even more bias when creating their own categories than a thoughtful human might. Furthermore, many of these metrics, like grades or dropout events, suffer from a lack of context and confirmation: Were grades assigned fairly? Would other teachers agree with the grade determination? Was the content culturally accessible, or did it mention places most of the school’s Black students have never visited? Did zero-tolerance suspension policies change after the data was recorded? What other pieces of data about a student, like the projects they created or the way they communicate their ideas and work with other students, are missing? Can we accurately categorize a dropout as “negative”? These are already complex questions in the education space, and they are further complicated by the use of technology and algorithms. It is even more important that you and the schools you work with openly discuss the ways bias is reflected in school data and how your product incorporates this data, so that humans at the school can interpret this information correctly.
Proxy Labels and Label Prediction: Sometimes, exact labels as you want to use them won’t exist, so you'll use a different feature (or metric) as a proxy for the label you want to use. For example, you may choose to use time spent on an assignment as a proxy for student engagement, but these two features are not exactly the same. It is important to ensure you investigate the relationship between your ideal label and the feature you use as its proxy to assess for ways this could go wrong. In this example, what contributes to time spent that may not apply to student engagement? When might time spent and student engagement not be correlated? Perhaps students who are confused or bored will take longer to move through the same modules. Make sure you account for these in the conclusions you take away from the data or, eventually, the way you interpret the output from your algorithms.
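If you can obtain even a small sample where the ideal label was measured directly (for example, teacher-rated engagement), you can check how well the proxy tracks it for each group of students. The sketch below assumes such a sample exists; the field names (`minutes_spent`, `engagement`, `ell_status`) and the numbers are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (NaN if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def proxy_agreement_by_group(records, proxy_key, label_key, group_key):
    """Return {group: correlation between the proxy and the ideal label}."""
    by_group = {}
    for r in records:
        by_group.setdefault(r[group_key], []).append(r)
    return {
        g: pearson([r[proxy_key] for r in rows], [r[label_key] for r in rows])
        for g, rows in by_group.items()
    }


# Hypothetical sample: time spent tracks engagement for one group, but for the
# other group longer times go with lower engagement (e.g. confusion).
records = [
    {"ell_status": "non-ELL", "minutes_spent": m, "engagement": e}
    for m, e in [(10, 1), (20, 2), (30, 3), (40, 4)]
] + [
    {"ell_status": "ELL", "minutes_spent": m, "engagement": e}
    for m, e in [(10, 3), (20, 4), (30, 2), (40, 1)]
]

agreement = proxy_agreement_by_group(
    records, "minutes_spent", "engagement", "ell_status"
)
for group, r in agreement.items():
    print(f"{group}: correlation = {r:+.2f}")
```

A proxy that correlates well with the ideal label for one group but poorly, or negatively, for another is a sign that conclusions drawn from it will be less reliable for that group.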
To Do: What if my dataset doesn’t have labels or a proxy? Read
With machine learning, the labels you choose will teach your algorithm what is “right” and “wrong”. Your machine learning algorithm will eventually learn to output its own values based on what it has learned. For example, imagine that an algorithm learns from real discipline data in which Black boys are suspended at a
compared to other students. The algorithm will "correctly" learn to categorize Black boys' behavioral records as severe and reinforce the existing bias, based on what it learned from real data. Rather than automating the bias in existing systems, what you determine to be “right” or “wrong” should be informed by the values of the schools and communities you work with. You’ll need to label your training data to teach your algorithm that the suspension of a Black boy and not a white boy with the same behavioral record is "wrong", so that the judgment of your algorithm aligns with your desired values and not the current system today. This is an obvious case but
In this example, including more data from Black students in your training data won’t solve the problem, because we don’t have enough education systems free from racial bias to provide examples of Black students treated fairly.
explains a similar challenge in biased recruiting: “It may be even more challenging in other arenas to find a target variable that does not encode racial skewing vis-à-vis the actual outcome of concern. In the employment context, for instance, employers want to predict success on the job. But the data on past success may be skewed by the company’s past discrimination in hiring or promotion practices. There is nothing in the past data that reliably represents “job success” in a nondiscriminatory environment.”
To Do: Ask schools you work with about the world they want to live in and reflect on the type of school environment you hope to support with your product.
than other students. This toolkit aims to help you identify historical biases that create this phenomenon so that you can mitigate them rather than automating the existing bias in the system. It may be necessary to modify labels to account for scenarios in which Black boys should not have been suspended, to create a machine learning algorithm that aligns with the community’s values.
Activity 6: Keep Your Dataset Fresh
Datasets are not static; they grow and adapt over time. If your dataset is composed purely of student data from 30 years ago, your results won’t be relevant for students today. It is important to craft a forward-looking dataset strategy at the start to ensure your data stays relevant.
To Do: Depending on the type of data you’re using and your use case, you may need to update your dataset every year or even more often. Here are some questions to consider:
Questions About Your Dataset
How often must we update our dataset?
Options: Every 5-10 years, annually, or more frequently.
Most education data can be updated annually, but you should also keep track of policy changes both at the state and local levels or the school level. Are schools shifting attendance or discipline policies to record data differently? Did schools embrace a vastly new type of curriculum focus? For example, students’ experiences learning from home during the COVID shutdowns may drastically impact some of your assumptions.
How much data do I need to add at once?
Options: replace all data vs. segments of data
Say you notice that your model performs poorly on high-performing Black students in 8th through 10th grade math classes, and you manage to obtain some data that could correct this issue. Do you add only the data points for 8th through 10th graders? Or should you try to correct this problem across all grades? Should you only add the data points for math students? Or is it possible this problem persists across multiple areas of study? Make sure you ask these questions and recalculate all possible scenarios to assess the impact of incorporating this new data. Go back to the values of the schools you work with to ensure the outcomes align with these values.
Who can change the dataset?
Options: Only changes approved by the CTO, only changes approved by the ML anti-bias lead, only changes made by engineers who have x certifications, etc.
A newly hired AI engineer has a theory that the model is performing poorly on high performing Black students in 8th through 10th grade math classes. Should he be allowed to change the model in production? If not, whose approval does he need, and what evidence does he need to show to earn it? What if a senior AI engineer says she agrees with him but someone outside the engineering department has concerns?
Will the dataset get updated for all users at once, or is it updated over time?
When you make a change, you should always test to see the impact of making a change across any potential users to ensure it has the intended effect and no additional unwanted side effects. Options: all users should see this update, only users using the same model where the issue was found, only similar users see this update, etc.
Your company has decided to use one global model and one global dataset across all schools. You have concluded the model is performing poorly on high performing Black students in 8th through 10th grade math classes in a handful of schools, but it appears to be performing correctly for the other schools. Assume your engineers have concluded that changing the dataset would alleviate this problem in the schools with these issues. Do you change the dataset for all schools? Just the schools with issues? Do you change the dataset for one of the schools with issues, see how it goes, then expand to other schools with issues? Make sure you have the expertise on your team to grapple with these questions before they arise.
Summary Questions for Finding Problems
Does your data represent the population of schools you want to work with?
What do your labels actually mean? Are they an actual categorization or a proxy?
Where did your labels come from? What is the history behind them? Who created them? What bias may have influenced them? Would others agree with these labels?
Were the labels cross-checked or validated?
Do the labels align with your company values and those of the schools you work with?
Do you have a plan for where to get new labels as your datasets change?
Section 2: Addressing Problems
So you’ve found some tricky places where bias might come into your product through a dataset you are using. What can you do now? This is a complex and very important area. These challenges may bring up difficult conversations within your team, but know that the discomfort is worthwhile. Change begins with awareness and requires continuously challenging conversations. Just like people, every dataset has its own “perspective” or bias. It is much better to investigate this bias than to turn a blind eye. Fairness through unawareness usually does not work.
Unfortunately, there’s no magic wand to “unbias” your dataset – this is still an active area of research. However, you can use strategies to ensure your model is “biased” in a way that aligns with your values and those of your schools. For example, if you recognize that Black students demonstrate slower learning progress because of a suite of challenges they’ve faced in their schools, how can your model provide or recommend additional, tailored support to help these students achieve the same outcomes as other students? This helps you focus on achieving equitable outcomes that school communities believe are fair, rather than blindly treating all students the same and ignoring the fact that your product works better for some students than others.
In this section, we’ll help you identify what is and isn’t fixable and scope the work required to fix it. For the unfixable elements, we encourage you to reconsider whether ML is appropriate. If you choose to proceed, make sure to disclose the problems you couldn’t address. Education is messy, but it is also a team sport. Work closely with schools, students, and parents to tackle problems that cannot be addressed with a technical solution. Your unsolved problems may uncover challenges of which schools were unaware, and your data might help them make a case for change. You should also consider these areas in the
with your team and to remember what should be disclosed to schools, students, and families. Many inequities cannot be addressed or fixed within your technology, but you should partner with schools to not only bring these issues to their attention but also make sure students, teachers, and families are aware of how to use your product in a way that does not perpetuate these biases. You can start with a table like the one below with your team.
Where does it come from?
Can we address it in the dataset?
If no, how should we disclose this?
Not enough ESL students in training data.
Schools we received data from did not have many ESL students.
Model doesn’t work well on students with accents.
Yes, we can acquire more training data from ESL students or students with accents.
If we can’t, we should disclose to schools that our model may not work well for ESL students or students with accents.
Students with the most suspensions and lowest attendance rate are Black boys.
Implicit racial bias in schools causes teachers to suspend Black students more often than white students for the same infractions.
In the schools we received data from, Black students tend to live much farther from the school and their buses often arrive late to school.
Model highlights a disproportionate number of Black students as at-risk.
If product is used to categorize students in academic tracks based on predicted outcomes, model could incorrectly assign Black students to lower academic tracks.
No, we can alter the dataset to amplify examples of Black students with no discipline record and high attendance, but this may cause some Black students to not receive the additional support they need.
The underlying bias must be disclosed and addressed in the design of the app.
Schools should be made aware that a key group of students (Black students) has disproportionately received higher rates of suspension and lower attendance rates.
Make sure you deeply understand why this is happening rather than accepting it as the norm.
Incorporate this understanding into the design of the product as a “risk” such that model predictions never dictate the academic track of a student or blindly label a student as “at risk”.
Very few data points of students with accents also get high quiz scores
Schools do not have adequate support for English language learners
Algorithm might learn that students with accents shouldn’t progress to advanced reading levels or may not make as accurate recommendations for this group of students compared to others.
No, this problem happens because of bias in the school system. Amplifying examples of high performing students with accents may work but it may also throw off the model’s accuracy for students in this group.
The underlying bias must be disclosed and addressed in the design of our product.
Draw schools’ attention to the fact that they may be underserving their students with accents.
Make sure this is acknowledged during training and implementation as an area of inquiry when teachers and students use your product and consider this fact in the design of your product. For example, if you know that students with accents aren’t progressing as fast as the rest of the class, don’t automatically place students on a lower academic track. Instead suggest that students receive additional support in specific language areas.
Less data for Black and Brown students at advanced reading levels
Black and Brown students were historically placed in less rigorous reading classrooms.
Model may learn that this group of students should progress more slowly than others.
No, this problem happens because of bias in the school system. Amplifying examples of Black and Brown students at advanced reading levels may work but it may also hurt Black and Brown students who need additional reading support.
The underlying bias must be disclosed and addressed in the design of our product.
Make sure schools are aware of the fact that they may be underserving Black and Brown students in literacy.
Make sure this is acknowledged during training and implementation as an area of inquiry when teachers and students use your product and consider this fact in the design of your product. For example, if you know that Black and Brown students are traditionally placed in less rigorous reading classrooms, don’t recommend placing students on lower academic tracks as a result of your model. Instead suggest that students receive additional support or be assigned more engaging types of reading content.
Note: This example calls out race and language, but bias can arise from many other features, including features that act as proxies for race, such as income, zip code, or learning challenges.
Modify Data to Address Bias
You can use statistical techniques to fix biases you find, such as amplifying low-frequency examples. For example, if a training dataset associates English language learners with lower academic outcomes, you can statistically amplify examples of “successful” students with accents. For instance, take a literacy app that assesses students by listening to them read out loud. Imagine that in the schools that provided you with data, there were only a few English learners, and most of these English learners answered questions incorrectly because they had not received the language support they would require to catch up. As a result, you have only a few data points for students with accents who answered questions correctly. Your algorithms would have a hard time incorporating correct answers with accents and might also learn from these patterns that students with accents are typically wrong. You can address these trends by incorporating more examples of students with accents who provide the correct answer, or by amplifying the weight of the examples you do have. If you aren’t able to address this within your dataset, disclose these concerns to the schools you work with or to education researchers; these are problems they grapple with on a daily basis.
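Amplifying the weight of the examples you do have can be sketched as follows. This is one simple reweighting scheme among many, shown under stated assumptions: each combination of group and label is given the same total weight, and the field names (`has_accent`, `answer`) and counts are hypothetical. Many training libraries accept per-example weights of this kind, but check the documentation for the one you use.

```python
from collections import Counter


def balancing_weights(records, group_key, label_key):
    """Return one weight per record so that every observed (group, label)
    combination carries the same total weight. Weights sum to len(records)."""
    cells = [(r[group_key], r[label_key]) for r in records]
    counts = Counter(cells)
    n_cells = len(counts)
    total = len(records)
    # Rare combinations get large weights; common ones get small weights.
    return [total / (n_cells * counts[cell]) for cell in cells]


# Hypothetical toy data for a read-aloud literacy app.
records = (
    [{"has_accent": True, "answer": "correct"}] * 2
    + [{"has_accent": True, "answer": "incorrect"}] * 8
    + [{"has_accent": False, "answer": "correct"}] * 60
    + [{"has_accent": False, "answer": "incorrect"}] * 30
)

weights = balancing_weights(records, "has_accent", "answer")
print(f"weight for accent + correct:    {weights[0]:.3f}")
print(f"weight for no accent + correct: {weights[10]:.3f}")
```

Reweighting cannot create information that is not in the data: with only two real examples, the model still sees only two voices, just counted more heavily. Evaluate the reweighted model separately for each group before relying on it, since heavy amplification can throw off accuracy for the very students it is meant to help.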
Note: Fixing datasets requires an understanding of different approaches to “fairness,” a very active and hotly debated research area. We will discuss this more in the section
referenced above provides a few techniques for mitigating dataset bias.
🎯Conclusion: It's all in the data
It's critical that you understand the full context around the data you collect and use to make conclusions, train your algorithms, and improve your product over time. Decades of racial and socioeconomic bias are embedded in educational outcome data and still impact student outcomes every day. As you build your product on top of new and existing data, make sure you understand how this data is recorded, what factors influence each data point, and what possible outliers might challenge your assumptions. This section provided examples of blindspots that can bias your datasets and suggestions for how to address them before you train your algorithms.
Now that you've evaluated your datasets, let's move on to