Testing & Evaluation

This section explains how to consider racial bias in the way you test and evaluate both your model and your overall product. Model Testing and User Testing are two critical ways for companies to identify racial bias in AI used in edtech. At the end of the day, testing for racial inequity is the most powerful way to understand the impact of your technology on Black and Brown students. Even if you design and develop products in an equitable way, you must still test for disparate outcomes, both in your model's output and across outcomes for students who use your product. Whatever you find, it is far better to be open and transparent with schools and their communities than to pretend your product doesn't play a role.
After this section, you'll look at the implementation of your product.
🎯 Goals
Assess your test dataset for racial bias and ensure it’s effective for your target population
Evaluate and select the best ML model(s)
Design testing scenarios that account for racial equity
✅ Activities for Testing and Evaluation
In this section, we cover activities to assess for racial bias during both model testing and user testing.
Activity 1: Test Your Model Outputs
Is My Test Dataset Good?
Make sure your test dataset, the data you use to test how well your ML algorithm works, is representative of your target population, even (and especially) in cases of protected classes like race and family income. This may seem counterintuitive, but it is very important that the populations with which you train and test your algorithms share the same challenges in an educational context as the students who will eventually use your product. For example, if your test dataset is from pilot schools that are majority white and affluent, but you hope to deploy your product across public schools in America, you won't be able to evaluate your models' applicability to this larger population. This would also happen if you use test data from a particular type of school (ex. charter schools or private schools) or from particular geographical regions (ex. rural vs. urban), and it is especially relevant for global products that are designed and developed in the western hemisphere.
Compare the demographic attributes in your test dataset against your target population. Some examples include race, family income, zip code, attendance status, language, digital/mobile access, urban/suburban/rural, accessibility, school type, and class size. Investigate the differences. Remember that you don't need these data points for individual students, just the overall breakdown of your population. Ask schools for this information and explain to them why it's important for your model.
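To make this comparison concrete, here's a minimal sketch in Python. It compares the aggregate breakdown of one attribute (race, in this invented example) between your test dataset and your target population, and flags any category that differs by more than a chosen threshold. The category names, numbers, and the 10-percentage-point threshold are all illustrative assumptions, not standards.

```python
# Sketch: flag demographic gaps between a test dataset and the target population.
# Categories, proportions, and the 0.10 threshold are illustrative assumptions.

def demographic_gaps(test_breakdown, target_breakdown, threshold=0.10):
    """Return categories whose share of the test data differs from the
    target population by more than `threshold` (shares are proportions)."""
    gaps = {}
    for category, target_share in target_breakdown.items():
        test_share = test_breakdown.get(category, 0.0)
        diff = test_share - target_share
        if abs(diff) > threshold:
            gaps[category] = round(diff, 3)
    return gaps

# Hypothetical aggregate breakdowns -- no individual student data needed.
test_data  = {"Black": 0.08, "Latino": 0.10, "White": 0.70, "Asian": 0.12}
target_pop = {"Black": 0.25, "Latino": 0.30, "White": 0.35, "Asian": 0.10}

print(demographic_gaps(test_data, target_pop))
# → {'Black': -0.17, 'Latino': -0.2, 'White': 0.35}
```

You can run the same check for each attribute you collect (family income, urbanicity, school type, and so on); the flagged gaps are exactly the differences worth discussing with schools.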
Share any differences with the schools you work with. If you can't get access to this data in aggregate from schools, be transparent about the demographics of data you used for training and testing when you talk with schools. Studies show that products designed for a wealthy suburban population don't work for lower-income students in urban or rural areas. Letting a school know that you did or did not test your algorithms on datasets that include 50% Black and Brown students is meaningful to their efforts to prioritize equity. Discuss with them how differences between their own population of students and the ones in your test data might impact outcomes. How might this impact their use of your product? You likely can't get a perfectly representative dataset, but exploring these questions will help you catch many issues.

Activity 2: Compare Outcomes and Experiences for All Students
If you are a small or tightly resourced edtech company, we understand you will not be able to run a double-blind, mass-scale, randomized controlled trial study to assess bias. That doesn’t get you off the hook! The best and only way to account for racial equity in your product is to prioritize this assessment during testing. Make sure you have a significant enough number of Black and Brown students and teachers who will test your product early enough that you can incorporate their perspectives into your product design and development. You should actively test for differences in both experience and outcome of students of different races and incomes.
Even if you aren't able to collect race or language data for individual students, you can always get the demographic breakdown of the schools with which you test or schools to which you sell your product. You should also incorporate urbanicity, as studies show that products designed for urban or suburban students don't work for rural students and vice versa. It's critical that you take this step to compare outcomes across demographic groups; unawareness is not sufficient.
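Even school-level demographics are enough for a first-pass outcome comparison. Below is a minimal sketch that groups hypothetical pilot schools by their share of Black and Brown students and compares average outcomes between the two groups. The school records, the 50% grouping rule, and the 0.80 ratio check (loosely inspired by the "four-fifths" rule of thumb used in disparate-impact analysis) are all invented for illustration.

```python
from statistics import mean

# Sketch: compare average outcomes across school demographic groups.
# All school records and the 0.80 ratio check are illustrative assumptions.
schools = [
    {"name": "School A", "pct_black_brown": 0.72, "avg_score_gain": 4.1},
    {"name": "School B", "pct_black_brown": 0.65, "avg_score_gain": 3.8},
    {"name": "School C", "pct_black_brown": 0.18, "avg_score_gain": 6.2},
    {"name": "School D", "pct_black_brown": 0.22, "avg_score_gain": 5.9},
]

# Group schools by whether Black and Brown students are the majority.
majority = [s["avg_score_gain"] for s in schools if s["pct_black_brown"] >= 0.5]
other    = [s["avg_score_gain"] for s in schools if s["pct_black_brown"] < 0.5]

ratio = mean(majority) / mean(other)
print(f"Outcome ratio (majority-B&B schools vs. other schools): {ratio:.2f}")
if ratio < 0.80:
    print("Warning: possible disparate outcome -- investigate before scaling.")
```

A ratio well below 1.0, as in this invented data, doesn't prove bias on its own, but it tells you exactly where to dig deeper with the schools involved.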
In reality, this is not just the right thing to do – it's also good for business. Large, multi-billion dollar companies like Google and IBM understand this well. Corporations have large product inclusion teams to ensure products are designed and developed for diverse users (a.k.a. paying customers). This may feel like a lot of work, but the cost of ignoring these differences can be very high.
Activity 3: Devise An Equity-Centered Test Plan
You already write internal test plans and run external pilot tests for your product to identify bugs. Use this time to assess for racial equity as well. Of course, you should have a diverse group of internal and external users who test your product, but you can also ensure that the scenarios in which you test incorporate a diversity of experiences.
If you're not sure where to start, you can use the below questions to evaluate how equitable your existing testing plans are. Your users can be a great help in finding holes in your testing plan. Want to be more thorough? Share the testing scenarios you plan to use with a diverse set of schools, students, and families to get feedback on anything you might have missed. This doesn't just apply to race or income. It also applies to learning differences, accessibility, and more. The more transparent you are, the more equitable the industry can become, together.
Questions to Ask Your Team
Do we test for differences in learning outcomes based on students' race or income? If you don’t have this data for individual students, you can at least test based on individual school demographics which are public information. This should be the bare minimum.
Do we have a plan to get ongoing feedback at multiple stages of development? This should include early prototypes, beta versions, final products, and ongoing testing after implementation.
Do we have a large enough percentage of Black and Brown students of diverse family incomes testing our products? Teachers?
Are we testing our product across learning environments (ex. loud background noise, intermittent sessions, in transit, etc.)?
Have we ensured that testing partners feel comfortable providing detailed, honest feedback?
What are the gaps in the personal experience of the people testing our products internally? How can we make these perspectives part of our testing plan?
Do our internal testing scenarios assess user experience in addition to algorithm effectiveness? Do we incorporate the experience of students, in addition to teachers or parents?
Do our internal testing scenarios imitate environments with no, weak, or intermittent connectivity?
Do our internal testing scenarios account for lack of familiarity with devices?
Do our internal testing scenarios check for racial, cultural, and socioeconomic relevance in our content?
Do our internal testing scenarios ensure intuitive opportunities for explicit and implicit feedback from teachers and students?
Do our internal testing scenarios test for transparency of algorithms? Do users understand how the product works and what recommendations are based on?
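The first question above — testing for differences in outcomes by group — applies to your model's outputs as well as your product's. One lightweight sketch: compare a model's accuracy and false-positive rate per group on labeled test results. Everything here (the group labels and the (group, true label, predicted label) triples) is hypothetical, and which error rate matters most depends on what your model's predictions are used for.

```python
from collections import defaultdict

# Sketch: compare a model's accuracy and false-positive rate per group.
# The (group, true_label, predicted_label) triples are hypothetical.
results = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 0, 1), ("group_a", 1, 1),
    ("group_b", 1, 0), ("group_b", 0, 1), ("group_b", 0, 1), ("group_b", 1, 1),
]

stats = defaultdict(lambda: {"correct": 0, "total": 0, "fp": 0, "neg": 0})
for group, truth, pred in results:
    s = stats[group]
    s["total"] += 1
    s["correct"] += (truth == pred)
    if truth == 0:                 # only negatives can be false positives
        s["neg"] += 1
        s["fp"] += (pred == 1)

for group, s in sorted(stats.items()):
    accuracy = s["correct"] / s["total"]
    fpr = s["fp"] / s["neg"] if s["neg"] else 0.0
    print(f"{group}: accuracy={accuracy:.2f}, false-positive rate={fpr:.2f}")
```

If the per-group numbers diverge, that's your cue to revisit your training data and model choices from the earlier sections before shipping.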
🎯Conclusion: Test, Test, and Test Some More!
The only way to be sure of your outcomes is to test them! As we've discussed, you should test both your model's outputs and your product's outcomes. Prioritize users and scenarios that will ensure you test your product across a diversity of experiences and perspectives. While it may feel easier to not deal with sensitive data points altogether, explicitly testing for racial disparity is the only way to know whether or not your efforts have been worthwhile. If you aren't sure the efforts in previous sections are worth it, introduce new test cases and see what you find.
Now, let's look at how your product is actually implemented.