Why is it important to consider whether an assessment is biased against a sub-group?

Developed By: National Center on Intensive Intervention

In this video, John M. Hintze, Professor in the Department of Student Development at the University of Massachusetts Amherst, explains why it is important to consider whether an assessment is biased against a specific sub-group.

Question: Why is it important to consider whether an assessment is biased against a sub-group?

Answer: That’s a difficult question and one that is not easily answered, but I think there are probably three helpful factors that practitioners or users can think about when trying to weigh the evidence and make a decision on that. The three things that I typically think about are:

  1. Are there any legal or regulatory statutes that might preclude the use of a particular tool or instrument with a sub-group?
  2. What kind of statistical evidence do we have about the tool, and is there evidence that we may not want to use it for a particular purpose?
  3. The third is what I usually refer to as adverse impact factors.

I’ll describe each of those pretty quickly. Most people have some experience with the first notion, regulatory or legal issues. For example, the reauthorization of PL 94-142 back in 1994 was really the first time that we saw verbiage in the federal code telling us that we needed to use multiple instruments in order to safeguard against any kind of bias against certain people or groups. Another example people typically encounter somewhere in their training is the Larry P. v. Riles case, a class-action suit in California. The result was that practitioners couldn’t use cognitive or IQ tests with African American boys for the sole purpose of identifying mental retardation, as we called it at that time. So those are two examples of regulatory or legal factors that might either preclude or sway our decisions.

The second area, and this is an area where I think the Center does a really good job, is whether there is statistical or psychometric evidence that a tool is ill-suited for a particular sub-group or population. Typically what we look for there are psychometric properties, namely reliability and validity. We first want to make sure that the tool has adequate properties with respect to those issues. The second question is whether those psychometric properties are disaggregated for the categories that we might be interested in. For example, suppose we had a tool with an overall reliability of 0.9, but we were really interested in the reliability of the tool for girls versus boys. If we found a gap between those two levels of reliability, it might suggest that we don’t want to use that tool, or that we want to use it with caution, for making decisions for one group or the other.

The other kind of statistical evidence that we might want to look for is whether the tool provides differential prediction of another variable. For example, if we are using a tool or measure to predict performance on a second measure, and the tool either over-predicts or under-predicts performance on the second measure as a function of category, then that might again be evidence that the tool is biased for a particular sub-group or category of people. This is in line with what Arthur Jensen talked about with the use of IQ tests.
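One rough way to look for differential prediction is to fit a single prediction line to everyone and then check whether the errors lean in one direction for a particular group. The sketch below is only an illustration under stated assumptions: the function names (`fit_line`, `mean_residual_by_group`), the group labels, and all of the scores are made up for the example and do not come from the video or from any real tool.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def mean_residual_by_group(x, y, group):
    """Fit one common line, then average (observed - predicted) per group.

    A positive average means the common line under-predicts that group;
    a negative average means it over-predicts that group.
    """
    slope, intercept = fit_line(x, y)
    residuals = {}
    for xi, yi, gi in zip(x, y, group):
        residuals.setdefault(gi, []).append(yi - (intercept + slope * xi))
    return {g: mean(r) for g, r in residuals.items()}


# Hypothetical data: a screening score, a later outcome score, and a group label.
screen = [10, 20, 30, 40, 10, 20, 30, 40]
outcome = [22, 32, 42, 52, 18, 28, 38, 48]
group = ["A", "A", "A", "A", "B", "B", "B", "B"]

result = mean_residual_by_group(screen, outcome, group)
print(result)  # group A is under-predicted, group B is over-predicted
```

In this made-up data the common line misses group A low and group B high by the same amount. With real data the gaps would be noisier, and a formal analysis would normally test whether the difference is larger than chance.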

The third approach, and potentially the most important, is what I call adverse impact. What I mean by that is: do the results and scores on a test, and more importantly the interpretation of those scores, lead to certain categories or sub-groups being advantaged or disadvantaged as compared to each other? For example, if I were to take the average height of NBA players and compare it to the average height of all men in the nation, there would probably be a difference in those scores. Now, is the ruler in that case biased? Probably not; it probably represents the natural order of things. But if I then set criteria that advantaged certain people, such that only people who were at minimum six foot four could qualify to play in the NBA, then applying that standard to the observed scores would lead to people who are six foot four or above being advantaged in that situation. So the third important part is that it is not only whether there are score differences, and sometimes those score differences in and of themselves are important because they can point to potential bias in the way that the test was made, but also whether the way the scores are interpreted leads to one group being advantaged or disadvantaged as compared to another.

So, as a quick summary, the three things I would look for when trying to weigh my decision are: Are there any statutory, legal, or regulatory factors that might weigh into the decision? Is there statistical evidence that the test or tool is biased for or against a category of people? And lastly, can the scores and their interpretation be used in a way that advantages or disadvantages a certain group of people?
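The height example can be made concrete by applying one cutoff to two groups and comparing how many people in each group clear it. This is a minimal sketch, not a prescribed method: the function name `selection_rates`, the group labels, and the heights (in inches, with 76 inches standing in for six foot four) are all hypothetical values chosen for illustration.

```python
def selection_rates(scores_by_group, cutoff):
    """Share of each group whose score is at or above the cutoff."""
    return {
        g: sum(s >= cutoff for s in scores) / len(scores)
        for g, scores in scores_by_group.items()
    }


# Hypothetical heights in inches for two made-up groups.
heights_in = {
    "group_1": [70, 72, 75, 76, 78, 80, 77, 74],
    "group_2": [66, 68, 70, 71, 73, 76, 69, 72],
}

rates = selection_rates(heights_in, cutoff=76)  # 76 inches = six foot four
ratio = rates["group_2"] / rates["group_1"]

print(rates)  # proportion of each group that qualifies under the cutoff
print(ratio)  # how group_2's qualifying rate compares with group_1's
```

The ruler measures both groups the same way. The gap in qualifying rates appears only once the cutoff is applied, which is the distinction between score differences and the way scores are interpreted.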
