Technology in School

Can a Test Ever Be Fair? How Today's Standardized Tests Get Made.

By Stephen Noonoo     Jan 10, 2018

After politics and religion, few issues are as contentious as standardized tests. Opinions run the gamut. To some, standardized tests overwhelm our schools and flatten the differences between students. To others, they remain the best way to compare students objectively and hold schools accountable.

Whatever your thoughts, there’s no denying that students are taking lots of tests. Two years ago, U.S. students were taking about eight tests a year.

Predictably, a big business has sprung up around testing. Among those who have turned it into a livelihood are tutors, publishers like Pearson and Scholastic, and printers. Then there are the psychometricians—the math savants who design tests and create complex algorithms that attempt to make them fair to all students. Or, in other words: “How do we compare students who take different tests as if they had all taken the same test?” asks Mark Moulton.

Moulton has worked in the field for about 15 years with a small family-run company called Educational Data Systems, which makes exams for local districts and high-stakes tests for the state of California. He recently spoke with us about testing bias, holding psychometricians accountable for the exams they create and whether the future holds any innovation for a field still dominated by math, language and multiple-choice questions.

EdSurge: What might make a test unfair?

Mark Moulton: The whole question here is, what does fair mean? Getting a low score on a test doesn’t mean that the test is unfair. Even if one ethnic group as a whole got a lower score than another ethnic group, it doesn’t mean the test was unfair. What you want is a test where the items don’t play favorites. They test one trait and one trait only, which is the same for everybody.

If I’m giving a math test and the test includes a bunch of word problems—and it turns out that a third of my test takers don’t know English—then in effect, for those word problems, I’m testing their English ability, not their math ability. That’s an unfair test.

The goal is to clean up the test in such a way that the population you’re aiming at is going to be tested on the same thing and nothing else. That’s the goal. Of course, it’s an ideal, which is never reached entirely.

How does the process of stripping out bias work?

OK, say you’re designing a test for the state of California. You pick your subject area, let’s say it’s language, and you have a set of standards, like the Common Core standards. These are statements about what we want to find out about our students. They specify what good language ability means.

Then the state hands off to a vendor who writes a test. The vendor tries to write items that are responses to those standards. So they write a bunch of test items, or questions, of different kinds.

Afterward, you run it by a panel to see if there is any detectable bias just by reading the questions. A question where boys are likely to know the answer, but girls are not, for example, or where Hispanics are likely to know the answer, but Asians are not.

Then you give a pilot test. Here it starts going into a psychometric mode where you get some test data back, and then you do a psychometric analysis and look for what’s called “differential item functioning.” Psychometric analysis computes the difficulty level of every question. Differential item functioning looks to see if a given test question has a different difficulty for one group of people over another group.

If you find that your question on skateboarding is one that boys find to be an easy question, but girls find to be a hard question, that’ll pop up as a statistic. Differential item functioning will flag that question as problematic.
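Conceptually, that screening step can be sketched in a few lines of code. This is a toy illustration, not any vendor’s actual pipeline — real DIF analysis typically uses statistics like Mantel-Haenszel — and the function name and flagging threshold are invented for the example:

```python
from collections import defaultdict

def dif_flag(responses, threshold=0.15):
    """responses: list of (group, total_score, item_correct) tuples,
    where item_correct is 1 or 0 for one particular test question."""
    # Stratify by total score so we compare equally able students:
    # an overall gap between groups could just reflect ability.
    strata = defaultdict(lambda: defaultdict(list))
    for group, total, correct in responses:
        strata[total][group].append(correct)

    gaps = []
    for total, by_group in strata.items():
        if len(by_group) < 2:
            continue  # need both groups at this ability level
        rates = [sum(v) / len(v) for v in by_group.values()]
        gaps.append(max(rates) - min(rates))

    # Flag if, averaged over ability strata, one group finds the
    # item substantially easier than equally able peers in another.
    avg_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return avg_gap > threshold, avg_gap
```

In the skateboarding example, boys and girls with the same total score would show very different success rates on that one item, and the gap statistic would pop above the threshold.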

What’s cool about psychometrics is that it will flag stuff that a human would never be able to notice. I remember a science test that had been developed in California and it asked about earthquakes. But the question was later used in a test that was administered in New England. When you try to analyze the New England kids with the California kids, you would get a differential item functioning flag because the California kids were all over the subject of earthquakes, and the kids in Vermont had no idea about earthquakes.

How do you control for the fact that it’s humans, like you, that are designing this algorithm, and you have biases yourself? Do you take pains to make sure that the people designing these algorithms are a cross section of different ethnicities and cultures?

Yes, God help us all! There comes a point where you’re looking over your own shoulder all the time trying to flag your own biases. And it’s very difficult to do. It’s not like there’s a ton of psychometricians around. You take what you can get.

In terms of the mathematical part of looking for bias, there the algorithm is based on some very simple but very powerful models that tell you if fairness occurred and give you some pointers as to what you need to do to fix it if fairness did not occur. For that part, it doesn’t matter who’s pushing the button on the computer. The answer will come up the same, so that part’s fine. But things like the alignment part—is this test measuring what we think it’s measuring?—those are places where our own biases as test writers, or psychometricians, sneak in, and pure quantitative analysis is not enough to flag them.

Teachers and states and students, obviously, are held very accountable for these tests. Where do you see the accountability of psychometricians coming in? How are you held accountable for getting this right?

We handle, for instance, the CELDT exam for the state of California—the California English Language Development Test—which is a massive test. And the people we interact with are the people at the state, the California Department of Education, CDE. They have their own technical advisory group, and we get hauled in there on a regular basis, as we should, to explain the results of the tests. We try to explain if bad things have happened, what has happened, if some mistake has been made we have to figure out where that mistake occurred and go and fix it and make things right. It’s actually a very grueling and difficult process. We’re held accountable by the state. They check all of our work with eagle eyes and it’s because the state knows that everyone’s checking their work. They’re accountable to the court of public opinion. And other courts.

Accountability does ripple through the system. But even saying that, it’s still an ideal that is not fully met. I’ve not yet encountered a test that didn’t have problems, or where it felt like it was 100 percent responsive to what the community wanted.

It sounds like you give yourself a lot of latitude for making mistakes, but teachers don’t often get that same latitude when they’re held accountable for the tests that you create.

Oh, I know. Especially when teachers are held accountable for the test scores of their students. That’s a very scary position for a teacher to be in. And in an abstract way, it sort of makes sense that the employers of the teachers should have some way of deciding whether the teachers are doing a good job. But there’s a statistical sense where that’s problematic because these test scores have a large standard error around them, which means that it could be quite unfair to teachers. So teachers are sort of in the firing line in a way that we are not.
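To see why that standard error matters, here is a hypothetical sketch: it checks whether the gap between two reported scores is even distinguishable from measurement noise, assuming a known standard error of measurement. The function name and the normal-approximation cutoff are illustrative, not drawn from any real scoring system:

```python
import math

def indistinguishable(score_a, score_b, sem, z=1.96):
    """True if the score gap is within measurement noise.
    sem: standard error of measurement of a single score;
    z=1.96 gives an approximate 95 percent confidence band."""
    # The standard error of a difference of two independent
    # scores with the same SEM is sqrt(2) * SEM.
    se_diff = math.sqrt(2) * sem
    return abs(score_a - score_b) <= z * se_diff
```

With a standard error of 15 scale points, for example, a 10-point swing in a score is statistically indistinguishable from noise, while a 100-point swing is not — which is exactly why judging a teacher on a single year’s movement is risky.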

I do have to say our little company is on the firing lines too. Until one has gone through the sort of mind-melting despair of trying to explain why scores dipped in a certain year, and whether that was some mistake that you made—until you’ve gone through that period of horror and fear, you almost don’t know what accountability feels like. Every now and again you get it sufficiently wrong and it ends up on the front pages of the newspapers. It’s quite a mortifying experience, potentially.

Can you talk a little bit more about how mistakes slip into tests?

There are so many different ways for errors to sneak in. And so we have lots of processes to try to avoid them. But it can come in at different levels.

First of all, it could come in at the actual test administration phase. Suppose, when printing out the booklets, one of the test questions gets split across two pages. You’ve got millions of kids taking this test, and now one of those questions is fatally compromised because people aren’t able to see it in its entirety all at once. Suddenly all of the tables that we had put together to score that test are no longer valid. That’s a data collection error.

You could also have data formatting errors, where the data comes in but, when you sort it, everyone’s names get scrambled, so the wrong test scores are associated with a given ID and you think you’re testing one kid when you’re actually testing a different kid. That’s sort of a total nightmare.

And then there are the most fearsome errors of all: psychometric errors. Those are the hardest to diagnose and fix. For instance, it’s possible that an item got miscalibrated, where the difficulty of the item changed over time, so that an item or a group of items calibrated at one difficulty actually had another difficulty. That blows the comparability of the test across years. It can cause weird things to happen with the trend line for people’s scores. There’s a whole field of places where you can make a mistake during analysis as a psychometrician.
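A crude sketch of how one might screen for that kind of item drift, assuming Rasch difficulty estimates (in logits) are available from two administrations. The names and the half-logit tolerance are invented for illustration; real drift checks are considerably more careful:

```python
def drifted_items(calib_y1, calib_y2, tolerance=0.5):
    """Return items whose estimated difficulty moved more than
    `tolerance` logits between two calibrations. calib_y1 and
    calib_y2 map item IDs to Rasch difficulty estimates."""
    common = calib_y1.keys() & calib_y2.keys()  # anchor items
    return {item: calib_y2[item] - calib_y1[item]
            for item in common
            if abs(calib_y2[item] - calib_y1[item]) > tolerance}
```

Items that anchor the scale from year to year should keep roughly the same difficulty; any that shift badly would be candidates for the kind of miscalibration that wrecks year-over-year trend lines.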

Have you done a lot of work with computer adaptive testing? How is it changing the field?

In psychometrics, the information that you get about a student is maximized when the student has a 50 percent chance of getting that item right. Which, by the way, is a very unpleasant experience if you’re a test taker because it’s a coin toss whether or not you’re going to get it right. It’s a very stressful test to take. But that is what maximizes the information we get about that student.
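Under the Rasch model — one common psychometric model, though not necessarily the one any given test uses — that 50 percent sweet spot falls out of a short calculation: an item’s Fisher information is p(1 − p), which peaks when the probability of a correct answer is exactly one half, i.e. when item difficulty matches student ability:

```python
import math

def p_correct(theta, b):
    """Rasch model: probability a student of ability theta
    answers an item of difficulty b correctly (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of one dichotomous item; maximized
    at p = 0.5, where it equals 0.25."""
    p = p_correct(theta, b)
    return p * (1.0 - p)
```

An item two logits too hard (or too easy) for a student yields less than half the information of a perfectly matched one, which is the mathematical version of Moulton’s point about off-target questions telling you little.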

If you have a fixed test that’s not adaptive, it means that there are going to be a bunch of questions that are well out of the range of a given student. So that student is basically answering a bunch of questions that aren’t providing us with a lot of information about that student.

Computer adaptive testing fixes that. You start off with some initial idea of what the student’s ability is, you feed them some questions, and you say, “Ah, okay. The computer says this student is probably up the scale a bit in terms of their ability.” So now you feed them new questions that are more targeted at their ability. That allows you to refine that person’s score further along the scale. Then you give them a new batch of questions that they are likely to have a 50 percent chance of getting right.

If you keep doing that, you can get the same precision by giving a kid 15 items on a computer adaptive test that might have taken 40 items on a regular conventional test. Once you know that you’ve gotten that person’s score to a desired level of precision, then the test stops and everybody goes home.

So adaptive testing has some wonderful properties. It makes the test shorter. It makes the test more fair because it guarantees that everyone is being measured with the same degree of precision. It makes the test more secure because you can’t cheat by looking at your neighbor’s test.

That’s why the trend is toward adaptive testing. It’s a reasonable trend. But notice, by the way, it mainly works when your questions are right/wrong. If the questions are graded on a scale of one to 10, then it doesn’t provide much benefit.

Have you seen any pressure from states or publishers to measure more of the so-called “soft skills” like grit and resilience and other 21st century skills on standardized tests?

So far it hasn’t trickled down. In the academic areas, and with some vendors, there’s been a realization that there is this whole spectrum of skills and student characteristics that we need to be paying attention to. But I haven’t gotten a sense yet—and I could be wrong—that the testing industry is taking them seriously. The testing industry responds to state RFPs, requests for proposals. For the most part, states want to keep it simple. They want to keep it politically uncontroversial and so the kind of constructs that end up getting measured tend to be your basic math and language and that’s it.

This is a big problem in the field of education. It means that the other subject areas—history, art, music, physical education and the rest of it—get shunted off and underfunded, and students aren’t getting the well-rounded exposure that they really need to grow. And that’s not going to change until states find that A) it’s technically feasible to measure all these different areas, B) the public wants that, and C) there’s money available to do it. We really underfund education in this country. It doesn’t look like that’s going to change any time soon.

With ESSA giving states a bit more control, states have more opportunity to change assessments, but maybe they’re not there yet?

The thing is with Common Core, we still have an emphasis on math and language and, to a lesser degree, on science. The message is given out that those things are all that’s important, and we’re partly to blame for that.

One of the reasons math and language are so dominant in testing is that they’re among the few things we know how to test. Math and language can be mapped out as steady continua, constructs that grow continually through the grades, which makes it possible to actually come up with measures on those two constructs. It’s much harder, using the psychometric tools at our disposal, to come up with a measure for things like history, where every grade is its own story and the results of one grade are not comparable to another grade—like the Civil War in one year and European history in another year.

And as a consequence, because of the desire to test and hold schools accountable, it means that we bring psychometrics into the formula. And it turns out that we’re only good at measuring some types of content areas, not others, and that has had the effect of overly narrowing what states ask for in terms of their educational standards.

I’m a supporter of Common Core. And I think what Common Core is trying to do is to make it so that students are held to standards that will line them up for being able to go to college, and be able to think critically, be able to do more things than just answer multiple choice questions.

There’s also the Next Generation Science Standards, which are sort of Common Core applied to science. I see very interesting opportunities there because science really requires this huge mix of modalities. The designers of the Next Generation Science Standards recognized this. So they defined science not according to one construct, but as a mixture of the ability to master cross-cutting concepts, as they call them, as well as domain knowledge and science and engineering practice. Because you really need all of those if you’re gonna be prepared to do any kind of science in college.

To be able to build tests that actually deal with that mix of skills and expectations is, I think, a very good thing. It sort of provides a bridge to other content areas. But I also have to say, as a psychometrician who has worked in designing Next Generation Science Standards tests, it is a very difficult job to do well. The existing psychometric tools we’re using are not sufficient to do it well. That’s something where our field has to get its act together to be able to meet the needs of the future.
