A Better Way to Design Multiple-Choice Tests

Like many issues in education policy, student assessment tends to produce views that crystallize around a false dichotomy. Either our perpetually new and improving multiple-choice tests are the only feasible way to assess students and schools, or the tests reveal nothing—they merely stand in the way of alternative approaches that could easily and accurately evaluate students without making them loathe school.

But there is a range of more nuanced middle grounds between these two extremes. One of them keeps multiple-choice tests but has them do more to achieve the two main goals of assessment:

Provide students with the opportunity to reveal what they know;
Allow teachers to discover what students know.

In a new paper, a group of researchers led by Raymond Nickerson of Tufts University argues that multiple-choice tests can better achieve these goals through a small change in how test questions are answered and scored. The key is allowing students to assign weights to different answers based on how likely they think each is to be the best answer. For example, if a student believes two answers are equally likely to be correct, they could assign each a weight of .5. If the student believes one is slightly more likely to be right, they could assign that one .7 and the other .3.

Allowing students to report their confidence has long been discussed as a way to improve the diagnostic value of multiple-choice testing, but one problem is that the simplest method (multiplying the value of the question by the confidence in the correct answer) doesn’t provide the right incentives for revealing your true knowledge. Under this system students are better off reporting 100% confidence in the most likely answer, even if they are not 100% sure.
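To see why the simple method rewards overstating confidence, here is a minimal Python sketch. The function name and the example numbers are illustrative assumptions, not taken from the paper. It compares the expected credit for a student who honestly believes the answers are .7 and .3 likely with the expected credit for the same student reporting 1 and 0.

```python
def expected_proportion_score(belief, report):
    """Expected credit when credit equals the weight reported on the correct answer.

    belief: the student's true probabilities that each answer is correct.
    report: the weights the student actually writes down.
    """
    return sum(b * r for b, r in zip(belief, report))


belief = [0.7, 0.3]   # what the student really thinks (illustrative numbers)
honest = [0.7, 0.3]   # reporting the true belief
all_in = [1.0, 0.0]   # putting everything on the likelier answer

honest_score = expected_proportion_score(belief, honest)
all_in_score = expected_proportion_score(belief, all_in)

print(round(honest_score, 2))  # about 0.58
print(round(all_in_score, 2))  # about 0.70
```

Because expected credit is linear in the reported weights, shifting weight toward the likelier answer always helps, so the honest report is never the best one unless the student is already certain.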

Nickerson and his colleagues propose using a more complicated algorithm called the spherical-gain rule (SG). With the SG rule, students actually maximize their expected score by revealing their true confidence in each answer. (The gist of the rule is that it’s based on dividing the weight on the correct answer by the square root of the sum of squares of the confidence weights.) While the SG rule is not new, the proliferation of computers now makes it easy for teachers to use it and evaluate the increasingly detailed data it generates.
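The incentive property can be checked with a short sketch. It assumes the textbook form of the spherical scoring rule, in which the weight on the correct answer is divided by the square root of the sum of the squared weights; the paper's exact scaling may differ, and the function names here are hypothetical.

```python
import math


def spherical_score(report, correct_index):
    """Spherical-rule score for one question (assumed textbook form)."""
    norm = math.sqrt(sum(w * w for w in report))
    if norm == 0:
        raise ValueError("at least one weight must be positive")
    return report[correct_index] / norm


def expected_spherical_score(belief, report):
    """Average score over which answer turns out to be correct, weighted by belief."""
    return sum(b * spherical_score(report, i) for i, b in enumerate(belief))


belief = [0.7, 0.3]  # the student's true confidence (illustrative numbers)

honest = expected_spherical_score(belief, [0.7, 0.3])
all_in = expected_spherical_score(belief, [1.0, 0.0])
hedged = expected_spherical_score(belief, [0.5, 0.5])

print(round(honest, 3), round(all_in, 3), round(hedged, 3))
```

In this example the honest report has the highest expected score; both exaggerating to 1 and 0 and flattening to .5 and .5 do worse.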

The researchers tested the SG rule by giving 130 college students a 50-question multiple-choice test that focused on basic trivia. Half the students answered using the SG rule and half answered the standard way, by selecting the single response they deemed most likely to be correct. Students who reported their confidence weights also had their tests scored conventionally (i.e., full credit if the correct answer was assigned the highest confidence) and with proportion scoring (receiving 70% of the points for assigning a weight of .7 to the correct answer). The researchers found that while the distributions between the conventional and SG groups were similar, scores were higher using the SG rule. This suggests that using the rule did allow students to reveal additional knowledge.
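As a rough illustration of the three ways a weighted response can be scored, the sketch below scores a single answer sheet entry under each one. It is not the researchers' code: the function name is hypothetical, the spherical formula is the textbook version, and treating a tie for the highest weight as no conventional credit is an assumption the article does not address.

```python
import math


def score_response(weights, correct_index):
    """Score one question three ways from the same list of confidence weights."""
    top = max(weights)
    # Assumption: a tie for the highest weight earns no conventional credit.
    unique_top = weights[correct_index] == top and weights.count(top) == 1
    norm = math.sqrt(sum(w * w for w in weights))
    return {
        "conventional": 1.0 if unique_top else 0.0,
        "proportion": weights[correct_index],
        "spherical": weights[correct_index] / norm,
    }


# A student puts .7 on the correct answer and .3 on one distractor.
scores = score_response([0.7, 0.3, 0.0, 0.0], correct_index=0)
for method, value in scores.items():
    print(method, round(value, 3))
```

The same response earns full credit conventionally, 70% under proportion scoring, and a little over 90% under the spherical rule as assumed here.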

How can SG make testing deliver real, valuable insights? Imagine a situation where half the students answer a question correctly and the other half choose the same wrong answer. It’s possible that every student knew it was one of the two answers, and half merely guessed wrong. But it’s also possible students knew very little about the answers. That is, it may be that two answers were slightly more popular, but that students felt like all five choices were feasible. These outcomes reveal classrooms with drastically different knowledge, but under current conditions both teachers would see identical testing data.
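A small sketch with made-up data shows how the two classrooms separate once weights are collected. The class sizes, weights, and function names are all illustrative assumptions. Both classes split evenly between the same two top choices, but the average weights tell different stories.

```python
from collections import Counter


def top_choice(weights):
    """Index of the answer with the highest weight (what a conventional test records)."""
    return max(range(len(weights)), key=lambda i: weights[i])


def mean_weights(classroom):
    """Average weight the class placed on each answer."""
    n = len(classroom)
    return [sum(s[i] for s in classroom) / n for i in range(len(classroom[0]))]


# Class 1: everyone narrowed it down to the first two answers.
narrowed = [[0.55, 0.45, 0.0, 0.0, 0.0], [0.45, 0.55, 0.0, 0.0, 0.0]] * 10
# Class 2: everyone found all five answers nearly equally plausible.
unsure = [[0.24, 0.19, 0.19, 0.19, 0.19], [0.19, 0.24, 0.19, 0.19, 0.19]] * 10

tally_narrowed = Counter(top_choice(s) for s in narrowed)
tally_unsure = Counter(top_choice(s) for s in unsure)
print(tally_narrowed == tally_unsure)  # the conventional data match

print([round(w, 3) for w in mean_weights(narrowed)])
print([round(w, 3) for w in mean_weights(unsure)])
```

The tallies of top choices are identical, while the mean weights show one class concentrated on two answers and the other spread across all five.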

Consider another situation in which half the students choose the same wrong answer. If most students were 40%-50% confident in the wrong answer, it paints a different picture than one in which most students were 90% confident in the wrong answer. Again, having this additional information gives the teacher a better idea of what the students know.
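One way a teacher's tool might surface this difference is to average the weight that students placed on the wrong answer they favored. The sketch below is only an illustration with invented data and a hypothetical function name.

```python
def mean_confidence_in_wrong_answer(responses, wrong_index):
    """Average weight on one wrong answer, among students whose top choice it was."""
    picked = [
        r[wrong_index]
        for r in responses
        if max(range(len(r)), key=lambda i: r[i]) == wrong_index
    ]
    return sum(picked) / len(picked) if picked else 0.0


# Answer 0 is correct; answer 1 is the popular wrong answer (invented data).
hesitant_class = [[0.6, 0.3, 0.1, 0.0]] * 5 + [[0.35, 0.45, 0.2, 0.0]] * 5
confident_class = [[0.9, 0.1, 0.0, 0.0]] * 5 + [[0.1, 0.9, 0.0, 0.0]] * 5

hesitant = mean_confidence_in_wrong_answer(hesitant_class, wrong_index=1)
confident = mean_confidence_in_wrong_answer(confident_class, wrong_index=1)
print(round(hesitant, 2), round(confident, 2))
```

In both invented classes half the students favor the same wrong answer, but the second number is far higher, pointing to a firmly held misconception rather than a near guess.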

There’s an argument to be made that given the drama associated with testing, adding the task of assigning confidence weights is unreasonable. And that’s a fair argument; the additional complexity could cause more anxiety. But one could also argue that the anxiety of testing will decrease if students are sure that they’ll receive 50% of the credit if they’re 50% sure of the correct answer. There would no longer be a need to agonize over every single guess. Just put all your knowledge on the page and move on.

The nice thing about getting students to provide each answer with a confidence weight is that the data can be used however a teacher, school, or district wants to use it. Certain teachers may want to focus on the content that led to high confidence in incorrect answers, while some may want to focus on the content in which students were truly left guessing. Correlating these outcomes with particular types of lessons may also allow teachers to better understand which aspects of their teaching are more or less effective.

Using SG scoring would take some time and effort. Teachers would need to be trained and software would need to be built and installed. But this is something for which adequate technology exists right now. Scoring these tests and creating point-and-click tools for teachers to analyze and organize the data is not a heavy lift from a programming or design standpoint.

When talk turns to improving testing, the focus is entirely on trying to make the questions better. But let’s not forget that even if the test questions stay the same, there are still ways to wring more information out of their answers.