The Believability Barrier: Automated Essay Scoring

For a new technology to make it in the market it must hurdle three big barriers. One: Are potential customers aware of it? Two: Does the technology work? Three: Do potential customers actually believe that the technology works?

Many times, the last is the hardest to overcome.

Just consider the case of poor, and poorly described, artificial intelligence applications. For a long time, the de facto definition of AI was essentially, “That stuff computers can’t do yet.”

Yet technologies stuck in this particular uncanny valley do occasionally manage to get free--if the conditions are right. And that may be positive news for edtech’s current believability-challenged poster child: automated essay scoring.

Rock ’em Sock ’em Robots

Automated (or “machine”) essay scoring isn’t new. A decade ago I worked with a Boulder, Colorado-based company when it was acquired by Pearson. At the time, that company had already gone far beyond the crude word-and-sentence-length counting approach of other so-called “intelligent” essay graders. I became comfortable, if not entirely conversant, with the concepts of natural language processing, n-dimensional word spaces, and contextual linguistic relationships.

In the ten years since, research into making automated essay scoring more accurate has exploded. The William and Flora Hewlett Foundation sponsored a competition that pitted nine scoring engines against each other. Another online public competition put roughly 150 teams to the test with their technological creations. The computer scientists at edX, best known for a platform for Massive Open Online Courses (MOOCs), are developing their own open-source automated system called Enhanced AI Scoring Engine (EASE).

Most, if not all, of these systems require training. Much as you might train your voice recognition software to better understand you, essay scoring software needs to see dozens or hundreds of human-scored essays in order to learn, for a particular “prompt” or question, what’s a good essay versus a bad one. The various competitions and outside analyses have found that, once trained, the best systems today are pretty equivalent to humans in scoring accuracy.
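For the curious, “scoring accuracy” in this context generally means agreement with human scorers on the same essays. One statistic commonly used for that kind of comparison is quadratic weighted kappa, which penalizes a big miss (a 1 versus a 5) far more than a near miss (a 4 versus a 5). What follows is only a rough sketch: the function name, the 1-to-5 scale, and the sample scores are all invented for illustration and do not come from any particular engine or study.

```python
from collections import Counter


def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Agreement between two lists of integer scores.

    1.0 means identical scores, 0.0 means no better than chance,
    and negative values mean worse than chance.
    """
    if len(human) != len(machine) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    if max_score <= min_score:
        raise ValueError("the scale needs at least two score points")
    if any(s < min_score or s > max_score for s in list(human) + list(machine)):
        raise ValueError("score outside the declared scale")

    n = len(human)
    span = max_score - min_score          # largest possible disagreement
    observed = Counter(zip(human, machine))  # how often each score pair occurred
    human_hist = Counter(human)
    machine_hist = Counter(machine)

    disagreement = 0.0  # weighted disagreement actually observed
    chance = 0.0        # weighted disagreement expected by chance alone
    for a in range(min_score, max_score + 1):
        for b in range(min_score, max_score + 1):
            weight = (a - b) ** 2 / span ** 2
            disagreement += weight * observed[(a, b)]
            chance += weight * human_hist[a] * machine_hist[b] / n

    if chance == 0:
        # Both scorers gave every essay the same single score.
        return 1.0
    return 1.0 - disagreement / chance


# Hypothetical scores for six essays on a 1-5 scale (made-up data).
human_scores = [4, 3, 5, 2, 4, 1]
machine_scores = [4, 3, 4, 2, 5, 1]
print(quadratic_weighted_kappa(human_scores, machine_scores, 1, 5))
```

The same calculation works for comparing two human scorers, which is the fair benchmark: the question is whether the machine agrees with a person about as well as a second person would.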

I’ve got your, I mean its, back

That doesn’t mean automated essay scoring should fully replace humans. But a well-oiled machine grader can work alongside people, each backing up the other: humans, to rescue an incredibly creative essay that defies straightforward evaluation or to provide deep feedback; computers, to flag a suddenly tired, inconsistent human scorer and to provide basic feedback.

(Personally, I might have preferred the latter for a school creative writing course in which my teacher rejected my science-fiction short because it wasn’t in a style of which he approved--a story I turned around and sold to a science-fiction magazine.)

Automated essay scoring is finally gaining traction. West Virginia has been using CTB/McGraw-Hill’s engine, Utah has applied Measurement Incorporated’s technology since 2010, and Florida plans to engage American Institutes for Research’s AutoScore for its new statewide writing assessment. PARCC reportedly is considering Pearson’s engine for its Common Core assessments. Typically, the automated system is a “second reader” alongside a human scorer; if the two disagree, the essay gets kicked upstairs to another human to review.
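The “second reader” arrangement is simple enough to sketch in a few lines. This is an illustration under stated assumptions, not any vendor’s or state’s actual rule: the names are hypothetical, the `tolerance` setting (how far apart two scores may be before they count as a disagreement) is my own addition, and reporting the human’s score when the two agree is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    final_score: Optional[int]   # None until a third reader resolves it
    needs_third_reader: bool


def second_reader_check(human_score, machine_score, tolerance=0):
    """Compare a human score with a machine score for one essay.

    tolerance is a hypothetical knob: 0 demands an exact match,
    1 would also accept scores that are one point apart.
    """
    if abs(human_score - machine_score) <= tolerance:
        # Assumption: when the two readers agree, report the human's score.
        return Decision(final_score=human_score, needs_third_reader=False)
    # Disagreement: leave the score open and send it to another human.
    return Decision(final_score=None, needs_third_reader=True)


print(second_reader_check(4, 4))   # the two readers agree
print(second_reader_check(4, 2))   # kicked upstairs
```

Note that the machine never gets the last word in this sketch. It can only confirm a human score or trigger another human look.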

And aside from tests, automated essay evaluation engines are increasingly used to encourage student writing practice under teacher supervision, in which teachers can turn certain feedback features on or off.

BABELing away

Yet the lone loud voice of Les Perelman still gets more attention than the advances made in the technology. Most recently, students at MIT (where Perelman is a former director of undergraduate writing) helped create the media-bait-friendly BABEL (Basic Automatic B.S. Essay Language) generator to “fool” an automated essay grader with verbose nonsense sentences. What the triumphant headlines ignored, but Perelman himself was honest enough to acknowledge, is that in this case the only automated scoring tool being fooled was Vantage Learning’s IntelliMetric, the single commercial engine Perelman admitted he was able to access.

It’s like judging all cheese quality on a block of Velveeta, because that’s all the store carries.

Perelman’s continued one-note (or one-engine) criticisms aside, there’s always understandable distrust of technologies that have the potential to replace people--even if, when used properly, they can backstop them and add strength to strength. This unease seems especially strong in the realm of learning, in which we’re not educating drones. We’re educating cuter and smaller iterations of ourselves.

Nudging tech out of the valley

How does this tech, or any edtech, surmount this believability barrier?

Well--first and most important--it has to work. As a former colleague once noted to me, nothing will kill a bad product faster than good marketing. Overpromising is a bad idea.

Then there has to be a nudge.

In higher education, online remote proctoring faced comparable challenges. There appeared to be a belief that nothing could be better than having a human in a physical room to monitor dozens or hundreds of students taking tests. That, despite lots of evidence that proctors don’t like to confront students, that it is far too easy to cheat in a crowd, and that the online proctor-to-student ratio is much lower.

Put test-takers in front of a camera? How is that better than an in-person presence? But Douglas Winneg of Software Secure said resistance to the concept changed about a year ago. It began to fade, he said, “not so much [due to] a specific event--rather a feeling of momentum.”

As colleges added more online courses, William Dorman, CEO of Kryterion, said the turning point in online proctoring may have been pressure from accrediting bodies that wanted security to go beyond passwords. That apparently raised remote proctoring’s profile overall, helped highlight differences in proctoring technology approaches, and led to a very human motivator: peer pressure. “As some entities started to use [online] proctoring,” Dorman noted, “others felt they needed to join in.”

It’s possible that’s where edtech is with automated essay scoring. There are critical caveats: the use has to be appropriate (to allow for more writing practice and feedback in instructional situations, and to support human scorers in assessment situations), it has to be applied cautiously and equitably (no dystopian future, please, in which poor schools get robo-graders instead of, rather than in addition to, human teachers), and only the best, not the cheapest, automated engines should be used.

Ultimately, humans should have control and the final say over any evaluation that has consequences. And any engines should only be used for the purpose for which they were designed and trained. But that is a separate matter from believing they’re as good as someone rushing to grade a pile of writing assignments in the current environment. I’d rather have deep, personal feedback from an expert instructor every time. But if that’s not guaranteed, I’ll take multiple sources of feedback, human or not.

If application is tightly defined, then maybe we can leap the belief barrier. And give a technology both rational and emotional acceptance.

Until, of course, the next cool edtech AI comes along.