Automated Essay Scoring Myths: Part 1

In educational institutions across the globe, there is an ongoing debate over the use of Automated Scoring systems.  I use the term "debate" rather loosely, as it often seems more like a clattering of voices, frequently from people completely unfamiliar with Automated Scoring.  The most contentious question is whether these systems should be used in the scoring of high-stakes tests.  At PaperRater we've sat on the sidelines and watched this discussion unfold, but we feel that now is a good time to add our two cents.  Today we are launching a blog series entitled "Automated Essay Scoring Myths".  This series will examine some of the myths surrounding this technology and, in the process, explain how it works.  We welcome your feedback.

Myth #1:  Computers Can NOT Grade as Well as Humans

This is one that I hear a lot both in print and when talking to people about Automated Scoring.  Just look at what some people are saying:

  • Les Perelman, former writing professor at MIT: "My main concern is that it doesn't work."
  • Mark Shermis, University of Akron researcher: "It can't tell you if you've made a good argument, or if you've made a good conclusion."
  • Diane Ravitch, research professor at NYU: "Computers can’t tell the difference between reasonable prose and bloated nonsense."
  • Petition of Human Readers: "current machine scoring of essays is not defensible, even when procedures pair human and computer raters."
Meanwhile, others are saying things that might suggest the opposite:
  • Mark Shermis, University of Akron researcher: "A few of them [AES systems] actually did better than human raters."
  • Sandra Foster, lead coordinator, West Virginia: "We are confident the scoring is very accurate."
  • Judy Park, Utah Associate Superintendent: "What we have found is the machines score probably more consistently than human scorers."
So which side is right?  Mark Shermis, the researcher quoted in both lists above, led a study that analyzed results from Automated Essay Scoring competitions sponsored by the Hewlett Foundation, in which several AES systems competed against one another.  A public competition followed, and the results were striking: both the private and public systems were able to score at or above the level of human readers.  The Shermis study can be found here.
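For readers who want to know what "at or above the level of humans" means concretely: studies like this typically report agreement statistics between machine and human scores, and one widely used statistic in AES research is quadratic weighted kappa, which rewards a system for landing on or near the human score and penalizes it more the farther away it lands.  The short Python sketch below shows a minimal version of that calculation; the score lists are invented purely for illustration.

    # Minimal sketch: measuring rater agreement with quadratic weighted kappa.
    # The score lists below are invented for illustration only.
    import numpy as np

    def quadratic_weighted_kappa(rater_a, rater_b, min_score=1, max_score=6):
        """Agreement between two raters on an ordinal scale (e.g., 1-6)."""
        n = max_score - min_score + 1
        a = np.asarray(rater_a) - min_score
        b = np.asarray(rater_b) - min_score

        # Observed matrix: how often each (score_a, score_b) pair occurs.
        observed = np.zeros((n, n))
        for i, j in zip(a, b):
            observed[i, j] += 1

        # Expected matrix under chance agreement (outer product of marginals).
        expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

        # Quadratic weights: larger disagreements are penalized more heavily.
        idx = np.arange(n)
        weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2

        return 1.0 - (weights * observed).sum() / (weights * expected).sum()

    human_scores = [4, 3, 5, 2, 4, 6, 3, 4]    # hypothetical human reader scores
    machine_scores = [4, 3, 4, 2, 5, 6, 3, 4]  # hypothetical machine scores
    print(round(quadratic_weighted_kappa(human_scores, machine_scores), 3))

A kappa of 1.0 would mean perfect agreement, while 0.0 would mean no better than chance; studies compare the machine-human kappa against the human-human kappa to judge whether the machine scores as well as a second human reader.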

Of greater weight than Shermis' study are the real-world results being seen in operational scoring.  During my time working for an AES vendor, I participated in the development of AES technology that was used as a "2nd reader" for a particular US state.  This means that a human reader scored each response and the computer offered a second score; if the two scores were substantially different, a third human reader set the final score.  Our system graded thousands of responses across more than a dozen prompts, spanning multiple grade levels, topics, and lengths.  Amazingly, the computer was more accurate than the human readers on EVERY prompt at EVERY grade level.  However, not every project was this successful.  One trial project for a particular country yielded results where the computer was slightly less accurate than the human readers on some traits, although still within reasonable measures of error.  Regardless, the message could not be clearer:  Even in its infancy, Automated Scoring technology is comparable to humans, and it is only going to get better.
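To make that workflow concrete, here is a short Python sketch of the second-reader logic described above.  The one-point disagreement threshold and the function names are my own assumptions for illustration; real programs define "substantially different" in their own scoring rules.

    # A short sketch of the "2nd reader" workflow described above.  The
    # one-point threshold and helper names are assumptions for illustration.

    def resolve_score(human_score, machine_score, adjudicate, max_gap=1):
        """Return the final score for one response.

        human_score   -- score from the first (human) reader
        machine_score -- score from the automated second reader
        adjudicate    -- callback that requests a third, human read
        max_gap       -- largest disagreement that does not trigger adjudication
        """
        if abs(human_score - machine_score) <= max_gap:
            # The two reads agree closely enough; keep the human score here
            # (programs may instead average or combine the two reads).
            return human_score
        # Substantial disagreement: a third human reader sets the final score.
        return adjudicate()

    # Example: the human gives a 2, the machine gives a 5, so a third
    # reader is called in and assigns the final score.
    final_score = resolve_score(2, 5, adjudicate=lambda: 4)
    print(final_score)  # -> 4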


Why All the Fuss?

From what I've read and the conversations that I've had, the issues and fears about the quality of computer grading stem from two points:

Humans and Computers Grade Differently

Anyone who has proofread a paper has an intuitive feel for how a human grades.  We start from the beginning and read through the paper looking for errors in mechanics.  We breathe in the words and take note of how they make us feel based on the expressions presented.  We grasp the subtleties (usually) and also take note of how well arguments are made and supported...

Computers share many of the same approaches to grading, but handle some things differently.  While they do scan the paper for errors in grammar and proper usage of vocabulary, they may estimate other qualities, such as the strength of a logical argument, by using statistical analyses of the presence of certain word sets or the similarity of a given response to other responses the computer has already seen graded.  This can make a skeptical audience a little uneasy, but the results show that it works.  If a simpler physics equation can accurately model a complex physical process, would you demand that an exact simulation be used instead?  No.  Similarly, research and real-world usage are showing that AI scoring systems are every bit as accurate as humans, even if their approach to grading is different.  Furthermore, emulating a human grader has its drawbacks, as discussed below.
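To give a concrete flavor of the "similarity to already-graded responses" idea, here is a toy Python sketch using off-the-shelf tools: represent essays as TF-IDF word vectors and predict a new essay's score from the scores of the graded essays it most resembles.  This is an illustration of the general technique only -- not PaperRater's engine or any vendor's actual model -- and the sample essays and scores are invented.

    # Toy sketch of similarity-based scoring: predict a score from the
    # already-graded responses that a new essay most resembles.  The essays
    # and scores are invented; this is not any vendor's actual model.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsRegressor

    graded_essays = [
        "The author argues convincingly that recycling reduces waste ...",
        "recycling is good because its good and people should do it ...",
        "A well-supported claim backed by two cited studies ...",
        "i dont know what to write so here are some words ...",
    ]
    human_scores = [5, 2, 6, 1]  # scores previously assigned by human readers

    # Represent each graded essay as a bag-of-words TF-IDF vector.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(graded_essays)

    # Score a new essay by averaging the scores of its nearest graded neighbors.
    model = KNeighborsRegressor(n_neighbors=2, metric="cosine")
    model.fit(X, human_scores)

    new_essay = ["The writer argues a clear claim and cites two studies on recycling ..."]
    predicted = model.predict(vectorizer.transform(new_essay))
    # The new essay shares vocabulary with the higher-scoring examples,
    # so the predicted score comes out high.
    print(round(float(predicted[0]), 1))

Real systems use far richer features and far more training data, but the underlying intuition is the same: a response that looks like previously graded strong responses is likely to deserve a similar score.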


Humans Do Not Grade as Well as You Think

This one may sting a little bit, but it has to be said.  Let me explain.  You may be picturing a teacher thoughtfully reading over an essay with a red pen -- pausing for a moment to scribble some wise advice in the margins and then continuing on.  That picture is all wrong when it comes to testing.  Graders, called "readers", are given only a few minutes to read an essay and assign a score, usually on a small scale (e.g., 1-6), and they must adhere to a rubric.  Considering these restrictions, they do remarkably well, but we mustn't forget Alexander Pope's poignant observation: "To err is human".  For all the criticism and fear that I've read regarding Automated Scoring systems, I'm amazed at how we hold ourselves in such high esteem.  I see the same response to the autonomous vehicles being developed.  I get the feeling that people do not realize that machines do not need to be perfect, just better than the comparable human.  Here are just some of the errors that human readers are prone to:

  • Bias.  We are easily influenced by things in the response that should not matter.  For example, something about the writer might remind us of a loved one and that affects the way we score their writing.  That is just one example, but there are many more.
  • Dynamic State = Mixed Output.  We are complex, chaotic systems, and that is a nightmare when it comes to scoring.  Computers win when it comes to being rational and consistent.  How is a score affected by a reader who is hungry?  Sad?  Sleepy?  Hungover?  Where is the public outcry over these human "machines" that offer different grades based on their ever-changing internal state?
  • Drift.  A number of psychologists and behavioral economists have studied the way that humans lack an objective measurement system.  Everything is comparative.  The perceived shade of a color is different when next to a darker vs lighter color.  The length of a line seems shorter if it is next to a longer line.  How well written does an average essay seem after having just read five poorly written essays in a row?
  • Egregious Inconsistency.  The previous two items deal with inconsistency, but this one deserves its own bullet point.  A computer will always give the same output for the same input.  This seems like an obvious and basic prerequisite for any grader; yet, for short answer responses, it is common for human graders to give different scores to the exact same answer.  Let me repeat that: it is common for human graders to give different scores to the exact same answer.  In fact, I once saw the same answer receive the absolute lowest score from one human reader and the absolute highest score from the other.  That, to me, is the epitome of a poorly designed grading system, and yet it is quite common for human readers.
  • Lack of Precision.  Humans are great at generalizing and connecting relations, but poor at making calculations with speed and accuracy.  Forcing a human to quickly grade an essay while adhering to a lengthy rubric is simply a mismatch with a human's innate capabilities.  Computers, on the other hand, are quite adept at scanning and processing information, tallying items, counting matches, and making calculations with both speed and accuracy.  This is a key advantage of a computer when a detailed rubric is used and time is limited.
The point of this section is not to bash us humans, but to offer a candid view of our flaws and to help us recognize that combining the different approaches of humans and computers offers the best path forward.  And this is precisely the approach that Automated Scoring systems are taking.

What About PaperRater?

I would like to end this article with some information on our own FREE Automated Essay Scoring engine, affectionately named Grendel.  Grendel is a general scoring system; unlike the systems used in high-stakes testing, it is not calibrated to specific prompts.  It is also designed for speed and limited resource usage, so its accuracy is below that of a human grader.  Nevertheless, we do plan to offer a more accurate system for premium users in the future.  In the meantime, Grendel offers a general score along with automated feedback on grammar, spelling, word choice, and much more.  We have received hundreds of emails from educators who are using PaperRater to let their students receive on-demand feedback before turning in their papers.  My favorite message came from an English teacher who said that PaperRater is the most useful tool she has used in 25 years of teaching.  We hope you will give it a try!