Tuesday, December 16, 2014

Automated Essay Scoring Myths: Part 1

In educational institutions across the globe, there is an ongoing debate over the use of Automated Scoring systems.  I use the term "debate" rather loosely, as it often seems more like a clamor of voices, frequently from people completely unfamiliar with Automated Scoring.  The most contentious question is whether these systems should be used in the scoring of high stakes tests.  At PaperRater we've sat on the sidelines and watched this discussion unfold, but feel that now might be a good time to add our two cents.  Today we are launching a blog series entitled "Automated Essay Scoring Myths".  This series will examine some of the myths surrounding this technology and explain how it works in the process.  We welcome your feedback.

Myth #1:  Computers Can NOT Grade as Well as Humans

This is one that I hear a lot both in print and when talking to people about Automated Scoring.  Just look at what some people are saying:

  • Les Perelman, former writing professor at MIT: "My main concern is that it doesn't work."
  • Mark Shermis, University of Akron researcher: "It can't tell you if you've made a good argument, or if you've made a good conclusion."
  • Diane Ravitch, research professor at NYU: "Computers can’t tell the difference between reasonable prose and bloated nonsense."
  • Petition of Human Readers: "current machine scoring of essays is not defensible, even when procedures pair human and computer raters."
Meanwhile, others are saying things that might suggest the opposite:
  • Mark Shermis, University of Akron researcher: "A few of them [AES systems] actually did better than human raters."
  • Sandra Foster, lead coordinator W. Virginia: "We are confident the scoring is very accurate."
  • Judy Park, Utah Associate Superintendent: "What we have found is the machines score probably more consistently than human scorers."
Which is the correct answer?  Mark Shermis, the researcher quoted in both sections above, offered a study that analyzed results from Automated Essay Scoring competitions sponsored by the Hewlett Foundation in which several AES systems competed against each other.  A public competition later followed, and the results were stunning.  Both private and public systems were able to score at or above the level of humans!  The Mark Shermis study can be found here.

Of greater weight than Shermis' study are the real-world results that are being seen.  During my time working for an AES vendor, I participated in the development of AES technology that was used as a "2nd reader" for a particular US state.  What this means is that a human reader scored each response and the computer offered a 2nd score.  If the 2 scores were substantially different, then a 3rd human would set the final score.  Our system graded thousands of responses across more than a dozen prompts and grade levels, on topics of varying subject matter and length.  Amazingly, the computer was more accurate than the human readers on EVERY prompt for EVERY grade level.  However, not every project was this successful.  One trial project for a particular country yielded results where the computer was slightly less accurate than the human readers on some traits, although within reasonable measures of error.  Regardless, the message could not be clearer:  even in its infancy, Automated Scoring technology is comparable to humans, and it is only going to get better.
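
For readers curious what the "2nd reader" workflow looks like in practice, here is a minimal sketch.  The gap threshold and the decision rule are illustrative assumptions on my part, not the actual policy used in that state's program:

```python
def resolve_score(human_score, machine_score, max_gap=1):
    """Second-reader adjudication sketch.

    If the human and machine scores agree within max_gap points,
    the human score stands and no third reader is needed.
    Otherwise the response is flagged for a third (human) reader
    to set the final score.  Threshold is an illustrative assumption.
    """
    if abs(human_score - machine_score) <= max_gap:
        return human_score, False  # accept; no third reader needed
    return None, True              # escalate to a third human reader

# Example: a 4 from the human and a 5 from the machine agree closely
# enough, while a 2 vs. a 5 triggers adjudication.
```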


Why All the Fuss?

From what I've read and the conversations that I've had, the issues and fears about the quality of computer grading stem from two points:

Humans and Computers Grade Differently

Anyone who has proofread a paper has an intuitive feel for how a human grades.  We start from the beginning and read through a paper looking for errors in mechanics.  We breathe in the words and take note of how the paper makes us feel based on the expressions presented.  We grasp the subtleties (usually) and also take note of how well arguments are made and supported...

Computers share many of the same approaches to grading, but handle some things differently.  While they do scan the paper for errors in grammar and proper usage of appropriate vocabulary, they may estimate other things, like the quality of logical arguments, by using statistical analyses of the presence of certain word sets, or the similarity of a given response to other responses that the computer has already seen graded.  This can make a skeptical audience a little uneasy, but the results show that it works.  If a simpler physics equation can accurately model a complex physical process, would you demand that an exact simulation be used instead?  No.  Similarly, research and real-world usage are showing that AI Scoring systems are every bit as accurate as humans, even if their approach to grading is different.  Furthermore, emulating a human grader has its drawbacks, as discussed below.
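
To make the similarity idea concrete, here is a toy sketch of scoring a new response by its bag-of-words similarity to responses that already carry human scores.  Real AES systems use far richer features; this simplified example is mine, not any vendor's actual algorithm:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def predict_score(response, graded):
    """Predict a score for `response` as a similarity-weighted average
    of scores from (text, score) pairs the system has already seen graded."""
    bag = Counter(response.lower().split())
    weights = [(cosine(bag, Counter(text.lower().split())), score)
               for text, score in graded]
    total = sum(w for w, _ in weights)
    return sum(w * s for w, s in weights) / total if total else None
```

In practice a system would add vocabulary, grammar, and structural features on top of this, but the core intuition is the same: responses that look like high-scoring responses tend to receive high scores.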


Humans Do Not Grade as Well as You Think

This one may sting a little bit, but it has to be said.  Let me explain.  You may be picturing a teacher thoughtfully reading over an essay with a red pen -- pausing for a moment to scribble some wise advice in the margins and then continuing on.  This picture is all wrong when it comes to testing.  Graders, called "readers", are given only a few minutes to read an essay and assign a score, usually on a small scale (e.g., 1-6), and they must adhere to a rubric.  Considering these restrictions, they do remarkably well, but we mustn't forget Alexander Pope's poignant observation: "To err is human".  For all the criticism and fears that I've read regarding Automated Scoring systems, I'm amazed at how we hold ourselves in such high esteem.  I see the same response to the autonomous vehicles being developed.  I get the feeling that people do not realize that machines do not need to be perfect, just better than the comparable human.  Here are just some of the errors that human readers are prone to:

  • Bias.  We are easily influenced by things in the response that should not matter.  For example, something about the writer might remind us of a loved one and that affects the way we score their writing.  That is just one example, but there are many more.
  • Dynamic State = Mixed Output.  We are a complex, chaotic system and this is a nightmare when it comes to scoring.  Computers win when it comes to being rational and consistent.  How is a score affected by a reader who is hungry?  Sad?  Sleepy?  Hungover?  Where is the public outcry over these human "machines" that offer different grades based on their ever-changing internal state?   
  • Drift.  A number of psychologists and behavioral economists have studied the way that humans lack an objective measurement system.  Everything is comparative.  The perceived shade of a color is different when next to a darker vs lighter color.  The length of a line seems shorter if it is next to a longer line.  How well written does an average essay seem after having just read five poorly written essays in a row?
  • Egregious Inconsistency.  The previous two items deal with inconsistency, but this deserves its own bullet point.  A computer will always give the same output for the same input.  This seems like an obvious and basic prerequisite for any grader; yet, for short answer responses, it is common for human graders to give different scores to the exact same answer.  Let me repeat that: it is common for human graders to give different scores to the exact same answer.  In fact, I once saw the same answer receive the absolute lowest score from one human reader and the absolute highest score from the other reader.  This seems to me the epitome of a poorly designed grading system, and yet it is quite common for human readers. 
  • Lack of Precision.  Humans are great at generalizing and connecting relations, but very poor at making calculations in terms of speed and accuracy.  Forcing a human to quickly grade an essay and adhere to a lengthy rubric is simply a mismatch of a human's innate capabilities.  Computers, on the other hand, are quite adept at scanning and processing information, tallying items, counting matches, and making calculations with both speed and accuracy.  This is a key advantage of a computer when a detailed rubric is used and time is limited.  
The point of this section is not to bash us humans, but to offer a candid view of our flaws and to help us recognize that combining the different approaches of humans and computers offers the best path forward.  And this is precisely the approach that Automated Scoring systems are taking.

What About PaperRater?

I would like to end this article with some information on our own FREE Automated Essay Scoring engine, affectionately named Grendel.  Grendel is a general scoring system that is not calibrated to specific prompts, unlike the systems used in high stakes testing.  It is also designed for speed and limited usage of resources, so the accuracy is below that of a human grader.  Nevertheless, we do plan on offering a more accurate system in the future for premium users.  In the meantime, Grendel offers a general score along with automated feedback on grammar, spelling, word choice, and much more.  We have received hundreds of emails from educators who are using PaperRater to allow their students to receive on-demand feedback before turning their papers in.  My favorite message came from an English teacher who said that PaperRater is the most useful tool that she has used in 25 years of teaching.  We hope you will give it a try!  






Wednesday, July 9, 2014

Even Easier to Use!


CAPTCHA Woes

One of the guiding principles at PaperRater is to make things simple and painless to use.  No signups, no logins, no payments, no three minute wait for results...  We fancied ourselves rather satisfactory in this regard.  That is, until you told us otherwise.  We were shocked to discover that you do not like squinting at images and typing in crooked letters.  And we were at least a small bit saddened when we heard that you do not share our joy in deciphering blurry house numbers.  So, it is with mixed emotions (and sarcasm) that we officially announce the end of reCAPTCHA for most users of our automated proofreading and plagiarism detection tools.

What exactly does this mean? 
A week ago, all users of our site were confronted with the dreaded reCAPTCHA before submitting text into our automated proofreader or our plagiarism checker.  As of today, it has been removed from these tools.  However...this does not mean that CAPTCHA is completely gone:

  • Other parts of the site may still use reCAPTCHA (e.g., contact form)
  • reCAPTCHA may still be displayed if you are suspected of spamming (either by the content you submit, or by the number of submissions coming from your IP address)
  • We may use a less annoying CAPTCHA in the future (one that is not reCAPTCHA), if needed
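
The per-IP trigger mentioned above can be sketched with a simple sliding-window rate limiter.  The window size and submission limit here are illustrative numbers I picked for the example, not our actual thresholds:

```python
import time
from collections import defaultdict, deque

WINDOW = 3600       # seconds in the sliding window; illustrative value
MAX_SUBMITS = 20    # submissions allowed per IP per window; illustrative value

recent = defaultdict(deque)  # ip -> timestamps of recent submissions

def needs_captcha(ip, now=None):
    """Return True only when an IP exceeds the submission rate limit,
    so ordinary visitors never see a CAPTCHA."""
    now = time.time() if now is None else now
    q = recent[ip]
    while q and now - q[0] > WINDOW:  # drop timestamps outside the window
        q.popleft()
    q.append(now)
    return len(q) > MAX_SUBMITS
```

A content-based spam check would sit alongside this, but the rate limit alone already lets the vast majority of human visitors through unchallenged.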

Why was reCAPTCHA used in the first place?
CAPTCHAs are used to identify a visitor as a human, rather than a bot.  Bots represent a definite problem for our site because 1) we are free, and 2) our services require a lot of computing power.  In other words, bots cost us more money than they do most sites.  Nevertheless, we have plans to defeat the bots without forcing most human visitors to enter a CAPTCHA.

Other news in usability
Perhaps not as celebrated as the end of reCAPTCHA...we have also decided to remove the title field.  For most users, we believe this field is unnecessary and just one more obstacle to a quick and painless submission process.

Thanks for reading this far!




Monday, November 4, 2013

Plagiarism Detection Changes

If you've been a regular user of PaperRater, then you may already be aware that we've been struggling with issues in the plagiarism detection module.  We first ran into problems when using a 2nd-tier search API that powered this feature, but we were able to switch to Google and restore service.  This was great for a time, but led to even worse problems as Google accidentally killed our subscription at one point, and, more recently, they have set a very low limit on API requests, which has caused our plagiarism check to have issues later in the day, while working flawlessly in the mornings.
After temporarily disabling the plagiarism check for the past few days, we are rolling it out again today with the Bing Search API powering it under the hood.  We hope this will yield better uptime, but we have already found bugs with their phrase queries, about which we've contacted them.  Feel free to contact us with any feedback regarding this rollout.  Thank you for the patience you've shown as we continue to work through these issues.  And please continue to spread the word about this free resource.  Our team is working hard to deliver a top-notch product that is accessible to all.  But all funds are currently devoted to development and operations, so we need your help to spread the word!  Linking to our website wouldn't hurt either.  :-)   Thanks!

UPDATE  Nov. 10, 2013:  We received a lot of feedback in the few days after this was posted (thank you!).  We responded to this feedback by making further enhancements to the plagiarism detection, which we released near the end of this week.  Results are not optimal, but the dissatisfaction rate has dropped significantly.  We will continue to roll out other enhancements to the plagiarism checker in the weeks ahead that should help address accuracy.



Wednesday, October 5, 2011

Printable Summary is Here

Most visitors to our site probably don't notice the small black, vertical tab titled "feedback" on the right side of the page.  This humble button wields a lot of power in determining where we focus our development efforts.  It allows users to vote on ideas or features that they would like to see in our service.  And, of course, you are welcome to create a new suggestion here as well.  For quite some time the request for a "Printable Summary Report" garnered many votes, so we are glad to announce that this has been incorporated into the website.  Just look for the "printable summary report" link under the analysis links.  The printable report includes most of the analysis results from the dynamic report (some were removed for brevity).  Thanks to all who requested this feature.  We welcome you to suggest and vote on new features that you'd like to see offered.

Wednesday, December 1, 2010

title validation - how not to write a title

We cleverly wrote our blog post title in all lowercase to highlight the fact that our PaperRater service now includes title validation. 
[loud clapping] 

What does this mean?

In the past, anyone could submit a paper with a terrible title -- too short, too long, not properly capitalized, etc. And yet there would be no word of advice from us.  Today, all this has changed, so beware if you plan on submitting papers with shoddy titles.  We've got our eye on you!

Wednesday, November 10, 2010

Automated Grading has Arrived

Automated grading of papers has been one of the most requested features and certainly the one that has had our engineers working the most hours.  Consider the difficulty in attempting to grade a paper when...

1) You do not know the assignment topic
2) You do not know the recommended length
3) You are a computer with limited knowledge of the meaning of words

Nevertheless, we've found the Auto Grader to be nearly as accurate as human graders for most papers.  We do note that this grade should be considered a partial grade as it incorporates grammar, spelling, word choice, and style, but not the author's arguments, logic, organization, and ideas.  The latter will still need to be examined by a human.
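
The partial grade described above combines several component scores.  As a rough sketch of that idea (the weights below are illustrative assumptions, not our actual formula):

```python
def partial_grade(grammar, spelling, word_choice, style,
                  weights=(0.3, 0.2, 0.25, 0.25)):
    """Combine component scores (each 0-100) into a partial grade.

    The components match those named in the post; the relative weights
    are illustrative placeholders, not PaperRater's real coefficients.
    Arguments, logic, organization, and ideas are deliberately absent --
    those still need a human reader.
    """
    parts = (grammar, spelling, word_choice, style)
    return sum(p * w for p, w in zip(parts, weights))
```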


We do hope some day to provide some insight into the technical side of this service.  However, for now we are busy adding more features that we hope you will love.  If you would like to suggest a feature, please click here.

Friday, October 22, 2010

New: Speedy Plagiarism Checker

Plagiarism detection has been included in the Paper Rater service from Day 1, but we recognize that some users would like to use the plagiarism checker by itself -- separate from the automated proofreading.  So, yesterday we quietly launched the standalone Plagiarism Checker as our response.  This tool quickly delivers an originality report without the other information provided by the grammar checker tool.

Why would someone want ONLY plagiarism detection?

Most students prefer to run a complete check of their papers including plagiarism, grammar, spelling, word choice, and style.  However, teachers are often interested in checking only the originality of the document.  The snappy response offered by the Plagiarism Checker gives them exactly what they need.


Comments are welcome...