Thursday, February 5, 2015

Automated Essay Scoring Myths: Part 2

Automated Scoring Myth #2:  Jobs Will Be Lost

This is our second article in this series, so please click here if you missed Part 1.  Note that I will be frequently using the abbreviation AES to refer to Automated Scoring / AI Scoring / Automated Essay Scoring.  Let's begin...

After making the case for the accuracy of Automated Essay Scoring (AES) systems in Part 1, it may seem that the natural consequence of AES would be the loss of jobs.  In particular, teachers and graders (called "readers") may feel vulnerable, as is evidenced by the petition created by a group of readers in 2013.  Each of these roles is worth examining separately since they are quite different in purpose and activities.

In Part 1, we discussed the role of AES in high stakes testing, but AES is also an important trend in the classroom. When I tell people about my work on AES, the response I get is sometimes to the effect of, "So, you are creating technology to replace teachers." This couldn't be further from the truth.

If 10% of a teacher's time is spent grading essays, would this enable us to have 10% fewer teachers?  It's not illogical to jump to that conclusion, but the math doesn't add up when it comes to teaching students how to become excellent writers. The reality is that writing is a craft that takes practice and feedback, just like any other skill.  But the time required to grade papers causes writing instructors to offer fewer writing assignments with less feedback than is optimal.

Enter AES -- a valuable tool that empowers teachers to give more writing assignments and allows students to receive more feedback. AES does not replace the teacher; it's just another tool that the teacher can use. In fact, it may be the best tool!  Some teachers have told us that they mandate usage of PaperRater by their students before the teacher even sets eyes on a paper. PaperRater takes care of checking grammar, spelling, word choice, and more, which frees the teacher up to help each student express themselves with clarity and develop their own distinct flair.

Readers (a.k.a. Graders)
The issue of jobs with regard to readers employed by testing institutions is a bit more opaque.  It is true that when a computer scores a response, that is one less response that will be scored by a human reader.  But that is not the whole story.  AES systems must be "trained" for each prompt that they are expected to grade, and this process requires that human readers score a number of responses (perhaps 600-2000).  The computer then uses this training set to build a model that it can use to score future responses.  This means that human readers are inextricably tied to the AES technology for each and every prompt.  Because of the expense associated with human readers, writing assignments have been excluded from most standardized tests that students take each year.  But, thanks to AES, this may be changing.  Large groups of school systems in the U.S. and abroad are evaluating AES technology and vendors with the intention of incorporating written assessments (short answer and essay) into standardized testing in a wide variety of subjects, from Biology to English Composition.  If successful, this will represent incredible demand for the scoring of written responses by both humans and computers.  Essentially, AES would be "growing the pie" rather than just taking pieces of the pie away from human readers.  So, it's my belief that AES will result in more jobs for human readers, rather than fewer.  However, I do concede that the future is much less clear in this area.
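To make the train-then-score cycle above concrete, here is a minimal sketch in Python. It is my own toy reconstruction, not any vendor's actual code: the features (word count, average word length) and the one-feature least-squares fit are invented stand-ins for the hundreds of features and larger models real systems use on their 600-2000 human-scored training responses.

```python
# Toy sketch of prompt-specific AES training: human readers score a
# training set, the machine fits a model on simple text features,
# then scores new responses for the same prompt.

def featurize(essay):
    """Toy features: word count and average word length."""
    words = essay.split()
    avg_len = sum(len(w) for w in words) / len(words)
    return [len(words), avg_len]

def fit(essays, human_scores):
    """Closed-form least-squares fit of score on word count only,
    standing in for a full multi-feature regression model."""
    xs = [featurize(e)[0] for e in essays]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(human_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, human_scores))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def predict(model, essay):
    slope, intercept = model
    return slope * featurize(essay)[0] + intercept

# Training set: responses already scored by human readers (1-6 scale).
train = [("short answer here", 2),
         ("a somewhat longer and more developed answer to the prompt", 4),
         ("a fully developed answer with many supporting details and a clear thesis statement included", 6)]
model = fit([e for e, s in train], [s for e, s in train])
score = predict(model, "a new unseen response of moderate length goes here")
```

The key point survives even in this caricature: every new prompt needs its own human-scored training set before the machine can grade at all.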

About PaperRater

As in Part 1, I am including a shameless plug for our free Automated Essay Scoring tool. Students and teachers appreciate the immediate feedback that they receive from PaperRater. You will not find another free tool that offers so many benefits including grammar check, spelling check, analysis of word choice, automated scoring, and plagiarism detection.  We hope you will give it a try!  

Tuesday, December 16, 2014

Automated Essay Scoring Myths: Part 1

In educational institutions across the globe, there is an ongoing debate over the use of Automated Scoring systems.  I use the term "debate" rather loosely, as it seems more like a clattering of voices at times, often from people completely unfamiliar with Automated Scoring.  The most contentious question is whether these systems should be used in the scoring of high stakes tests.  At PaperRater we've sat on the sidelines and watched this discussion unfold, but feel that now might be a good time to add our 2 cents.  Today we are launching a blog series entitled "Automated Essay Scoring Myths".  This series will examine some of the myths surrounding this technology and explain how it works in the process.  We welcome your feedback.

Myth #1:  Computers Can NOT Grade as Well as Humans

This is one that I hear a lot both in print and when talking to people about Automated Scoring.  Just look at what some people are saying:

  • Les Perelman, former writing professor at MIT: "My main concern is that it doesn't work."
  • Mark Shermis, University of Akron researcher: "It can't tell you if you've made a good argument, or if you've made a good conclusion."
  • Diane Ravitch, research professor at NYU: "Computers can’t tell the difference between reasonable prose and bloated nonsense."
  • Petition of Human Readers: "current machine scoring of essays is not defensible, even when procedures pair human and computer raters."
Meanwhile, others are saying things that might suggest the opposite:
  • Mark Shermis, University of Akron researcher: "A few of them [AES systems] actually did better than human raters."
  • Sandra Foster, lead coordinator W. Virginia: "We are confident the scoring is very accurate."
  • Judy Park, Utah Associate Superintendent: "What we have found is the machines score probably more consistently than human scorers."
Which is the correct answer?  Mark Shermis, the researcher quoted in both sections above, offered a study that analyzed results from Automated Essay Scoring competitions sponsored by the Hewlett Foundation in which several AES systems competed against each other.  A public competition later followed, and the results were stunning.  Both private and public systems were able to score at or above the level of humans!  The Mark Shermis study can be found here.

Of greater weight than Shermis' study are the real-world results that are being seen.  During my time working for an AES vendor, I participated in the development of AES technology that was used as a "2nd reader" for a particular US state.  What this means is that a human reader scored each response and the computer offered a 2nd score.  If the 2 scores were substantially different, then a 3rd human would set the final score.  Our system graded thousands of responses across more than a dozen prompts and grade levels, on topics of varying subjects and lengths.  Amazingly, the computer was more accurate than the human readers on EVERY prompt for EVERY grade level.  However, not every project was this successful.  One trial project for a particular country yielded results where the computer was slightly less accurate than the human readers on some traits, although within reasonable measures of error.  Regardless, the message could not be more clear:  even in its infancy, Automated Scoring technology is comparable to humans, and it is only going to get better.
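The "2nd reader" workflow described above can be sketched in a few lines. This is my own reconstruction for illustration (the disagreement threshold and the averaging policy are assumptions, not the state's actual rules):

```python
# Sketch of a human + machine "2nd reader" pipeline: a human scores
# each response, the machine scores it independently, and a 3rd
# human adjudicates only when the first two disagree substantially.

def final_score(human_score, machine_score, resolve, threshold=1):
    """Return the final score; call `resolve` (a 3rd human reader)
    only when the two first readings differ by more than `threshold`."""
    if abs(human_score - machine_score) > threshold:
        return resolve()
    # Close agreement: average the two readings (one common policy).
    return (human_score + machine_score) / 2

# Scores agree within 1 point: no adjudication needed.
print(final_score(4, 5, resolve=lambda: 4))   # 4.5
# Scores differ by 3 points: a 3rd human sets the final score.
print(final_score(2, 5, resolve=lambda: 3))   # 3
```

The appeal of this design is that the expensive 3rd reading is only triggered for the small fraction of responses where human and machine genuinely disagree.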

Why All the Fuss?

From what I've read and the conversations that I've had, the issues and fears about the quality of computer grading stem from two points:

Humans and Computers Grade Differently

Anyone who has proofread a paper has an intuitive feel for how a human grades.  We start from the beginning and read through a paper looking for errors in mechanics.  We breathe in the words and take note of how they make us feel based on the expressions presented.  We grasp the subtleties (usually) and also take note of how well arguments are made and supported...

Computers share many of the same approaches to grading, but handle some things differently.  While they do scan the paper for errors in grammar and proper usage of appropriate vocabulary, they may estimate other things, like the strength of logical arguments, by using statistical analyses of the presence of certain word sets or the similarity of a given response to responses that the computer has already seen graded.  This can make a skeptical audience a little uneasy, but the results show that it works.  If a simpler physics equation can accurately model a complex physical process, would you demand that an exact simulation be used instead?  No.  Similarly, research and real-world usage are showing that AI Scoring systems are every bit as accurate as humans, even if their approach to grading is different.  Furthermore, emulating a human grader has its drawbacks, as discussed below.
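The similarity-to-graded-responses idea mentioned above can be illustrated with a deliberately minimal example. This is not how any production engine works; it is a hypothetical nearest-neighbor sketch: represent each response as a bag of words and give a new response the score of the most similar already-graded one.

```python
# Similarity-based scoring sketch: a new response inherits the score
# of the most similar human-graded response, measured by cosine
# similarity between word-count vectors.
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def nearest_score(response, graded):
    """Score a response by copying the score of its nearest graded neighbor."""
    vec = Counter(response.lower().split())
    best = max(graded, key=lambda pair: cosine(vec, Counter(pair[0].lower().split())))
    return best[1]

graded = [("the experiment shows plants need light to grow", 5),
          ("plants are green and i like them", 2)]
print(nearest_score("light is needed for plants to grow well", graded))  # 5
```

Real engines blend many such statistical signals with explicit grammar and usage checks; the point here is only that "similarity to graded responses" is an ordinary, inspectable computation, not magic.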

Humans Do Not Grade as Well as You Think

This one may sting a little bit, but it has to be said.  Let me explain.  You may be picturing a teacher thoughtfully reading over an essay with a red pen -- pausing for a moment to scribble some wise advice in the margins and then continuing on.  This picture is all wrong when it comes to testing.  Graders, called "readers", are given only a few minutes to read an essay and assign a score, usually on a small scale (e.g., 1-6), and they must adhere to a rubric.  Considering these restrictions, they do remarkably well, but we mustn't forget Alexander Pope's poignant observation: "To err is human".  For all the criticism and fears that I've read regarding Automated Scoring systems, I'm amazed at how we hold ourselves in such high esteem.  I see the same response to the autonomous vehicles being developed.  I get the feeling that people do not realize that machines do not need to be perfect, just better than the comparable human.  Here are just some of the errors that human readers are prone to:

  • Bias.  We are easily influenced by things in the response that should not matter.  For example, something about the writer might remind us of a loved one and that affects the way we score their writing.  That is just one example, but there are many more.
  • Dynamic State = Mixed Output.  We are a complex, chaotic system and this is a nightmare when it comes to scoring.  Computers win when it comes to being rational and consistent.  How is a score affected by a reader who is hungry?  Sad?  Sleepy?  Hungover?  Where is the public outcry over these human "machines" that offer different grades based on their ever-changing internal state?   
  • Drift.  A number of psychologists and behavioral economists have studied the way that humans lack an objective measurement system.  Everything is comparative.  The perceived shade of a color is different when next to a darker vs lighter color.  The length of a line seems shorter if it is next to a longer line.  How well written does an average essay seem after having just read five poorly written essays in a row?
  • Egregious Inconsistency.  The previous two items deal with inconsistency, but this deserves its own bullet point.  A computer will always give the same output for the same input.  This seems like an obvious and basic prerequisite for any grader; yet, for short answer responses, it is common for human graders to give different scores to the exact same answer.  Let me repeat that: "It is common for human graders to give different scores to the exact same answer."  In fact, I once saw the same answer receive the absolute lowest score from one human reader and the absolute highest score from the other reader.  This seems to me to be the epitome of a poorly designed grading system, and yet it is something that is quite common for human readers. 
  • Lack of Precision.  Humans are great at generalizing and connecting relations, but very poor at making calculations in terms of speed and accuracy.  Forcing a human to quickly grade an essay and adhere to a lengthy rubric is simply a mismatch of a human's innate capabilities.  Computers, on the other hand, are quite adept at scanning and processing information, tallying items, counting matches, and making calculations with both speed and accuracy.  This is a key advantage of a computer when a detailed rubric is used and time is limited.  
The point of this section is not to bash us humans, but to offer a candid view of our flaws, and to help us recognize that combining the different approaches of humans and computers offers us the best path forward.  And this is precisely the approach that Automated Scoring systems are taking.
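The inconsistency point can be made concrete with a simple agreement check. The scores below are invented for illustration: a computer is trivially self-consistent, while two human readers often disagree on identical inputs.

```python
# Exact-agreement rate between two raters scoring the same responses.
# Testing programs track statistics like this (often alongside more
# sophisticated measures such as quadratic weighted kappa).

def exact_agreement(scores_a, scores_b):
    """Fraction of responses on which two raters gave the same score."""
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)

human_1 = [3, 4, 2, 5, 4, 1]   # first reader's scores (toy data)
human_2 = [3, 5, 2, 3, 4, 2]   # second reader, same six responses
print(exact_agreement(human_1, human_2))  # 0.5
```

A machine compared against itself scores 1.0 on this metric by definition; the interesting question, which the studies above address, is how it compares against a human when both are measured against another human.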

What About PaperRater?

I would like to end this article with some information on our own FREE Automated Essay Scoring engine, affectionately named Grendel.  Grendel is a general scoring system that is not calibrated to specific prompts the way the systems used in high stakes testing are.  It is also designed for speed and limited usage of resources, so its accuracy is below that of a human grader.  Nevertheless, we do plan on offering a more accurate system in the future for premium users.  In the meantime, Grendel offers a general score along with automated feedback on grammar, spelling, word choice, and much more.  We have received hundreds of emails from educators who are using PaperRater to allow their students to receive on-demand feedback before turning their papers in.  My favorite message came from an English teacher who said that PaperRater is the most useful tool that she has used in 25 years of teaching.  We hope you will give it a try!  

Wednesday, July 9, 2014

Even Easier to Use!


One of the guiding principles at PaperRater is to make things simple and painless to use.  No signups, no logins, no payments, no three minute wait for results...  We fancied ourselves rather satisfactory in this regard.  That is, until you told us otherwise.  We were shocked to discover that you do not like squinting at images and typing in crooked letters.  And we were at least a small bit saddened when we heard that you do not share our joy in deciphering blurry house numbers.  So, it is with mixed emotions (and sarcasm) that we officially announce the end of reCAPTCHA for most users of our automated proofreading and plagiarism detection tools.

What exactly does this mean? 
A week ago, all users of our site were confronted with the dreaded reCAPTCHA before submitting text into our automated proofreader or our plagiarism checker.  As of today, it has been removed from these tools.  However...this does not mean that CAPTCHA is completely gone:

  • Other parts of the site may still use reCAPTCHA (e.g., contact form)
  • reCAPTCHA may still be displayed if you are suspected of spamming (either by the content you submit, or by the number of submissions coming from your IP address)
  • We may use a less annoying CAPTCHA in the future (one that is not reCAPTCHA), if needed

Why was reCAPTCHA used in the first place?
CAPTCHAs are used to identify a visitor as a human, rather than a bot.  Bots represent a definite problem for our site because 1) we are free, and 2) our services require a lot of computing power.  In other words, bots cost us more money than they do most sites.  Nevertheless, we have plans to defeat the bots without forcing most human visitors to enter a CAPTCHA.

Other news in usability
Perhaps not as celebrated as the end of reCAPTCHA...we have also decided to remove the title field.  For most users, we believe this field is unnecessary and just one more obstacle to a quick and painless submission process.

Thanks for reading this far!

Monday, November 4, 2013

Plagiarism Detection Changes

If you've been a regular user of PaperRater, then you may already be aware that we've been struggling with issues in the plagiarism detection module.  We first ran into problems with the 2nd-tier search API that powered this feature, but we were able to switch to Google and restore service.  This was great for a time, but led to even worse problems: Google accidentally canceled our subscription at one point, and, more recently, they have set a very low limit on API requests, which has caused our plagiarism check to have issues later in the day while working flawlessly in the mornings.
After temporarily disabling the plagiarism check for the past few days, we are rolling it out again today with the Bing Search API powering it under the hood.  We hope this will yield better uptime, but we have already found bugs with their phrase queries, about which we've contacted them.  Feel free to contact us with any feedback regarding this rollout.  Thank you for the patience you've shown as we continue to work through these issues.  And please continue to spread the word about this free resource.  Our team is working hard to deliver a top-notch product that is accessible to all.  But all funds are currently devoted to development and operations, so we need your help to spread the word!  Linking to our website wouldn't hurt either.  :-)   Thanks!

UPDATE  Nov. 10, 2013:  We received a lot of feedback in the few days after this was posted (thank you!).  We responded to this feedback by making further enhancements to the plagiarism detection, which we released near the end of this week.  Results are not yet optimal, but the dissatisfaction rate has dropped significantly.  We will continue to roll out other enhancements to the plagiarism checker in the weeks ahead that should help address accuracy.

Wednesday, October 5, 2011

Printable Summary is Here

Most visitors to our site probably don't notice the small black, vertical tab titled "feedback" on the right side of the page.  This humble button wields a lot of power in determining where we focus our development efforts.  It allows users to vote on ideas or features that they would like to see in our service.  And, of course, you are welcome to create a new suggestion here as well.  For quite some time the request for a "Printable Summary Report" garnered many votes, so we are glad to announce that this has been incorporated into the website.  Just look for the "printable summary report" link under the analysis links.  The printable report includes most of the analysis results from the dynamic report (some were removed for brevity).  Thanks to all who requested this feature.  We welcome you to suggest and vote on new features that you'd like to see offered.

Wednesday, December 1, 2010

title validation - how not to write a title

We cleverly wrote our blog post title in all lowercase to highlight the fact that our PaperRater service now includes title validation. 
[loud clapping] 

What does this mean?

In the past, anyone could submit a paper with a terrible title -- too short, too long, not properly capitalized, etc. And yet there would be no word of advice from us.  Today, all this has changed, so beware if you plan on submitting papers with shoddy titles.  We've got our eye on you!

Wednesday, November 10, 2010

Automated Grading has Arrived

Automated grading of papers has been one of the most requested features and certainly the one that has had our engineers working the most hours.  Consider the difficulty in attempting to grade a paper when...

1) You do not know the assignment topic
2) You do not know the recommended length
3) You are a computer with limited knowledge of the meaning of words

Nevertheless, we've found the Auto Grader to be nearly as accurate as human graders for most papers.  We do note that this grade should be considered a partial grade, as it incorporates grammar, spelling, word choice, and style, but not the author's arguments, logic, organization, and ideas.  The latter will still need to be examined by a human.
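One way to picture a "partial grade" like this is as a weighted blend of mechanical sub-scores. The sketch below is purely illustrative: the sub-score names and weights are invented, not PaperRater's actual formula, and the human-judged dimensions (arguments, logic, organization, ideas) are deliberately left out.

```python
# Hypothetical partial grade: a weighted average of the mechanical
# sub-scores a machine can measure, leaving argument quality and
# organization to a human reader.

def partial_grade(subscores, weights):
    """Weighted average of 0-100 sub-scores."""
    total_w = sum(weights.values())
    return sum(subscores[k] * weights[k] for k in weights) / total_w

subscores = {"grammar": 88, "spelling": 95, "word_choice": 70, "style": 80}
weights   = {"grammar": 3, "spelling": 2, "word_choice": 2, "style": 3}
print(round(partial_grade(subscores, weights), 1))  # 83.4
```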

We do hope some day to provide information into the technical side of this service.  However, for now we are busy adding more features that we hope you will love.  If you would like to suggest a feature, please click here.