Monday, September 21, 2015

Anatomy of a Plagiarism Checker


If you've ever wondered how our plagiarism checker works on the inside or what our originality score means, then this article is required reading. The green plus icon and the "100% originality" label are a wonderful reassurance for writers who submit their work to our service, but what do they actually mean? Similarly, if you receive an originality score of 70%, should you be concerned? Each of these questions will be answered as we take a look under the hood of our plagiarism detector.


A Daunting Task

When we say that we are checking for plagiarism, we are attempting to discover whether portions of the given text might have been taken from other, previously written texts. Fifty years ago, a teacher might have been concerned that a work was taken from a book, from another student in the same class, or from a class years earlier. Today, there are many reasons to want a measure of a text's originality, and the task is as daunting as ever. Our ability to process text has improved dramatically over those fifty years, but so has the availability of text to would-be plagiarizers. In fact, the text that is publicly available on the Internet already runs into the trillions of pages and continues to grow exponentially. The task of plagiarism detection is all about finding the proverbial needle in the haystack.

How It's Done

For the curious, precise details of storing and rapidly searching massive amounts of text can be found in the field of Information Retrieval. For our purposes, high-level details will suffice. As stated earlier, we wish to search trillions of documents efficiently, so we turn to the companies that already do exactly this: search engines. Google, Microsoft, and Yahoo maintain the software and the thousands of computers necessary to track, store, and search the rapidly growing index of Internet content. They offer us the ability to search that content via an API. By using their search APIs, we tap into their vast data stores without the overhead of attempting to crawl the entire Internet ourselves.

Specifically, when a document comes into our plagiarism detection service, we chop it up into small snippets of text and run a sample of those snippets through the search APIs. Consider the following snippets pulled from a paper on Abraham Lincoln:

  • Lincoln grew up on the western frontier in Kentucky
  • confronted Radical Republicans, who demanded harsher treatment of the South
  • remaining land he held in Kentucky in
  • became an able and successful lawyer with a reputation as a formidable
  • compensation for the owners, enforcement to capture fugitive slaves
  • ....
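
The chopping and sampling step can be sketched in a few lines of Python. This is only an illustration, not PaperRater's actual code: the function names, the fixed snippet length of eight words, and the uniform random sampling are all assumptions made for the example.

```python
import random


def make_snippets(text, words_per_snippet=8):
    """Split text into consecutive, non-overlapping runs of words.

    Hypothetical helper: the real service may pick snippet boundaries
    differently. A trailing run shorter than words_per_snippet is dropped.
    """
    words = text.split()
    last_start = len(words) - words_per_snippet
    return [
        " ".join(words[i:i + words_per_snippet])
        for i in range(0, last_start + 1, words_per_snippet)
    ]


def sample_snippets(snippets, k, seed=None):
    """Pick up to k snippets uniformly at random (assumed sampling strategy)."""
    if k >= len(snippets):
        return list(snippets)
    return random.Random(seed).sample(snippets, k)
```

Each sampled snippet would then be sent to a search API as an exact-phrase query. Sampling keeps the number of API calls bounded no matter how long the document is.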

Imagine pulling 100 snippets from this document and then running a Google phrase search on each of them. How many of the 100 snippets would match a document on the Internet? Since these particular excerpts come from the Wikipedia page on Abraham Lincoln, we would expect all (or nearly all) of them to have at least that page in the search results. If the excerpts instead came from a completely original source, we would expect the searches to come back with no matches (or perhaps a few false positives). This is approximately the approach taken by our plagiarism detection service. The originality score that you receive is represented by this simple formula:

1 - (Number of Searches with Matches / Total Number of Searches)

According to this formula, the originality is 0% when all of the searches have matches, which is exactly what we expect. Now this is a simplified overview of what is actually a much more complicated process, but it conveys a general appreciation of the methodology used at PaperRater.
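
As a quick illustration, the simplified formula above translates directly into Python. The function name and the list-of-booleans input are assumptions made for this sketch; the production scoring is, as noted, more involved.

```python
def originality_score(search_had_match):
    """Return 1 - (searches with matches / total searches).

    search_had_match is a list of booleans, one per snippet search,
    where True means the search returned at least one matching page.
    """
    total = len(search_had_match)
    if total == 0:
        raise ValueError("at least one search is required")
    matched = sum(1 for had_match in search_had_match if had_match)
    return 1 - matched / total


# Example: 100 searches, 30 of which found a matching page.
results = [True] * 30 + [False] * 70
print(f"Originality: {originality_score(results):.0%}")
```

Under this simplified formula, an originality score of 70% means that roughly 30 out of every 100 sampled snippets turned up a match somewhere on the Internet.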


Checking Against Past Submissions

One question we receive from time to time is whether past submissions are used in calculating the originality score. The answer is "No", but this deserves an explanation. Sites like TurnItIn bank previous submissions and check against these in addition to using search APIs. This raises concerns about false positives as well as privacy that we would rather avoid. Imagine submitting an original paper to our service before you turn it in, and then being accused of plagiarism when your teacher checks it with the same service one week later. Rest assured that PaperRater checks papers using only the search APIs.
