Monday, February 19, 2018

Recaptcha

A few months ago, I was voting for a charity in Texas to give it a chance to win a monetary award. In doing that, the website had a "reCaptcha" to ensure that the results were not impacted by bots (computerized robots) voting.
You may have seen these before.
I got to wondering about these items and how they worked... you see, the website allowed me to vote as many times as I wanted for the charity of my choice.
Initially, it would process a few moments, and then give me a check mark and I could cast my vote.
But, after about 10 votes, it would start asking me to do things:
Sometimes it would have me select which squares of a picture contained something specific that it was looking for... 

In this case, street signs - I've marked the ones that I think are street signs. By the way, I was wondering - what is a "street sign"? Sometimes the picture has a commercial sign (like a billboard) - is that a street sign? What about the street name signs? See the signs that are kinda fuzzy at the bottom left of this image? Are they street signs?

This is another type - in this case, a 9-pack of pictures out of which I am to pick something specific. On the computer, this is not too bad, but on a smartphone, I am hard-pressed to see what is in any of the pictures!

And this is another type - in this case, I was to select all images with "roads" - but instead of selecting and being done, once I had selected a picture, it would disappear and display a new picture - which may be a road and may not... so I had to check to see if all the remaining pictures had no road.
Sometimes when it gave me a picture challenge, I only had to do one picture challenge, and then I would get the green check and be able to cast my vote. Other times, it required me to do two picture challenges (I can't remember ever having to do more than two). I wondered if I got an extra challenge in a circumstance when it wasn't sure whether I was human or not based on my first responses... so I decided to do some research into how this technology works.

Wikipedia told me that reCAPTCHA
is a CAPTCHA-like system designed to establish that a computer user is human (normally in order to protect websites from bots) and, at the same time, assist in the digitization of books. reCAPTCHA was originally developed by Luis von Ahn, Ben Maurer, Colin McMillen, David Abraham and Manuel Blum at Carnegie Mellon University's main Pittsburgh campus. It was acquired by Google in September 2009. https://en.wikipedia.org/wiki/ReCAPTCHA

What is a CAPTCHA?
CAPTCHA (an acronym for "Completely Automated Public Turing test to tell Computers and Humans Apart") is a type of challenge-response test used in computing to determine whether or not the user is human. https://en.wikipedia.org/wiki/CAPTCHA
And what is a Turing test?
The Turing test, developed by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation is a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel such as a computer keyboard and screen so the result would not depend on the machine's ability to render words as speech. If the evaluator cannot reliably tell the machine from the human, the machine is said to have passed the test. The test does not check the ability to give correct answers to questions, only how closely answers resemble those a human would give. https://en.wikipedia.org/wiki/Turing_test
So, back to reCAPTCHA - did you notice that it said it assists in the digitization of books? The Wikipedia article went on:
reCAPTCHA has completed digitizing the archives of The New York Times and books from Google Books, as of 2011. The archive can be searched from the New York Times Article Archive, where more than 13 million articles in total have been archived, dating from 1851 to the present day. Through mass collaboration, reCAPTCHA was helping to digitize books that are too illegible to be scanned by computers, as well as translate books to different languages, as of 2015. https://en.wikipedia.org/wiki/ReCAPTCHA
Further information on its operation:
Scanned text is subjected to analysis by two different optical character recognition programs – one of them, as mentioned the project developer Ben Maurer, is ABBYY FineReader. Their respective outputs are then aligned with each other by standard string-matching algorithms and compared both to each other and to an English dictionary. Any word that is deciphered differently by both OCR programs or that is not in the English dictionary is marked as "suspicious" and converted into a CAPTCHA. The suspicious word is displayed, out of context, sometimes along with a control word already known. If the human types the control word correctly, then the response to the questionable word is accepted as probably valid. If enough users were to correctly type the control word, but incorrectly type the second word which OCR had failed to recognize, then the digital version of documents could end up containing the incorrect word. The identification performed by each OCR program is given a value of 0.5 points, and each interpretation by a human is given a full point. Once a given identification hits 2.5 points, the word is considered valid. Those words that are consistently given a single identity by human judges are later recycled as control words. If the first three guesses match each other but do not match either of the OCRs, they are considered a correct answer, and the word becomes a control word. When six users reject a word before any correct spelling is chosen, the word is discarded as unreadable.
It also included this criticism:
Some have criticized Google for using reCAPTCHA as a source of unpaid labor. They say Google is unfairly using people around the world to help it transcribe books, addresses, and newspapers without any compensation.
I think it is rather clever to allow the use of a facility that helps to prevent computer robots from hacking systems to also help with the digitization of books and articles!

Google's website indicates:
reCAPTCHA is a free service that protects your website from spam and abuse. reCAPTCHA uses an advanced risk analysis engine and adaptive CAPTCHAs to keep automated software from engaging in abusive activities on your site. It does this while letting your valid users pass through with ease. 
reCAPTCHA offers more than just spam protection. Every time our CAPTCHAs are solved, that human effort helps digitize text, annotate images, and build machine learning datasets. This in turn helps preserve books, improve maps, and solve hard AI problems. https://www.google.com/recaptcha/intro/android.html
Other information on their website indicated that the reCAPTCHA image annotation helps for google map users (of which we are very frequent ones!). So... I guess my validation using reCAPTCHA that I am human ends up helping me as a human with google map content as well!

No comments:

Post a Comment