Without a doubt, at some point in your Internet-browsing life you’ve been making a purchase or filling out a form when a security tool has popped up, asking you to enter a series of words. The words are presented to you with the letters contorted and masked under scratches and spots.
These tools are called CAPTCHAs – Completely Automated Public Turing tests to tell Computers and Humans Apart. Rudimentary versions appeared in the late 90s, before the test took its current form in 2000 thanks to a team at Carnegie Mellon University, and it quickly became the most common safeguard against spammers and ‘bots’ on the net.
The theory behind CAPTCHA is simple (though those who have struggled to tell the difference between a lower-case l and an upper-case I might say otherwise): humans, in general, have no trouble recognising the characters they have to input. Computers, on the other hand, can’t tell the letters apart from the elements included to obscure them.
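To make that concrete, here is a minimal sketch of the kind of obfuscation a classic text CAPTCHA applies – rendering a word, then layering random scratches and spots over it. It assumes the Pillow imaging library is installed; the word and the distortion parameters are purely illustrative, not taken from any real CAPTCHA implementation:

```python
import random
from PIL import Image, ImageDraw, ImageFont

def make_captcha(word: str, width: int = 200, height: int = 60) -> Image.Image:
    """Render `word` and obscure it with random scratches and spots."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)

    # Draw each letter at a slightly jittered position to 'contort' the word.
    font = ImageFont.load_default()
    x = 10
    for letter in word:
        y = random.randint(5, 25)
        draw.text((x, y), letter, font=font, fill="black")
        x += random.randint(15, 25)

    # Scratches: random lines crossing the text.
    for _ in range(6):
        start = (random.randint(0, width), random.randint(0, height))
        end = (random.randint(0, width), random.randint(0, height))
        draw.line([start, end], fill="grey", width=2)

    # Spots: random single-pixel noise sprinkled across the image.
    for _ in range(300):
        xy = (random.randint(0, width - 1), random.randint(0, height - 1))
        draw.point(xy, fill="grey")

    return img

make_captcha("example").save("captcha.png")
```

A human glancing at the result reads the word almost instantly; a naive character recogniser has to separate the letters from the noise first, and that is where it struggles.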
By 2011, 200 million CAPTCHAs were being completed every day, at roughly nine seconds per test. That equates to 500,000 hours – an incredible 57 years – of human effort expended every single day. And that output was going to waste.
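The arithmetic behind those figures is easy to verify with the article’s round numbers:

```python
# Back-of-the-envelope check of the figures quoted above.
captchas_per_day = 200_000_000
seconds_per_captcha = 9

hours_per_day = captchas_per_day * seconds_per_captcha / 3600
years_per_day = hours_per_day / 24 / 365

print(f"{hours_per_day:,.0f} hours of human effort per day")    # 500,000 hours
print(f"= {years_per_day:.0f} person-years, every single day")   # ~57 years
```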
Realising this, the team from Carnegie Mellon asked themselves a question: was there some kind of problem that could be broken down and distributed through CAPTCHAs to be solved nine seconds at a time?
So it was that they started on reCAPTCHA, a system that not only verifies a site’s user as a human, but puts them to use digitising books.
Here’s how the process goes (a rough code sketch follows the steps):
A book is manually scanned into the reCAPTCHA system, and then digitally sliced into one- or two-word fragments for use in the authentication procedure.
Each fragment is then paired with a traditional CAPTCHA featuring a word the system already recognises as correct.
Once the reCAPTCHA is presented to the user, they are asked to type both words. One of them – the word the system knows – verifies that the user is human; if it is typed correctly, the system treats the user’s reading of the other – the word from the book – as a trustworthy transcription.
The new word is sent to a group of users so that the system receives multiple confirmations that the word is right.
Then the cycle continues.
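As a rough illustration of that flow – a sketch of the pairing-and-consensus idea, not Google’s actual implementation – here is how the voting might look in Python. The threshold of matching answers needed before a word is accepted is an assumption for illustration:

```python
from collections import Counter
from typing import Optional

CONSENSUS_THRESHOLD = 3  # assumed number of matching answers; illustrative only

# answers[fragment_id] collects every transcription users have submitted for it.
answers: dict[str, Counter] = {}

def check_response(known_word: str, known_answer: str,
                   unknown_fragment_id: str, unknown_answer: str) -> bool:
    """Verify the user via the known word; if they pass, record their
    transcription of the unknown fragment as one 'vote'."""
    if known_answer.strip().lower() != known_word.lower():
        return False  # failed the human test; discard both answers

    votes = answers.setdefault(unknown_fragment_id, Counter())
    votes[unknown_answer.strip().lower()] += 1
    return True

def accepted_transcription(unknown_fragment_id: str) -> Optional[str]:
    """Return the fragment's transcription once enough users agree on it."""
    votes = answers.get(unknown_fragment_id, Counter())
    if votes:
        word, count = votes.most_common(1)[0]
        if count >= CONSENSUS_THRESHOLD:
            return word
    return None
```

The real deployment was more sophisticated – answers were also compared against and weighted by OCR output – but the voting logic above captures the core of how one unreadable scan becomes a confirmed word.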
reCAPTCHA was first rolled out in 2007, before being purchased by Google in 2009. By that time, 35,000 books had already been digitised, along with 20 years’ worth of New York Times publications.
Today, the latter’s entire catalogue from 1851 onwards has been digitised, and Google have expanded the system to digitise map data and improve machine learning.
Beyond the more immediate results, CAPTCHA is giving researchers new insight into the human mind. Thomas Hannagan is a cognitive scientist working on the Human Brain Project at Unicog, France. Working to discover how the human mind recognises words visually, Hannagan hopes to find a way to ‘fix’ the system in those who have difficulty interpreting written words, like those who suffer from dyslexia.
Most recently, Hannagan and his team trialled the CAPTCHA system with baboons to decipher whether the ability to translate visual imagery into words derives from an understanding of language, or from the visual system. In a report entitled Deep Learning of Orthographic Representations in Baboons, the findings were clear: “We have used deep learning networks to uncover the nature of the orthographic code which allows baboons to perform word-nonword classifications at a high level of accuracy”.
It’s revolutionary research – one of many lines of work still trying to interpret the orthographic system in humans – and one for which we have CAPTCHA to thank.