August 18, 2008

Old Texts Make New Web Security Possible

Many Web sites are making surfing more secure with the help of weathered books that are being digitized.

The new anti-spam campaign improves upon the method of forcing visitors to transcribe obscured words or characters before they get access to Web sites. Now, many sites are using text from old books and documents that have been scanned by character reading software.

The text can be read by humans, but the software is unable to recognize it.

These so-called Captchas (Completely Automated Public Turing test to tell Computers and Humans Apart) are widely used by Web sites to stop spammers from exploiting them to harvest information.

It is estimated that Captcha schemes are used about 100 million times every day.

Now, the new evolution in Captcha, named Recaptcha, uses words that optical character reading software has marked as unreadable by computers.

Luis von Ahn at Carnegie Mellon University in Pittsburgh created the Recaptcha project.

In some documents, where ink has faded and paper has yellowed, the character reading software can flag up to 20% of words as indecipherable.

These hard-to-read words are sent to sites along with a control word that aims to ensure the person answering is human.
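The pairing of a control word with a hard-to-read word can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Recaptcha team's code: the class name WordPairChecker, its methods, and the rule that three matching answers settle a word are all hypothetical and do not come from the article.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: how many matching human answers we want before
# treating an unknown word as resolved. Not taken from the article.
AGREEMENT_THRESHOLD = 3


def normalize(answer: str) -> str:
    """Compare answers without regard to case or surrounding spaces."""
    return answer.strip().lower()


class WordPairChecker:
    """Illustrative checker for a (control word, unknown word) challenge."""

    def __init__(self):
        # unknown word id -> tally of the transcriptions people have typed
        self.votes = defaultdict(Counter)

    def check(self, control_truth, control_answer, unknown_id, unknown_answer):
        """Return True if the control word was typed correctly.

        Only then is the answer for the unknown word counted as a vote.
        """
        if normalize(control_answer) != normalize(control_truth):
            return False
        self.votes[unknown_id][normalize(unknown_answer)] += 1
        return True

    def resolved(self, unknown_id):
        """Return the agreed transcription, or None if there is none yet."""
        tally = self.votes.get(unknown_id)
        if not tally:
            return None
        word, count = tally.most_common(1)[0]
        return word if count >= AGREEMENT_THRESHOLD else None
```

In this sketch the site only ever judges the visitor on the control word, whose answer is already known. The answer for the unreadable word is simply collected, and is trusted once enough visitors who passed the control word agree on it.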

Reporting in the journal Science, the Recaptcha team says the scheme is about 99.1% accurate - as good as professional transcribers and beyond the limit demanded by archivists.

About 40,000 sites have signed up to use words supplied by Recaptcha and it now collects about four million responses every day.

In the last year it has helped resolve more than 440 million words and has just helped to complete the conversion of the entire archive of the New York Times from 1908 into digital form.

