Much of the info in this post is from an Associated Press article I read at CNN: Web registration tool digitizes books
So y’know those CAPTCHA things? Where you’re registering for a website, or adding a comment to a blog, and you have some squiggly letters that you have to type in to prove that you’re a human and not a bot?
Well, in a quasi-twist on Folding@Home and other distributed computing applications, the folks at Carnegie Mellon University are working on a way that will put your CAPTCHA typing to good use. Let’s call it a distributed keyboarding application. 🙂 They’ve estimated that 60 million CAPTCHAs are typed in every day, at an estimated 10 seconds per CAPTCHA. Do the math and it comes out to roughly 166,667 hours/day spent typing these things in.
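For anyone who wants to check that figure, here is the arithmetic as a few lines of Python. The two inputs are the estimates quoted in the article; the variable names are just mine.

```python
# Estimates quoted in the article (not measured by me).
captchas_per_day = 60_000_000
seconds_per_captcha = 10

total_seconds = captchas_per_day * seconds_per_captcha
total_hours = total_seconds / 3600  # 3600 seconds in an hour

print(f"{total_hours:,.0f} hours/day")  # prints "166,667 hours/day"
```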
Meanwhile, over at the Internet Archive they’re busily scanning images of book pages for import via OCR. But Luis von Ahn, assistant professor of computer science at Carnegie Mellon and one of the developers of the CAPTCHA system, says that some books can’t be read by OCR systems, due to their age or the condition of the text.
So the new idea is to scan in the pages of these books, use software to break those images up into many tiny images each containing a word, and use these images as the CAPTCHA ‘test images’. Track the results as users type in the resulting word, and when enough of them agree, the computer accepts that this particular image represents this particular word. Over time, the text of an unscannable book will be rebuilt by people registering for web sites. They’re calling this the “reCAPTCHA” system.
And that’s where the article leaves off, but I’m still trying to figure out how this would work. If I’m sending out these unscannable images, how does the registration system know the user is typing in the right word? My best guess is that the article is wrong and the images aren’t of single words, but of pairs of words, one of which has been deciphered (or more likely, the CAPTCHA displayed to the user is 2 ‘words’ long, one of which is provided by the CAPTCHA system and the other is the unknown word). The ‘Turing test’ to see if it’s a real person only uses the first word. The second word is used by this new system to try to scan in books. If this is the case, we’re not really harnessing energy already being expended, but instead adding to the work done by CAPTCHA users.
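To make that two-word guess concrete, here is a minimal sketch of how such a check could work. This is purely my own illustration, not anything from the article or from Carnegie Mellon: the `check_pair` function, the `guesses` store, and the image IDs are all hypothetical names I made up.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory store: image id -> tally of what users typed for it.
guesses = defaultdict(Counter)

def normalize(text):
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()

def check_pair(known_answer, typed_known, unknown_image_id, typed_unknown):
    """Pass or fail on the known word only; record the other word as a vote."""
    if normalize(typed_known) != normalize(known_answer):
        # Failed the real test, so the second word isn't trusted either.
        return False
    guesses[unknown_image_id][normalize(typed_unknown)] += 1
    return True

# A user who gets the known word right contributes a vote for the unknown one.
print(check_pair("morning", "Morning", "img-001", "upon"))   # True
print(check_pair("morning", "mornin", "img-001", "opon"))    # False
print(guesses["img-001"])
```

Only the first word decides whether you get in; the second one is extra typing, which is exactly the added work I’m grumbling about.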
The only other system I can imagine is one where the CAPTCHA input is sent back to a central database in real time. As a new word/image goes out, it lets everyone in…the input test is in effect a bluff since there’s no data on what word the image represents. After, say, 500 people have responded to that word/image, the system starts to get a good idea of what the word is. At least it’ll be seeing some common letter positions at that point, and then it can start doing a pass/fail on the input from the user. Of course, using this method, a system that gets a ‘fresh’ image from the reCAPTCHA system isn’t really being protected from bots or spammers. On the other hand, the bot/spammer doesn’t know it’s a fresh image. (Do bots & spammers even try to spoof CAPTCHA systems, I wonder?)
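Here is a rough sketch of that bluff-then-enforce idea, again entirely my own guesswork. The 500 comes from my example above, the 60% agreement threshold is a number I pulled out of the air, and the `FreshImage` class is a hypothetical name. I have also simplified things by voting on whole words rather than on letter positions.

```python
from collections import Counter

MIN_RESPONSES = 500   # the "say, 500 people" figure from the paragraph above
MIN_AGREEMENT = 0.6   # hypothetical: share of votes the top answer needs

class FreshImage:
    """One word image whose text is unknown when it first goes out."""

    def __init__(self):
        self.votes = Counter()

    def consensus(self):
        """Return the agreed word, or None if there isn't enough data yet."""
        total = sum(self.votes.values())
        if total < MIN_RESPONSES:
            return None
        word, count = self.votes.most_common(1)[0]
        return word if count / total >= MIN_AGREEMENT else None

    def check(self, typed):
        answer = typed.strip().lower()
        accepted = self.consensus()
        if accepted is None:
            # The bluff: nothing to compare against, so record and let them in.
            self.votes[answer] += 1
            return True
        # Enough agreement now exists, so do a real pass/fail.
        return answer == accepted

image = FreshImage()
for _ in range(500):
    image.check("upon")        # everyone passes during the bluff phase
print(image.check("Upon "))    # True
print(image.check("open"))     # False
```

Notice that every response during the bluff phase passes, bots included, which is the security hole I mentioned.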
So, assuming that the much-smarter-than-me people at Carnegie Mellon haven’t come up with a better system, the new reCAPTCHA system either adds a bit to the workload of CAPTCHA users, or it slightly compromises the security of the systems using it. But in either case the drawbacks are pretty minimal, and the good work being done is pretty significant. I’m looking forward to the day the system gets put into practical use!
I remember years ago when the government office I worked for started using OCR, and it was difficult enough with documents, let alone books.
I’d have to hop over and read the article, but your theory on this creating more work for CAPTCHA users seems to make sense. I must not think far enough ahead because I am still used to the notion of paying people to do data entry (a job I once held). I understand all of the advanced programming now is supposed to pay off much further down the line. But I don’t like the idea of security compromises for the sake of saving time. I’m a dinosaur, I know.
And HEY! I love your new banner here. Did you create it?