Putting CAPTCHAs to good use

Much of the info in this post is from an Associated Press article I read at CNN: Web registration tool digitizes books

So y’know those CAPTCHA things? Where you’re registering for a website, or adding a comment to a blog, and you have some squiggly letters that you have to type in to prove that you’re a human and not a bot?

Well, in a quasi-twist on Folding@Home and other distributed computing applications, the folks at Carnegie Mellon University are working on a way that will put your CAPTCHA typing to good use. Let’s call it a distributed keyboarding application. 🙂 They’ve estimated that 60 million CAPTCHAs are typed in every day, at an estimated 10 seconds per CAPTCHA. Do the math and it comes out to about 166,667 hours/day spent typing these things in.
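For anyone who wants to check that figure, here is the arithmetic as a quick Python sketch. The two inputs are the estimates quoted in the article; the variable and function names are just mine.

```python
# Estimates quoted in the article (not measured by me).
CAPTCHAS_PER_DAY = 60_000_000
SECONDS_PER_CAPTCHA = 10
SECONDS_PER_HOUR = 3600


def hours_per_day(captchas_per_day, seconds_each):
    """Total human time spent on CAPTCHAs per day, in hours."""
    return captchas_per_day * seconds_each / SECONDS_PER_HOUR


total_hours = hours_per_day(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
print(round(total_hours))  # 166667
```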

Meanwhile, over at the Internet Archive they’re busily scanning images of book pages for import via OCR. But Luis von Ahn, assistant professor of computer science at Carnegie Mellon and one of the developers of the CAPTCHA system, says that some books can’t be read by OCR systems, due to their age or the condition of the text.

So the new idea is to scan in the pages of these books, use software to break those images up into many tiny images each containing a word, and use these images as the CAPTCHA ‘test images’. Track the results as users type in the resulting word, and when enough of them agree, the computer accepts that this particular image represents this particular word. Over time, the text of an unscannable book will be rebuilt by people registering for web sites. They’re calling this the “reCAPTCHA” system.
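The "when enough of them agree" step could look something like the sketch below. To be clear, the article doesn't say how agreement is measured, so the function name, the minimum of 3 answers, and the 75% share are all numbers I made up for illustration.

```python
from collections import Counter


def accepted_word(answers, min_votes=3, min_share=0.75):
    """Return the word most typists agree on, or None if there is no consensus yet.

    min_votes and min_share are hypothetical thresholds, not from the article.
    """
    # Ignore case and stray whitespace so "Thames" and "thames " count together.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if len(normalized) < min_votes:
        return None
    word, count = Counter(normalized).most_common(1)[0]
    if count / len(normalized) >= min_share:
        return word
    return None


print(accepted_word(["Thames", "thames", "Thames ", "thanes"]))  # thames
print(accepted_word(["boat", "goat"]))  # None
```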

And that’s where the article leaves off, but I’m still trying to figure out how this would work. If I’m sending out these unscannable images, how does the registration system know the user is typing in the right word? My best guess is that the article is wrong and the images aren’t of single words, but of pairs of words, one of which has been deciphered (or more likely, the CAPTCHA displayed to the user is 2 ‘words’ long, one of which is provided by the CAPTCHA system and the other is the unknown word). The ‘Turing test’ to see if it’s a real person only uses the first word. The second word is used by this new system to try to scan in books. If this is the case, we’re not really harnessing energy already being expended, but instead adding to the work done by CAPTCHA users.
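Here is my two-word guess written out as a rough Python sketch. This is only my speculation about how it might work, not anything the article describes, and every name in it is invented for the example.

```python
def check_two_word_captcha(known_answer, typed_known, typed_unknown, votes):
    """Pass or fail the user on the known word only.

    If they pass, their guess at the unknown word is appended to `votes`,
    a plain list of guesses collected for that unknown image.
    """
    passed = typed_known.strip().lower() == known_answer.strip().lower()
    if passed:
        # Only trust guesses from people who got the known word right.
        votes.append(typed_unknown.strip().lower())
    return passed


votes_for_image = []
print(check_two_word_captcha("river", "River", "Thames", votes_for_image))  # True
print(check_two_word_captcha("river", "rover", "xyzzy", votes_for_image))   # False
print(votes_for_image)  # ['thames']
```

The point of gating on the known word is that a bot's garbage never gets counted as a vote on the unknown one.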

The only other system I can imagine is one where the CAPTCHA input is sent back to a central database in real time. As a new word/image goes out, it lets everyone in…the input test is in effect a bluff since there’s no data on what word the image represents. After, say, 500 people have responded to that word/image, the system starts to get a good idea of what the word is. At least it’ll be seeing some common letter positions at that point, and then it can start doing a pass/fail on the input from the user. Of course, using this method, a system that gets a ‘fresh’ image from the reCAPTCHA system isn’t really being protected from bots or spammers. On the other hand, the bot/spammer doesn’t know it’s a fresh image. (Do bots & spammers even try to spoof CAPTCHA systems, I wonder?)
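A rough sketch of this "bluff" idea in Python, again pure guesswork on my part. I've simplified the "common letter positions" bit down to a plain majority vote on the whole word, and the names are hypothetical; the 500 is just the number I tossed out above.

```python
from collections import Counter

BLUFF_THRESHOLD = 500  # my "say, 500 people" figure, not a real system value


def check_fresh_image(typed, previous_answers, threshold=BLUFF_THRESHOLD):
    """Let everyone in until enough answers exist, then grade against the majority."""
    guess = typed.strip().lower()
    if len(previous_answers) < threshold:
        # Bluff phase: nothing to grade against, so record the guess and pass.
        previous_answers.append(guess)
        return True
    # Grading phase: compare against the most common answer seen so far.
    consensus, _ = Counter(previous_answers).most_common(1)[0]
    return guess == consensus


answers = []
for typed in ["Boat", "boat", "goat"]:
    check_fresh_image(typed, answers, threshold=3)  # all three pass (bluff)
print(check_fresh_image("boat", answers, threshold=3))  # True
print(check_fresh_image("goat", answers, threshold=3))  # False
```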

So, assuming that the much-smarter-than-me people at Carnegie Mellon haven’t come up with a better system, the new reCAPTCHA system either adds a bit to the workload of CAPTCHA users, or it slightly compromises the security of the systems using it. But in either case the drawbacks are pretty minimal, and the good work being done is pretty significant. I’m looking forward to the day the system gets put into practical use!

To Say Nothing of the Dog

I’ve been quite lax, to say the least, in my blogging. I finished Connie Willis’ To Say Nothing of the Dog quite some time ago, and never reported in. And in fact since then I’ve reread The Hobbit, but I won’t be reviewing that here since I’ve read it so many times that there’s no way I could approach it with any semblance of objectivity. But anyway, back to the Dog.

The only other Connie Willis book I’ve read is The Doomsday Book, which was about a future historian time traveling back to research the Black Death. I read it quite some time ago, but I remember it as being rather somber, as the topic would suggest. In To Say Nothing of the Dog we follow another time traveling historian, but this time out the tone is distinctly light-hearted.

The title here is a tribute to Jerome K. Jerome’s Three Men in a Boat: to Say Nothing of the Dog!, published in 1889. This is the story of, well, three men and a dog taking an excursion along the Thames. The hero of Willis’s book, Ned Henry, also ends up in a rowboat on the Thames and actually encounters Jerome’s trio.

And I’m telling you nothing about the actual book, am I? Aye, I’m a bit rusty.

Anyway, Ned Henry, historian, has been doing too much time traveling of late, resulting in a bad case of ‘time lag’ which leaves him generally confused. He is sent back to Victorian England for some R&R, but immediately gets caught up in ever more convoluted and silly adventures. Watching him try to navigate the social customs of the times and keep up with the hustle and bustle of the upper class, all without doing anything to corrupt the time stream, gets funnier and funnier as the book goes on.

Yeah, well, it’s been a good while since I finished it…so I’m doing a lousy job of explaining it. But I will say I really enjoyed it and plan on looking for more of Ms. Willis’ novels. She captures the feel of these historical times so … well, I was going to say accurately, but how should I know what it really felt like to be rowing down the Thames in 1889? But it *feels* accurate, and that’s good enough for me!