Putting CAPTCHAs to good use

Much of the info in this post is from an Associated Press article I read at CNN: Web registration tool digitizes books

So y’know those CAPTCHA things? Where you’re registering for a website, or adding a comment to a blog, and you have some squiggly letters that you have to type in to prove that you’re a human and not a bot?

Well, in a quasi-twist on Folding@Home and other distributed computing applications, the folks at Carnegie Mellon University are working on a way that will put your CAPTCHA typing to good use. Let’s call it a distributed keyboarding application. 🙂 They’ve estimated that 60 million CAPTCHAs are typed in every day, at an estimated 10 seconds per CAPTCHA. Do the math and it comes out to 166,667 hours/day spent typing these things in.
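If you want to check that math yourself, here's a quick sketch. The two figures are the estimates quoted above; the function and constant names are just mine.

```python
# Estimates quoted in the article: 60 million CAPTCHAs/day, 10 seconds each.
CAPTCHAS_PER_DAY = 60_000_000
SECONDS_PER_CAPTCHA = 10
SECONDS_PER_HOUR = 3600


def captcha_hours_per_day(count=CAPTCHAS_PER_DAY, seconds_each=SECONDS_PER_CAPTCHA):
    """Total human hours spent typing CAPTCHAs in one day."""
    return count * seconds_each / SECONDS_PER_HOUR


hours = captcha_hours_per_day()
print(f"{hours:,.0f} hours/day")  # prints "166,667 hours/day"
```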

Meanwhile, over at the Internet Archive they’re busily scanning images of book pages for import via OCR. But Luis von Ahn, assistant professor of computer science at Carnegie Mellon and one of the developers of the CAPTCHA system, says that some books can’t be read by OCR systems, due to their age or the condition of the text.

So the new idea is to scan in the pages of these books, use software to break those images up into many tiny images each containing a word, and use these images as the CAPTCHA ‘test images’. Track the results as users type in the resulting word, and when enough of them agree, the computer accepts that this particular image represents this particular word. Over time, the text of an unscannable book will be rebuilt by people registering for web sites. They’re calling this the “reCAPTCHA” system.
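The "when enough of them agree" part could look something like the sketch below. To be clear, this is my guess at the idea, not how reCAPTCHA actually does it: the article gives no numbers, so `min_votes` and `min_share` are made-up thresholds and `consensus_word` is a hypothetical helper.

```python
from collections import Counter


def consensus_word(answers, min_votes=5, min_share=0.8):
    """Return the agreed-upon word for one image, or None if there is no consensus yet.

    min_votes and min_share are invented thresholds, not values from the article.
    """
    # Ignore case and stray whitespace so "Whale " and "whale" count as the same answer.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if len(normalized) < min_votes:
        return None
    word, count = Counter(normalized).most_common(1)[0]
    if count / len(normalized) >= min_share:
        return word
    return None


typed = ["Whale", "whale", "whale ", "whale", "whale", "whole"]
print(consensus_word(typed))
```

Here five of the six users typed the same word, which clears the 80% bar I picked, so the image gets accepted as that word.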

And that’s where the article leaves off, but I’m still trying to figure out how this would work. If I’m sending out these unscannable images, how does the registration system know the user is typing in the right word? My best guess is that the article is wrong and the images aren’t of single words, but of pairs of words, one of which has been deciphered (or more likely, the CAPTCHA displayed to the user is 2 ‘words’ long, one of which is provided by the CAPTCHA system and the other is the unknown word). The ‘Turing test’ to see if it’s a real person only uses the first word. The second word is used by this new system to try to scan in books. If this is the case, we’re not really harnessing energy already being expended, but instead adding to the work done by CAPTCHA users.
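Here's roughly what I mean by the two-word guess, as a sketch. Everything in it (the class name, the method, the image IDs) is hypothetical; it only illustrates the idea that the known word decides pass/fail while the unknown word is just collected.

```python
from collections import defaultdict


class TwoWordCaptcha:
    """Guess #1: one known 'control' word plus one unknown word from a book scan."""

    def __init__(self):
        # unknown image id -> list of guesses typed by users who passed the control word
        self.guesses = defaultdict(list)

    def submit(self, control_word, typed_control, unknown_image_id, typed_unknown):
        # Only the control word decides whether the user is treated as human.
        passed = typed_control.strip().lower() == control_word.strip().lower()
        if passed:
            # Keep the unknown-word guess only from users who passed the real test.
            self.guesses[unknown_image_id].append(typed_unknown.strip().lower())
        return passed


captcha = TwoWordCaptcha()
print(captcha.submit("harbor", "Harbor", "img-001", "whale"))  # passes, guess recorded
print(captcha.submit("harbor", "harbur", "img-001", "xxxxx"))  # fails, guess discarded
print(captcha.guesses["img-001"])
```

Throwing away guesses from people who flunk the control word seems like the obvious way to keep bot garbage out of the book data, but that's my assumption, not something the article says.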

The only other system I can imagine is one where the CAPTCHA input is sent back to a central database in real time. As a new word/image goes out, it lets everyone in…the input test is in effect a bluff since there’s no data on what word the image represents. After, say, 500 people have responded to that word/image, the system starts to get a good idea of what the word is. At least it’ll be seeing some common letter positions at that point, and then it can start doing a pass/fail on the input from the user. Of course, using this method, a system that gets a ‘fresh’ image from the reCAPTCHA system isn’t really being protected from bots or spammers. On the other hand, the bot/spammer doesn’t know it’s a fresh image. (Do bots & spammers even try to spoof CAPTCHA systems, I wonder?)
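And a sketch of the 'bluff' guess. Again, all of this is hypothetical: the 500 is just the number I pulled out of the air above, and for simplicity the sketch grades against the most common whole-word answer rather than doing anything clever with letter positions.

```python
from collections import Counter


class BluffCaptcha:
    """Guess #2: let everyone in on a fresh image, start grading once enough answers exist."""

    def __init__(self, warmup=500):
        self.warmup = warmup      # answers to collect before grading (my made-up number)
        self.answers = Counter()  # normalized answer -> how many users typed it

    def submit(self, typed):
        guess = typed.strip().lower()
        if sum(self.answers.values()) < self.warmup:
            # Fresh image: nothing to check against yet, so the test is a bluff.
            self.answers[guess] += 1
            return True
        # Enough data: pass only users who match the most common answer so far.
        leader, _ = self.answers.most_common(1)[0]
        return guess == leader


image = BluffCaptcha(warmup=3)  # tiny warmup so the example is easy to follow
print([image.submit(w) for w in ["whale", "whale", "qwerty"]])  # all let in
print(image.submit("Whale"))   # graded now, matches the leader
print(image.submit("qwerty"))  # graded now, rejected
```

The third user gets in with junk, which is exactly the security hole I'm describing: while the image is fresh, nobody is really being tested.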

So, assuming that the much-smarter-than-me people at Carnegie Mellon haven’t come up with a better system, the new reCAPTCHA system either adds a bit to the workload of CAPTCHA users, or it slightly compromises the security of the systems using it. But in either case the drawbacks are pretty minimal, and the good work being done is pretty significant. I’m looking forward to the day the system gets put into practical use!

Review: Netgear’s Dual-Mode Skype Phone

My most recent published article. Computerworld edits to present a consistent voice, which means reading it doesn’t really sound like me, but since I haven’t had much to post at the blog lately I figured I might as well mention it.

Review: Netgear’s elegant VoIP/land-line hybrid phone

I was pretty impressed with this phone. If I didn’t already have Vonage I’d consider signing up for a year of “SkypeOut” calling and doing away with my landline. If you haven’t tried Skype, well, it really rocks. It makes services like TeamSpeak and Ventrilo seem like child’s toys.

What a gaffe!

Yesterday this was posted at “The Unofficial Apple Weblog”:

If you’re feeling overwhelmed from the onslaught of YouTube forwards, newsreader headlines, Miniclip games and software demos we tirelessly blog for you, Procrastinatr just might be your solution. Even though it’s only a 0.8b version, this handy little app can help you make molehills out of mountains and start managing your time again.

Turns out, the app in question was a piece of malware!! A few hours later the post was amended:

TUAW readers: I sincerely apologize for the damage that Procrastinatr did to iCal. I didn’t notice any discrepancies in my calendar after trying this out (as almost all of my calendars are synced from Google Calendar), but please know that I have learned my lesson, and I will take much better care in the future before posting anything like this again.

I’d love to totally hack on the guy since TUAW is part of the Weblogs, Inc. Network family of blogs and I dislike that group. But working for a news publication, I see how hard everyone pushes to be first to announce any little bit of news, or in this case a new product. Best coverage doesn’t seem to be as important as First coverage. I don’t understand this attitude… most web surfers have their favorite sites to visit…they’re not constantly rotating through sites to see which one gets the story up five minutes ahead of the others. *shrug* But I’m just a reader…how should I know what readers want?

Anyway, this frantic scramble to publish first leads to problems like this one. How much trust has this person lost now? I know I won’t be first to try any software TUAW links to in the future!

Are you an Internet Addict?

Looks like modern medicine is catching up to what MMO players have known for years: that some people become addicted to the internet. *turns the mirror to face the wall*

Growing concern over Internet addiction

Not to make light of the situation, because I do know people who have lost themselves online, but this paragraph makes the whole article sound like a late April Fool’s joke:

Internet addicts may also get the “cyber shakes” when off line, exhibiting agitation and typing motions of the fingers when not at the computer.

I can honestly say that no matter how far gone people get, I’ve never seen anyone making ‘typing motions’ as described here. Now, I’ve seen lots of people addicted to various substances and behaviors drum their fingers because they’re restless and anxious. But that has nothing to do with typing.