Using Human Computation and reCAPTCHA to Digitize Old Books, with Luis von Ahn


So the idea of human computation is that there
are problems that computers cannot yet solve. It’s funny because some of these problems
are seemingly very simple. For example, a computer cannot tell you what’s inside an
image. It can tell you some things, but it can’t quite tell you that there’s a cat
next to a dog and they’re both running. A computer can’t do that. We humans can
do it super easily. And there are many things that computers cannot do that humans can.
Conversely, there are also things that computers can do that humans can’t do. I mean, computers
can multiply humongous numbers; humans may be able to do it, but very slowly, and we’re
error-prone. And so the idea with human computation is to combine humans and computers
on a very large scale to solve problems that neither can solve alone. The project of mine that has been used by the most people
is called reCAPTCHA. The idea with reCAPTCHA is that we take a problem that
neither humans nor computers can solve by themselves, which is fully digitizing books.
The idea there is we would like to digitize books. And the way this process works is you
start with a book and then you scan it. The next step in the process is that the computer
needs to be able to decipher all of the words in this picture. It’s a picture of words.
The computer needs to be able to decipher all of those words. The problem is that sometimes
the computer cannot decipher these words because for older books the ink has faded a little
or the pages have turned yellow, so the computer cannot decipher all of the words. But humans
can. So here is what we’re doing with reCAPTCHA. You’ve probably seen these distorted letters that
you have to type all over the Internet. For example, when you buy tickets on Ticketmaster
or whenever you get a Facebook account or something, you have to type these distorted
characters. That thing is called a CAPTCHA, and I was one of the people who helped invent
it. Its primary purpose is to make sure that you’re
a human and not a computer, and it works because humans can read these squiggly characters
but computers can’t. This is a security mechanism and it has been there for a while, but at
some point I realized it could have a second use, which is helping to digitize books. The idea is
that nowadays some of these words are actually coming
from books that the computer could not recognize in this process, and we’re using what people
enter to help us digitize the books. So that’s the idea. And so this is a project where about
1.1 billion people in the world have helped us digitize at least one word out of a book
using this. So here we’re taking a very large number of humans to do precisely the step
that computers cannot do in the book digitization process. This is a company that was bought
by Google, and by now Google is digitizing the equivalent of about 2 million books a year,
basically with humans typing some of the words every now and then through CAPTCHAs all
over the Internet. So that’s the idea of human computation.
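
One way to see how the same box can serve both purposes is the two-word scheme that published descriptions of reCAPTCHA outline: each challenge pairs a control word whose answer is already known with a word the scanner could not read. Only the control word is graded, and answers for the unknown word are collected as votes. The Python sketch below illustrates that idea under stated assumptions: the function names, the vote thresholds (`min_votes`, `min_share`), and the image IDs are all hypothetical, and this is not the actual reCAPTCHA implementation.

```python
from collections import Counter


def normalize(word):
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return word.strip().lower()


def check_challenge(control_answer, typed_control):
    """Security check: only the control word, whose answer is known, is graded."""
    return normalize(typed_control) == normalize(control_answer)


def record_vote(votes, image_id, typed_unknown):
    """Store one human reading of a word the scanner could not decipher."""
    votes.setdefault(image_id, Counter())[normalize(typed_unknown)] += 1


def resolved_word(votes, image_id, min_votes=3, min_share=0.75):
    """Return the agreed reading once enough people agree, else None.

    The thresholds are illustrative assumptions, not real reCAPTCHA values.
    """
    counts = votes.get(image_id)
    if not counts:
        return None
    word, top = counts.most_common(1)[0]
    total = sum(counts.values())
    if total >= min_votes and top / total >= min_share:
        return word
    return None


def submit(votes, control_answer, typed_control, image_id, typed_unknown):
    """Grade the control word; count the unknown-word answer only on a pass."""
    if not check_challenge(control_answer, typed_control):
        return False  # likely a bot or a typo: reject and discard the vote
    record_vote(votes, image_id, typed_unknown)
    return True
```

For example, once three users who pass the control word all type "upon" for the same scanned image, `resolved_word` returns "upon"; a user who fails the control word is rejected and contributes no vote.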

91 Replies to “Using Human Computation and reCAPTCHA to Digitize Old Books, with Luis von Ahn”

  1. What you said: Humans working together with computers to create a better future.
    What everyone heard: Humans being put to work as free labor without even knowing it.

  2. When reCAPTCHA was implemented on 4chan to curb spam, it was hated by /b/; Anonymous gamed the system by typing offensive slurs instead of the real word.

  3. CAPTCHA is still the stupidest thing on the internet… it might be nice for getting books online using crowd work, but it's annoying as fuck to do every single post in forums. Especially when you can't read the shit in the image.

  4. I don't understand…

    You must input the correct word for the captcha to work, which means they must already know the word in order to check if it's the correct one…

  5. I have an issue with reCAPTCHA, and it's not that it's free labour; I'm all for helping to digitize books… My problem with the system is that since I'm dyslexic it's really hard for me to get the words correct sometimes. In some cases it's taken me 4+ tries to get past the reCAPTCHA 🙁

  6. Sorry for this stupid question, but how does it know you're typing the right word to confirm you're human, if we're the ones typing the words the computer can't decipher? Is it being confirmed by millions of others who type the same word? But then, how does this work, because it needs to see you've typed the correct word first in order to allow you to continue.

  7. It benefits society ( I guess) but it benefits google too because not only do they not have to hire workers to manually digitize books, they can just get people over the internet to do it for free. Hmmm.

  8. Fuck you and your Captcha. Those shitty things either don't recognize what you type or give you an image so distorted it's unreadable. And then it asks you to try again. Fuck your shitty time-wasting security

  9. If you're using CAPTCHA as a security check, how on earth are you at the same time using it to have people decipher the books for you?  Since to use it as a security check, you first need to know the answer.  You're not making sense.

  10. …and this is why I type obscenities every time I get one of these.  If you want my help, you fucking ask for it.  Don't co-opt my brain for your purposes without my consent.  Fuck you.

  11. Does this man get kicked in the nutsack a lot? He needs to be kicked in the nutsack as often as possible. Yes, the new idea of scanning books for digitization is brilliant. However, his Captcha sucks so much ass that he should never be allowed to use a computer again. I've never known a virus to be anywhere near as frustrating as this guy's idiot idea to force people to decode completely illegible 'words', or be stuck on whatever page they are on. What's worse is that when, not if, you get it wrong, you don't get to try again; you get a brand new hieroglyph to decipher.

    This is the most retarded product ever made.

  12. How about we don't get angry that Google is making us type ONE word for them, while they give us amazing services for FREE. The fact you learned this information is thanks to Google. I'm sure if someone came up to you and said "Hey we want to digitize books so that they can be accessible by anyone and preserved forever. (and you can download them for free you cheap bastard), will you help us type 100 words during your lifetime, without pay?"

    I'm sure you could put aside the time to do this.

    Should probably be THANKING them for a genius idea.

  13. Here's the issue I have with this, maybe someone can help me out:
    As he stated, reCaptcha is primarily a security system designed to ensure that online transactions or whatever are not being performed by automated computers. You type in the word you see in the "picture" on the screen, and the system lets you proceed if you get the word right. Therefore, someone had to have already determined a correct answer to that "picture". How then does my answer help them decipher new words if there is already supposed to be a correct answer, and therefore a word, associated with each picture?

  14. So I guess Google just solved this whole problem with the Google's Image Recognition Software. http://gizmodo.com/googles-image-recognition-software-can-now-describe-ent-1660033808

  15. Great idea but fuck me if I don't HATE those captcha things. Also, how come the computer knows if I've typed the captcha correctly if it doesn't know what it is to begin with? If it already knew, why do I need to decipher it? (taking security out of the equation for a moment)

  16. 1. The computer doesn't know if it recognized the word properly or not.
    2. If the computer doesn't know the word, then you can enter anything in the captcha and it will pass, so it loses the purpose of the captcha.

  17. Computers could recognize digitized letters the same way people can, by using knowledge of the language and context to figure out what the word should be. There are things that are hard for computers to do, but reCAPTCHA shouldn't be one of them. 

  18. Out of curiosity… how does this help fill in books? In order to use the word to enter a website (or whatever), the computer has to already know the answer or it does not work. You can't ask a question for security purposes without knowing the answer. So how does this help at all?

  19. That's pretty awesome… except when I can't read the captcha and have to take a guess.  Those poor books, gonna have L's, I's, and 1's all mixed up.

  20. And this just shows how you can use anything and implement it to spy on people… This CAPTCHA is a good example: even though it's not used to spy, it's still creepy that it grabs information like that now, when its purpose was something quite different… 😛

  21. Very cool. One thing I don't really understand is when you get a word wrong and CAPTCHA changes the word, how does CAPTCHA know that you got the word wrong if the program needs your help to decipher it to begin with?

  22. this doesn't add up. if they were really using the words I typed, then even my incorrect inputs would work. and so far none of them have 

  23. Please, if anyone knows an answer to these questions?
    What if the human entered the word wrong in the box?
    How do they know that you entered the word correctly? shouldn't they have the answer to compare it to yours ?
    Do they put the same picture around the world and compare the results from different people to confirm what is right?

  24. Cool system, but I have some questions. If the computer can't detect the word, how does it know if you're right? Does it take the most common response?

  25. Wait… if they upload words from books for captcha so that we can help identify what these words really could be, then how does captcha select the correct answer when they don't know and have to ask us? And to be honest those words are not that difficult.

  26. CAPTCHA is the worst invention in the history of authentication. I don't even know wth it shows 80% of the time, it's incredibly annoying. Of course it's a good invention overall, I'm obviously exaggerating. But I'd rather use a Fob.

  27. How do they know that the captcha is correct if the computers cannot solve the problem? I always thought it would have an answer as a key-value pair.

  28. I think the 'Captcha' everyone is so frustrated with are those old random gibberish letters a lot of times smashed in together and strike-through'd, really annoying. The reCaptcha is something easier being actual words, its like one arrow for two targets plus I personally think it's for a good secondary cause.  

  29. Brilliant! and simple, as are all the best ideas. These guys think outside the box . Now when I do a captcha I'm not wasting my time.  

  30. Don't you need to know the real answer before entering the captcha answer as correct?
     My question is: doesn't someone have to go over the answers to make sure they are correct before sending them out to test people in the first place?

  31. Yes, this system works. But you need to accept 1 or even 100 problems, because this system only shows you words that are written in English or in an alphabet based on the Latin alphabet. Well, there are many books in other languages. What about these books?

  32. Luis Von Anus. Looks like an anus, thinks like an anus and makes software like an anus. "Thank you" very much for that stinking shit called recaptcha, very "useful", indeed.
