Crowdsourcing Digitization Projects

Computers and search technology have changed the way we process text. Google Drive’s search is so good that it can extract key information from a scanned purchase receipt. While digital text has been growing rapidly in the last several years, we must not forget the millions of pages of handwritten and typewritten text sitting in archives and special collections. This text is a historian’s gold mine. I fondly tell my history students that visiting archives and seeing, touching, smelling and reading archival documents is part of the joy of being a historian. Not everyone has the luxury of visiting archives. Should that information continue to be kept beyond their reach? What if all this information were scanned and made easily available on the internet? Scanning is only the first step in making this information useful to the general public. I have some classmates from college who love the philosophy of Jeremy Bentham. Would they be willing to pick up their iPads and read Bentham in his eighteenth-century handwriting? I do not think so. Over 40,000 of Jeremy Bentham’s manuscripts have been digitized and are available online. These are in Bentham’s own handwriting. Wow! I love Bentham’s philosophy, but I am not sure I love it enough to spend an hour trying to decipher one page of his work. While I may not want to spend that much time reading a single page of Bentham, I am more willing to spend it helping with the transcription. This is where crowdsourcing comes in.

I am convinced that there are thousands of people around the world who may not want to spend an hour trying to read a single page of Bentham but are willing to spend an hour transcribing Bentham’s work so that it is more accessible to people interested in him. As of this past week, hundreds of volunteers have transcribed 14,266 of those manuscripts. All this is possible because the folks at Transcribe Bentham have made it easy for anyone to create an account and begin transcribing. Transcribe Bentham is one of several such projects that rely on volunteers from the general public. Other examples are Trove, Papers of the War Department (PWD) and Building Inspector. None of these projects requires any technical expertise. When people choose to engage with these projects, it is partly because of their own research interests and partly because they believe they are contributing to something bigger than themselves. That is how I felt when I found myself transcribing Bentham and editing an Australian newspaper using Trove. The satisfaction did not come from any form of acknowledgment; rather, I felt that I had to give back to society for what it has given to me. I have been a beneficiary of many archival documents that others have painstakingly digitized and edited. Editing on Trove or transcribing Bentham is my own little way of contributing to the spread of knowledge.

If you worry that these projects may be too complicated for you, they are not. The architects of the projects deliberately make them easy so that a non-techie person can contribute. I bet you that my dad, who is 79 years old and has never been a computer guy, would find them really fascinating. In fact, I am planning on introducing him to all four projects. I think he would be hooked. For the Transcribe Bentham and PWD projects, the transcription platform is MediaWiki. If you are familiar with Wikipedia (most people are), then you won’t be a stranger to the platform. Basically, there are two windows to it: the image viewer (where you view the digitized original image) and the transcription form (where you enter your transcription). For you perfectionists out there who are so worried about making a mistake that you may not want to transcribe at all, you need not be worried. It is okay if you make a mistake. Your transcription is not final. It is vetted and corrected by other users, and an administrator or moderator has to approve it before it is locked and recorded.

Trove is different from the transcription projects, but the principles are the same. In Trove, you are editing the text of newspapers that have been scanned. OCR software has come a long way, but it is still far from perfect. It still makes mistakes, especially when the newspaper being scanned is 100 years old and the print quality is poor. Sometimes it adds punctuation marks that are not in the original text, or it confuses “e” with “o.” The digitized articles I looked at on Trove were very readable. My task was to do a line-by-line correction of the OCR text. I actually found it really engaging, perhaps because I enjoyed reading the story I was editing.
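For readers who like to tinker, here is a minimal Python sketch of what a correction like this looks like to a computer. It compares a line of OCR output with a volunteer's corrected line and counts which characters were swapped, such as an “o” that should have been an “e.” The function name `ocr_confusions` and the sample sentence are my own inventions for illustration; this is not how Trove works internally, just one way to picture the task using Python's standard `difflib` module.

```python
import difflib
from collections import Counter


def ocr_confusions(ocr_text: str, corrected_text: str) -> Counter:
    """Count the character replacements that turn OCR output into the corrected text."""
    # autojunk=False keeps difflib from ignoring frequent characters in long texts.
    matcher = difflib.SequenceMatcher(None, ocr_text, corrected_text, autojunk=False)
    confusions = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only substitutions are counted; insertions and deletions are skipped.
        if tag == "replace":
            confusions[(ocr_text[i1:i2], corrected_text[j1:j2])] += 1
    return confusions


# Made-up sample line, not taken from Trove.
ocr_line = "Tho mayor oponed the now bridge."
fixed_line = "The mayor opened the new bridge."

for (wrong, right), count in ocr_confusions(ocr_line, fixed_line).most_common():
    print(f"OCR read {right!r} as {wrong!r}: {count} time(s)")
```

A tally like this could, in principle, show which mistakes the OCR software makes most often on a given newspaper, though a human still has to read the original image to know what the correct text is.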

Crowdsourcing projects can be more engaging than what I am seeing right now. Over time, many people drop off and never return, and a tiny minority does most of the work. How do you get people back? The studies are clear that most people who engage in these activities do so because they want to contribute to something bigger than themselves. However, I think there is a need for a reward system. Human beings love rewards. I have a Fitbit that I wear religiously. I wear it because I want to track my health and live healthily. When every now and then I receive a notification from Fitbit telling me about my accomplishments, it feels really good and motivates me to do even more. How good it felt when, only a few months after I started using Fitbit, I got the Italian Badge (my steps added up to a walk around the whole of Italy). It made me want to walk the whole world. These are badges I do not share with anyone, but merely being awarded them incentivizes me to take more steps. Human beings like incentives, and they want to feel that they have accomplished something. Perhaps creating levels for volunteers would incentivize them to return to the projects, and hopefully, over time, they would get hooked and never want to fall below their level.
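To make the idea of levels a little more concrete, here is a rough Python sketch of how a project might turn a volunteer's contribution count into a level and tell them how far they are from the next one. Everything in it is hypothetical: the level names, the thresholds, and the function names `level_for` and `lines_to_next_level` are mine, and none of the projects mentioned here is known to me to work this way.

```python
from bisect import bisect_right
from typing import Optional

# Hypothetical thresholds: minimum number of corrected lines for each level.
LEVELS = [(0, "Newcomer"), (10, "Contributor"), (50, "Regular"), (200, "Expert")]
THRESHOLDS = [minimum for minimum, _ in LEVELS]


def level_for(lines_corrected: int) -> str:
    """Return the name of the highest level whose threshold has been reached."""
    if lines_corrected < 0:
        raise ValueError("lines_corrected cannot be negative")
    return LEVELS[bisect_right(THRESHOLDS, lines_corrected) - 1][1]


def lines_to_next_level(lines_corrected: int) -> Optional[int]:
    """Return how many more lines are needed for the next level, or None at the top."""
    if lines_corrected < 0:
        raise ValueError("lines_corrected cannot be negative")
    next_index = bisect_right(THRESHOLDS, lines_corrected)
    if next_index == len(LEVELS):
        return None
    return THRESHOLDS[next_index] - lines_corrected


print(level_for(42), lines_to_next_level(42))
```

A message such as "8 more lines to reach the next level" is the kind of small nudge I have in mind, much like the notifications my Fitbit sends me.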

There is a lot of digitized text that needs help and is crying out for our attention. Each one of us can do our own little part to make it easily accessible. You may not be able to transcribe a whole page of Bentham, but you surely can transcribe a sentence.

