Today at THATCamp Southeast I helped organize a session (with Andrew Famiglietti from Georgia Tech) called Research Hacks. We brainstormed ways to use technology to enhance research, both at the archive and when examining born-digital sources. After I proposed the session, I had a moment of panic when I realized I didn’t really have any great hacks to offer. Luckily, I had a few hours and the impetus to finally put together some techniques I’d been meaning to investigate.
Like many researchers, I use a camera to take photos of documents during archival research trips. My problem comes when I arrive home with a bunch of photos that look like this:
Ugh. What to do with all these “DSCs”? Here’s a way to convert those images into documents that are actually searchable and usable.
So, first, credit where credit is due: the idea to use Hazel and Automator comes from Shane Landrum, whose excellent talk, “Camera, Laptop, and What Else?: Hacking Better Tools for the Short Archival Research Trip,” I saw at Yale in 2009.
Renaming Photos
Hazel is a Mac utility ($21.95; free trial for 14 days) that keeps an eye on the folders you specify. At your direction, it’ll perform certain actions on those folders. I use it to rename the images in a folder I’ve called “Research Photos.” In this case, I’ve told Hazel to change my image names from the dreaded “DSC000000” to “NLM” (which stands for the National Library of Medicine) and then the date the photos were taken.
That’s already a lot better. Now I have some assurance that if my photos get moved around, I’ll still at least know which archive they’re from.
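If you don’t have Hazel, the same renaming idea can be sketched in a few lines of Python. This is only a rough illustration, not what Hazel does internally: the function name `rename_photos` and the exact naming pattern (prefix, date, running number) are my own assumptions, and the file’s modification time stands in for the date the photo was taken, since reading the real capture date would need an EXIF library.

```python
from datetime import datetime
from pathlib import Path


def rename_photos(folder, prefix="NLM"):
    """Rename camera files such as DSC00001.JPG to PREFIX-YYYY-MM-DD-NNN.jpg.

    Assumption: the file's modification time is used as a stand-in for the
    date the photo was taken.
    """
    folder = Path(folder)
    # Only touch JPEG files whose names start with "DSC"; leave the rest alone.
    photos = sorted(
        p for p in folder.iterdir()
        if p.is_file()
        and p.name.upper().startswith("DSC")
        and p.suffix.lower() in {".jpg", ".jpeg"}
    )
    counters = {}  # running number per day, so two photos never share a name
    renamed = []
    for photo in photos:
        day = datetime.fromtimestamp(photo.stat().st_mtime).strftime("%Y-%m-%d")
        counters[day] = counters.get(day, 0) + 1
        target = folder / f"{prefix}-{day}-{counters[day]:03d}.jpg"
        if target.exists():
            # Refuse to overwrite a file that is already there.
            raise FileExistsError(f"{target} already exists")
        photo.rename(target)
        renamed.append(target.name)
    return renamed
```

Calling `rename_photos("Research Photos")` would rename every “DSC” image in that folder and return the list of new names. Try it on a copy of your photos first.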
But now I have a bunch of JPGs. Personally, I prefer PDFs, because then I can run optical character recognition (OCR) on them with Acrobat Pro, as I explain below.
JPGs to One Big PDF
Please see updated instructions here for turning all your JPGs into one big PDF.
Creating Searchable PDFs
The easiest way to run OCR, which recognizes the text in your images, is to use Adobe Acrobat Pro. Even if you don’t have it on your own computer, your university probably has it somewhere. But if you can’t get access to Acrobat Pro, Shane outlines a couple of other options, including Evernote and Ocropus.
Once I’ve opened the PDF in Acrobat Pro, I click on “Document,” “OCR Text Recognition,” and then “Recognize Text Using OCR.”
Now I have a document with text that can be copied and searched. The recognized text is really, really dirty, meaning it’s full of recognition errors, but it’s better than a plain old image.
Getting Your PDF into Zotero
I like to keep all my research PDFs in Zotero, so I create a Zotero item for the PDF. Then I drag the PDF into Zotero, control-click on the PDF, and then click on “Rename File from Parent Metadata.” That gives my PDF the same title as the Zotero item record I just created.
What’s cool is that I can then search for text right from my Zotero search box. See? It found the word “human” in my PDF!
As a couple of workshop attendees noted, this is a pretty unwieldy process. I’d love to see someone make a more streamlined piece of software. Jason Puckett pointed out that Zotero Commons actually goes a long way toward this: it can upload your documents to the Internet Archive, which will then run OCR on them for you. But you have to be sure that your documents can be made publicly available.
What do you use to process your research photos?
Thank you for your helpful post. I would love to use OCR, but the sample page from an old book that I tried did not produce a good result. I am dealing with nineteenth-century and early twentieth-century books that sometimes have mould spots, are printed on paper of varying quality, and have text that is not printed as crisply on the page as is done now. I am also photographing without flash – at times in less than ideal light conditions. The end result is that although my images are easily readable for the human eye, they are less than ideal for OCR.
I am sure that you are dealing with similar quality material. Do you find converting to OCR as straightforward as this post implies, or is there a bit more to it?
Thanks for your comment! You’re right, sometimes performing OCR on texts is less straightforward than I made it sound in the post. In fact, sometimes OCR doesn’t work at all on the texts I’m using. I tend to take a some-is-better-than-nothing attitude to OCR. But there are ways to improve the results you get from text-recognition scans by preparing the document you’re working with. I don’t know a ton about it, but this page has what sounds like some good suggestions. They boil down to: get rid of blemishes, change the threshold to get rid of tinting and lighting issues, and convert the image to black and white.
Hope you get some results!
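To make the thresholding suggestion a bit more concrete, here is a minimal sketch in Python. It assumes the photo has already been reduced to rows of grayscale values (0 for black, 255 for white); getting those values out of a real image would take an imaging library, which isn’t shown here. The function name `to_black_and_white` and the choice of the average brightness as the default cutoff are my own assumptions, not something taken from the page linked above.

```python
def to_black_and_white(gray_rows, threshold=None):
    """Turn grayscale pixels (0 = black, 255 = white) into pure black or white.

    If no threshold is given, the average brightness of the page is used as
    a crude cutoff (an assumption; a hand-picked value may work better).
    """
    pixels = [value for row in gray_rows for value in row]
    if not pixels:
        return []
    if threshold is None:
        threshold = sum(pixels) / len(pixels)
    # Anything brighter than the cutoff becomes paper, the rest becomes ink.
    return [[255 if value > threshold else 0 for value in row] for row in gray_rows]


# A tiny "page": tinted paper (around 200) with two darker ink pixels.
page = [
    [200, 210, 60],
    [190, 50, 205],
]
clean_page = to_black_and_white(page)
```

Here the tinted background values all land above the cutoff and turn white, while the two ink pixels turn black, which is the kind of cleaner, high-contrast page OCR software tends to handle better.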
I agree that something is better than nothing. Thank you for the link.