Today at THATCamp Southeast I helped organize a session (with Andrew Famiglietti from Georgia Tech) called Research Hacks. We brainstormed ways to use technology to enhance research, both at the archive and when examining born-digital sources. After I proposed the session, I had a moment of panic when I realized I didn’t really have any great hacks to offer. Luckily, I had a few hours and the impetus to finally put together some techniques I’d been meaning to investigate.
Like many researchers, I use a camera to take photos of documents during archival research trips. My problem comes when I arrive home with a bunch of photos that look like this:
Ugh. What to do with all these “DSCs”? Here’s a way to convert those images into documents that are actually searchable and usable.
So, first, credit where credit is due: the idea to use Hazel and Automator comes from Shane Landrum, whose excellent talk, “Camera, Laptop, and What Else?: Hacking Better Tools for the Short Archival Research Trip,” I saw at Yale in 2009.
Hazel is a Mac utility ($21.95; free trial for 14 days) that keeps an eye on the folders you specify. At your direction, it’ll perform certain actions on the files in those folders. I use it to rename the images in a folder I’ve called “Research Photos.” In this case, I’ve told Hazel to change my image names from the dreaded “DSC000000” to “NLM” (which stands for the National Library of Medicine) followed by the date the photos were taken.
That’s already a lot better. Now I have some assurance that if my photos get moved around, I’ll still at least know which archive they’re from.
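If you don’t have Hazel, the same renaming idea can be roughly sketched in a few lines of Python. This is not what Hazel does under the hood, just an illustration under some assumptions of my own: the function names, the “NLM_date_counter” pattern, and the counter suffix are all made up for this example, and the file’s modification time stands in for the date the photo was taken (which may not match the camera’s own date stamp).

```python
from datetime import datetime
from pathlib import Path


def plan_renames(folder, prefix="NLM"):
    """Return (old_path, new_path) pairs for camera-named JPGs in a folder.

    Hypothetical helper: only files whose names start with "DSC" and end
    in .jpg/.jpeg are touched.
    """
    folder = Path(folder)
    photos = sorted(
        p for p in folder.iterdir()
        if p.is_file()
        and p.suffix.lower() in {".jpg", ".jpeg"}
        and p.name.upper().startswith("DSC")
    )
    plan = []
    used = set()
    for photo in photos:
        # Assumption: the modification time approximates the date taken.
        day = datetime.fromtimestamp(photo.stat().st_mtime).strftime("%Y-%m-%d")
        counter = 1
        while True:
            # The counter keeps photos from the same day from colliding.
            new = folder / f"{prefix}_{day}_{counter:03d}{photo.suffix.lower()}"
            if new not in used and not new.exists():
                break
            counter += 1
        used.add(new)
        plan.append((photo, new))
    return plan


def rename_photos(folder, prefix="NLM"):
    """Apply the planned renames and return the new paths."""
    plan = plan_renames(folder, prefix)
    for old, new in plan:
        old.rename(new)
    return [new for _, new in plan]
```

Calling plan_renames("Research Photos") first lets you look over the proposed names before rename_photos("Research Photos") actually changes anything. It’s worth trying this on a copy of the folder.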
But now I have a bunch of JPGs. Personally, I prefer PDFs, because then I can run optical character recognition (OCR) on them with Acrobat Pro, as I explain below.
JPGs to One Big PDF
Please see updated instructions here for turning all your JPGs into one big PDF.
Creating Searchable PDFs
The easiest way to run OCR, which recognizes the text in your images, is to use Adobe Acrobat Pro. Even if you don’t have it on your own computer, your university probably has it somewhere. But if you can’t get access to Acrobat Pro, Shane outlines a couple of other options, including Evernote and Ocropus.
Once I’ve opened the PDF in Acrobat Pro, I click on “Document,” “OCR Text Recognition,” and then “Recognize Text Using OCR.”
Now I have a document with text that can be copied and searched. The recognized text is really, really dirty (full of OCR errors), but it’s better than a plain old image.
Getting Your PDF into Zotero
I like to keep all my research PDFs in Zotero, so I create a Zotero item for the PDF. Then I drag the PDF into Zotero, control-click on the PDF, and then click on “Rename File from Parent Metadata.” That gives my PDF the same title as the Zotero item record I just created.
What’s cool is that I can then search for text right from my Zotero search box. See? It found the word “human” in my PDF!
As a couple of workshop attendees noted, this is a pretty unwieldy process. I’d love to see someone make a more streamlined piece of software. Jason Puckett pointed out that Zotero Commons actually goes a long way toward this: it can upload your documents to the Internet Archive, which will then run OCR on them for you. But you have to be sure that your documents can be made publicly available.
What do you use to process your research photos?