PDF Scanning and Optical Character Recognition

Overview

When you scan a physical document and save it as a PDF, every page is stored as an image, including any text on the page.

The most fundamental component of PDF accessibility is ensuring that any text in the document is searchable. Screen readers and other assistive technologies cannot read text from images, or interpret the structure of documents that are saved as images. If you scan a document and save it as a PDF, you need to perform Optical Character Recognition (OCR) on it before any additional accessibility checks. This article details how to perform OCR, and provides tips on creating better-quality document scans.

Note: This article is for PDFs created from scanning, or converted from image files. PDFs exported from Word and other content editing interfaces already have recognizable and searchable text.

Guidelines

Avoid Scanning Documents Whenever Possible

The University of Oregon has access to many online journals, and a librarian might be able to find a version of your resources already digitized.

If You Must Scan, Start With a High Quality Source

OCR works best on documents that are:
- Computer generated
- High resolution
- Clear and legible

Whenever possible, avoid:
- Handwriting
- Notes on the page, including underlining, highlighting, and notes in the margins
- Documents with rips and stains
- Scanning the book binding

If the source material can easily be removed from its binding, remove it before scanning.
Scan items in the correct orientation.
Use a scanner setting of at least 300 dpi for text, and consider the highest available setting if your document has complex diagrams, scientific notation, or other nonstandard characters.
If the scanner provides an option to create a "Searchable PDF", select it. This automatically performs OCR at the time of scanning.
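
To see what the 300 dpi minimum means in practice, the following minimal sketch converts a page size in inches to pixel dimensions, and estimates the resolution of an existing scan from its pixel width. The function names are hypothetical and chosen only for illustration. The US Letter page size (8.5 x 11 inches) is used as an example.

```python
def scan_pixel_size(width_in, height_in, dpi=300):
    """Return the (width, height) in pixels of a page scanned at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(pixel_width, page_width_in):
    """Estimate the resolution of an existing scan from its pixel width."""
    return pixel_width / page_width_in


# US Letter (8.5 x 11 in) at the recommended minimum of 300 dpi
print(scan_pixel_size(8.5, 11))   # (2550, 3300)

# A 1275-pixel-wide scan of the same page works out to 150 dpi,
# which is below the recommended minimum
print(effective_dpi(1275, 8.5))   # 150.0
```

If an existing scan is well below 300 dpi by this estimate, rescanning the source at a higher setting is likely to give the OCR tool more to work with.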

How to Test

Check for Live Text

To test whether the PDF has actual, recognized text, open the PDF and try to select the text. If you can highlight the text with your cursor, it is recognized. If you cannot highlight the text, it is part of the image and will not be recognized by assistive tools.
You can also test this with a text search. Press Ctrl+F (Cmd+F on a Mac) to open the search box, and search for a term you know is in the document. If the search finds no matches, the text is not recognized.
The screenshot shows a scan of the United States Constitution without searchable text. Note that the text cannot be selected. The text is highly stylized handwriting on a document with significant wear. Even after running the OCR tool, automated character recognition is unlikely to identify this text correctly.

US Constitution as PDF without searchable text

Compare that with a screenshot of the text of the 27th Amendment, saved as a PDF with searchable text. Note that the text can be highlighted, as in a word processor. This document was originally an image, but because it uses clear and legible computer-generated text, the OCR tool was able to parse the text correctly.

PDF screenshot of 27th amendment with searchable text
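
If you have many PDFs to check, the live-text test described above can be approximated with a script. The sketch below is one possible approach, not part of the Acrobat workflow: it assumes the third-party pypdf package is installed, and the helper names and the 20-character threshold are hypothetical choices for illustration. A page that yields almost no extractable text is probably an image that still needs OCR.

```python
def pages_without_live_text(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is shorter than min_chars.

    min_chars is an assumed threshold, not an established standard.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def extract_page_texts(pdf_path):
    """Extract the text of each page. Assumes the pypdf package is installed."""
    from pypdf import PdfReader  # imported here so the helper above has no dependency

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


# Illustrative input: page 1 has text, pages 2 and 3 are effectively empty
sample = ["We the People of the United States ...", "", "   \n"]
print(pages_without_live_text(sample))  # [2, 3]
```

A script like this only reports whether text exists. It does not tell you whether the recognized text is correct, so a manual spot check with a text search is still worthwhile.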

How to Perform OCR

If your text is searchable, you're already done with this step! If your text is not searchable, here's how to perform OCR.

Set Up Adobe Acrobat

Add the Scan & OCR tool to the tools pane of Adobe Acrobat. Under the Tools tab, find Scan & OCR. Click the Add button to add it to the Tools sidebar. You will likely use the Accessibility and Action Wizard tools in later accessibility testing steps, so add those tools while you are here. Once you add a tool to the sidebar, its Add button changes to Open, as seen in the screenshot. Return to your document.

Adobe Acrobat adding the OCR tool

Run the OCR Tool

Select Scan & OCR from the sidebar.

Acrobat Scan & OCR tool in the sidebar

The tool opens a new toolbar with scanning options. Choose the Enhance option. Make sure Recognize Text is checked, then click Enhance. Depending on the size of the document, this may take a minute.

Acrobat Scan & OCR toolbar options
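
If you do not have Acrobat, or need to process many files, a command-line tool can perform a similar step. The sketch below is an alternative to the Acrobat procedure above, not a description of it: it assumes the open-source OCRmyPDF tool is installed and on your PATH. The function names and file names are hypothetical.

```python
import shutil
import subprocess


def build_ocr_command(source_pdf, output_pdf, language="eng"):
    """Build an OCRmyPDF command line (OCRmyPDF is assumed to be installed)."""
    return [
        "ocrmypdf",
        "--skip-text",     # leave pages that already contain text untouched
        "--rotate-pages",  # try to correct page orientation
        "--deskew",        # straighten slightly crooked scans
        "-l", language,    # language code used for recognition
        str(source_pdf),
        str(output_pdf),
    ]


def run_ocr(source_pdf, output_pdf):
    """Run OCR and write a new PDF, leaving the original file unchanged."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocr_command(source_pdf, output_pdf), check=True)


# Hypothetical file names, shown only to illustrate the resulting command
print(" ".join(build_ocr_command("scan.pdf", "scan-ocr.pdf")))
```

Writing to a separate output file keeps the original scan intact, which helps if you later need to provide both the original and an accessible alternative.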

Provide Accessible Alternatives if Necessary

Verify that the text is now searchable. If it is not, running Enhance multiple times sometimes produces better results. If the tool still does not recognize the text after several attempts, the source image is not legible enough for OCR.
- If you need to use the original source document (e.g., an image of the original Constitution is preferable to a font-based recreation), consider uploading both the original, inaccessible version and an accessible alternative.
