Digitization Optimized for OCR


| *Text-based | Archival Master File | Optical capture resolution | Bit depth | Embedded color/gray profile | Notes |
|---|---|---|---|---|---|
| Printed computer documents | Uncompressed TIFF v.6 | 300-400 PPI | 8/24 | Adobe RGB (1998)/Gray Gamma 2.2 | Size of text should be considered when determining resolution |
| Typed documents | Uncompressed TIFF v.6 | 300-400 PPI | 8/24 | Adobe RGB (1998)/Gray Gamma 2.2 | Size of text should be considered when determining resolution |
| Printed publication matter | Uncompressed TIFF v.6 | 300-400 PPI | 8/24 | Adobe RGB (1998)/Gray Gamma 2.2 | Size of text should be considered when determining resolution |
| Printed matter on microform | Uncompressed TIFF v.6 | *3500 PPI | 8 | Gray Gamma 2.2 | *Accounts for magnification ratio |

*Image specs optimized for OCR


This standard defines best practices for creating reference images when digitizing text-based material that is slated for Optical Character Recognition (OCR) processing, either immediately after digitization or at some point in the future. These standards prioritize legibility, and the images can be captured using automated devices such as flatbed or orbital scanners.

While material digitized according to these standards will often be in fairly robust condition, it may be vulnerable due to inherent vice in the medium (such as brittle paper or deterioration of a film substrate). Conservation assessment is recommended prior to digitization.

Digitization standards for preservation

Collections that contain any color (in images or text) should be photographed entirely in color, though this is at the discretion of project stakeholders. Digitization for OCR will often benefit from grayscale capture with modest contrast enhancement. Master files for materials that were originally produced in grayscale or bitonal, such as microforms, should be digitized in 8-bit grayscale. For microforms, higher resolution imaging is required to account for the magnification of the original document relative to its representation on film. For example, an 8.5 x 11-inch document captured on 35mm microfilm is recorded at a magnification ratio of approximately 0.12, so a 3500 PPI scan of the film corresponds to 3500 x 0.12 = 420 PPI relative to the original document. Dedicated scanners are required for this type of imaging.
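The microform arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation in the example; the function names are hypothetical, and the 0.12 ratio is the approximate figure quoted above for an 8.5 x 11-inch page on 35mm film, not a value that applies to every reel.

```python
def effective_ppi(scan_ppi: float, magnification_ratio: float) -> float:
    """Resolution relative to the original document for a scan of the film."""
    return scan_ppi * magnification_ratio


def required_scan_ppi(target_ppi: float, magnification_ratio: float) -> float:
    """Film scan resolution needed to reach a target PPI on the original."""
    return target_ppi / magnification_ratio


# Approximate ratio quoted in the text (assumption: it varies by film and camera setup).
RATIO = 0.12

# A 3500 PPI scan of the film, expressed relative to the original page.
print(round(effective_ppi(3500, RATIO)))

# Film scan resolution that would be needed for 300 and 400 PPI on the original.
print(round(required_scan_ppi(300, RATIO)), round(required_scan_ppi(400, RATIO)))
```

Under this ratio, a 3500 PPI film scan lands slightly above the 300-400 PPI range specified for paper originals. The actual ratio should be taken from the film target or reel documentation where available.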

Master File Format: All master files should be uncompressed Tagged Image File Format (TIFF) version 6 files, in either “little endian” (IBM PC) or “big endian” (Mac) byte order. Lossless LZW compression may be acceptable in some cases, but it adds a layer of mathematical processing that can introduce errors during compression or decompression. Lossy compression within TIFF images, such as JPEG compression, is not acceptable for archival master files. In addition, all files must pass JHOVE[1] format validation.
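As a quick pre-check before running full format validation, the byte order and compression scheme can be read directly from the TIFF header. The following is a minimal sketch using only the Python standard library; the function name and return structure are hypothetical, it inspects only the first image file directory, and it is not a substitute for JHOVE validation.

```python
import struct

COMPRESSION_TAG = 259  # TIFF 6.0 "Compression" tag
COMPRESSION_NAMES = {1: "uncompressed", 5: "LZW", 6: "JPEG (old-style)", 7: "JPEG"}


def inspect_tiff_header(data: bytes) -> dict:
    """Report byte order and compression from the first IFD of a TIFF file."""
    if data[:2] == b"II":
        endian, order = "<", "little endian"
    elif data[:2] == b"MM":
        endian, order = ">", "big endian"
    else:
        raise ValueError("not a TIFF file: unknown byte-order mark")

    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file: magic number is not 42")

    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    compression = 1  # TIFF 6.0 default when the tag is absent
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, _field_type, _count = struct.unpack(endian + "HHI", data[start:start + 8])
        if tag == COMPRESSION_TAG:
            # A single SHORT value is stored in the first two bytes of the value field.
            (compression,) = struct.unpack(endian + "H", data[start + 8:start + 10])

    return {
        "byte_order": order,
        "compression": COMPRESSION_NAMES.get(compression, f"other ({compression})"),
        "uncompressed": compression == 1,
    }


# Tiny in-memory header for illustration: little endian, one IFD entry, LZW.
sample = (b"II" + struct.pack("<HI", 42, 8) + struct.pack("<H", 1)
          + struct.pack("<HHIHH", COMPRESSION_TAG, 3, 1, 5, 0) + struct.pack("<I", 0))
print(inspect_tiff_header(sample))
```

For a real file, the bytes would be read with `open(path, "rb").read()`. A master file that reports anything other than "uncompressed" would need review against the exceptions described above.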


Resolution: Image capture resolution is measured in pixels per inch (PPI). This should be a true optical resolution; the lens and pixel array in the capture device should be capable of creating an image file to the required resolution specification without interpolation.
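To judge whether a capture device can meet a resolution specification optically, it can help to work out the pixel array the specification implies. The sketch below is illustrative only: the function names are hypothetical, the page size is an example, and the byte counts cover raw pixel data without TIFF headers, metadata, or the background border.

```python
def pixel_dimensions(width_in: float, height_in: float, ppi: int) -> tuple[int, int]:
    """Pixels needed to cover a document of the given size at the given PPI."""
    return round(width_in * ppi), round(height_in * ppi)


def uncompressed_bytes(width_px: int, height_px: int, bit_depth: int) -> int:
    """Raw pixel data size for an uncompressed image (no headers or metadata)."""
    return width_px * height_px * bit_depth // 8


# Example: an 8.5 x 11-inch page at the two ends of the 300-400 PPI range.
for ppi in (300, 400):
    w, h = pixel_dimensions(8.5, 11, ppi)
    gray = uncompressed_bytes(w, h, 8)    # 8-bit grayscale
    color = uncompressed_bytes(w, h, 24)  # 24-bit RGB
    print(f"{ppi} PPI: {w} x {h} px, {gray:,} bytes gray, {color:,} bytes color")
```

If the sensor cannot deliver these pixel counts across the capture area, the device would have to interpolate to reach the stated PPI, which the specification rules out.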


Bit Depth:

Color (RGB):

  • Images are captured natively in 24-bit RGB RAW or TIFF format and Master Files are exported as 24-bit TIFF files with the “Adobe RGB (1998)” color profile embedded.

Grayscale:

  • Master Files are saved in 8-bit mode with the “Gray Gamma 2.2” profile embedded.

Editing:

  • All images should be cropped to include the entire item/object, leaving a small border of background visible on all sides to show that the full page or object was captured. Black borders are preferred, but there are exceptions, such as dark originals, outsourced projects, or image file collections created by a third party.
  • While sophisticated image viewers can easily rotate an image, the master image file should be oriented properly. For bound materials with pages of varying orientation, default to the orientation of the binding.


[1] JSTOR/Harvard Object Validation Environment (JHOVE). http://jhove.openpreservation.org/