PACON Digitization Reference Document

version 0.0.1
Retrieving Page Images
Digitization Specifications and Standards
Directory and File Naming Structure
METS XML file profile
ALTO File Profile
JPEG2000 Image Profile
PDF Specifications
Quality Control Guidelines
Definitions
Quality Standards
Significant and Non-Significant Errors
Quality Evaluation Guidelines
Evaluation of Zoning and Segmentation
Evaluation of Headlines
Evaluation of Body Text
Evaluation of Issue Metadata
Evaluation of XML

 

Retrieving Page Images

TO BE DETERMINED

Digitization Specifications and Standards

Directory and File Naming Structure

Each newspaper shall have an acronym:

Newspaper title          Acronym
Princeton Herald         princetonherald
Princeton Packet         princetonpacket
Princeton Recollector    princetonrecollector
Woman's Newspaper        princetonwomans



The directory structure and file names shall be as follows:
<acronym>/ccyy/mm/dd/<acronym>_<date>_<number>.jp2
<acronym>/ccyy/mm/dd/<acronym>_<date>_<number>.xml
<acronym>/ccyy/mm/dd/<acronym>_<date>_<number>.pdf
.
.
.
<acronym>/ccyy/mm/dd/<acronym>_<date>_mets.xml
<acronym>/ccyy/mm/dd/<acronym>_<date>.pdf
where <acronym> is the acronym for the newspaper title, <date> is the issue date in CCYYMMDD form, and <number> is a four-digit image/page sequence number. All directory and file names shall use lowercase characters. For example, a hypothetical two-page issue of the Princeton Herald would consist of the following files:
princetonherald/1874/01/01/princetonherald_18740101_0001.jp2
princetonherald/1874/01/01/princetonherald_18740101_0001.pdf
princetonherald/1874/01/01/princetonherald_18740101_0001.xml
princetonherald/1874/01/01/princetonherald_18740101_0002.jp2
princetonherald/1874/01/01/princetonherald_18740101_0002.pdf
princetonherald/1874/01/01/princetonherald_18740101_0002.xml
princetonherald/1874/01/01/princetonherald_18740101_mets.xml
princetonherald/1874/01/01/princetonherald_18740101.pdf
where

  • princetonherald is the acronym for the Princeton Herald newspaper,

  • 1874 is the year of publication,

  • the first 01 is the month of publication (possible values are 01 to 12),

  • the second 01 is the day of publication (possible values are 01 to 31),

  • 0001 is the image/page sequence number (always four digits),

  • .jp2 is an extension indicating that the file is a JPEG2000 image,

  • .xml with an image number indicates that the file is an ALTO file,

  • _mets.xml indicates that the file is an issue METS file,

  • .pdf with no image number indicates that the file is an issue PDF file,

  • .pdf with an image number indicates that the file is a page PDF file.
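
The naming rules above can be sketched in Python as a way to generate or check the expected file list for an issue. This is an illustrative sketch only, not part of the specification: the names issue_files, is_valid_path and PATH_RE are hypothetical, and the code assumes that page numbering starts at 0001 and is contiguous, and that acronyms contain only lowercase letters and digits.

```python
import re
from datetime import date
from typing import List

# Hypothetical pattern for one delivered file; the acronym character set
# (lowercase letters and digits) is an assumption, not stated in the spec.
PATH_RE = re.compile(
    r"^(?P<acronym>[a-z0-9]+)/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/"
    r"(?P=acronym)_(?P=year)(?P=month)(?P=day)"
    r"(?:_\d{4}\.(?:jp2|pdf|xml)|_mets\.xml|\.pdf)$"
)


def issue_files(acronym: str, issue_date: date, page_count: int) -> List[str]:
    """List the files expected for one issue with page_count pages."""
    if not 1 <= page_count <= 9999:
        raise ValueError("page_count must fit in a four-digit sequence number")
    stamp = f"{issue_date.year:04d}{issue_date.month:02d}{issue_date.day:02d}"
    folder = f"{acronym}/{stamp[:4]}/{stamp[4:6]}/{stamp[6:]}"
    base = f"{folder}/{acronym}_{stamp}"
    files = []
    # Assumption: pages are numbered 0001, 0002, ... with no gaps.
    for page in range(1, page_count + 1):
        for ext in ("jp2", "pdf", "xml"):  # page image, page PDF, ALTO
            files.append(f"{base}_{page:04d}.{ext}")
    files.append(f"{base}_mets.xml")  # issue METS file
    files.append(f"{base}.pdf")       # issue PDF file
    return files


def is_valid_path(path: str) -> bool:
    """Check that a path follows the layout and names a real calendar date."""
    match = PATH_RE.match(path)
    if match is None:
        return False
    try:
        date(int(match["year"]), int(match["month"]), int(match["day"]))
    except ValueError:
        return False
    return True
```

Calling issue_files("princetonherald", date(1874, 1, 1), 2) yields the eight paths of the two-page example above. The pattern uses backreferences so that the date and acronym in the file name must agree with the directory they sit in.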

 

METS XML file profile