PACON Digitization Reference Document
version 0.0.1
Retrieving Page Images
Digitization Specifications and Standards
Directory and File Naming Structure
METS XML file profile
ALTO File Profile
JPEG2000 Image Profile
PDF Specifications
Quality Control Guidelines
Definitions
Quality Standards
Significant and Non-Significant Errors
Quality Evaluation Guidelines
Evaluation of Zoning and Segmentation
Evaluation of Headlines
Evaluation of Body Text
Evaluation of Issue Metadata
Evaluation of XML
Retrieving Page Images
TO BE DETERMINED
Digitization Specifications and Standards
Directory and File Naming Structure
Each newspaper shall have an acronym:
Newspaper title | Acronym
Princeton Herald | princetonherald
Princeton Packet | princetonpacket
Princeton Recollector | princetonrecollector
Woman's Newspaper | princetonwomans
The directory structure and file names shall be as follows:
<acronym>/ccyy/mm/dd/<acronym>_<date>_<number>.jp2
<acronym>/ccyy/mm/dd/<acronym>_<date>_<number>.xml
<acronym>/ccyy/mm/dd/<acronym>_<date>_<number>.pdf
.
.
.
<acronym>/ccyy/mm/dd/<acronym>_<date>_mets.xml
<acronym>/ccyy/mm/dd/<acronym>_<date>.pdf
where <acronym> is the acronym for the newspaper title, <date> is the issue date in CCYYMMDD form, and <number> is a four-digit image/page sequence number. All directory and file names shall use lowercase characters. For example, a hypothetical two-page issue of the Princeton Herald newspaper would consist of the following files:
princetonherald/1874/01/01/princetonherald_18740101_0001.jp2
princetonherald/1874/01/01/princetonherald_18740101_0001.pdf
princetonherald/1874/01/01/princetonherald_18740101_0001.xml
princetonherald/1874/01/01/princetonherald_18740101_0002.jp2
princetonherald/1874/01/01/princetonherald_18740101_0002.pdf
princetonherald/1874/01/01/princetonherald_18740101_0002.xml
princetonherald/1874/01/01/princetonherald_18740101_mets.xml
princetonherald/1874/01/01/princetonherald_18740101.pdf
where
princetonherald is the acronym for the Princeton Herald newspaper,
1874 is the year of publication,
01 is the month of publication (possible values are 01 to 12),
01 is the day of publication (possible values are 01 to 31),
0001 is an image number (possible values are four-digit numbers),
.jp2 is an extension indicating the image is a JPEG2000 file.
.xml is an extension indicating the file is an ALTO file.
_mets.xml indicates the file is an issue METS file.
.pdf with no image number indicates that the file is an issue PDF file.
.pdf with an image number indicates that the file is a page PDF file.