PACON Digitization Reference Document

version 0.0.1
Retrieving Page Images
Digitization Specifications and Standards
Directory and File Naming Structure
METS XML file profile
ALTO File Profile
JPEG2000 Image Profile
PDF Specifications
Quality Control Guidelines
Definitions
Quality Standards
Significant and Non-Significant Errors
Quality Evaluation Guidelines
Evaluation of Zoning and Segmentation
Evaluation of Headlines
Evaluation of Body Text
Evaluation of Issue Metadata
Evaluation of XML


Retrieving Page Images

TO BE DETERMINED

Digitization Specifications and Standards

Directory and File Naming Structure

Each newspaper shall have an acronym:

  • Princeton Herald: princetonherald
  • Princeton Packet: princetonpacket
  • Princeton Recollector: princetonrecollector
  • Woman's Newspaper: princetonwomans



The directory structure and file names shall be as follows:
<acronym>/ccyy/mm/dd/<acronym>_<date>_<number>.jp2
<acronym>/ccyy/mm/dd/<acronym>_<date>_<number>.xml
<acronym>/ccyy/mm/dd/<acronym>_<date>_<number>.pdf
.
.
.
<acronym>/ccyy/mm/dd/<acronym>_<date>_mets.xml
<acronym>/ccyy/mm/dd/<acronym>_<date>.pdf
where <acronym> is an acronym for the newspaper title, <date> is CCYYMMDD, and <number> is a four-digit image/page sequence number. All directory and file names shall use lowercase characters. For example, a hypothetical two-page issue of the Princeton Herald newspaper would consist of the following files:
princetonherald/1874/01/01/princetonherald_18740101_0001.jp2
princetonherald/1874/01/01/princetonherald_18740101_0001.pdf
princetonherald/1874/01/01/princetonherald_18740101_0001.xml
princetonherald/1874/01/01/princetonherald_18740101_0002.jp2
princetonherald/1874/01/01/princetonherald_18740101_0002.pdf
princetonherald/1874/01/01/princetonherald_18740101_0002.xml
princetonherald/1874/01/01/princetonherald_18740101_mets.xml
princetonherald/1874/01/01/princetonherald_18740101.pdf
where

  • princetonherald is the acronym for the Princeton Herald newspaper,
  • 1874 is the year of publication,
  • 01 is the month of publication (possible values are 01 to 12),
  • 01 is the day of publication (possible values are 01 to 31),
  • 0001 is an image number (possible values are four-digit numbers),
  • .jp2 is an extension indicating the image is a JPEG2000 file.
  • .xml is an extension indicating the file is an ALTO file.
  • _mets.xml indicates the file is an issue METS file.
  • .pdf with no image number indicates that the file is an issue PDF file.
  • .pdf with an image number indicates that the file is a page PDF file.
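The naming rules above can be sketched in a few lines of Python. This is only an illustration: the function names (issue_dir, file_stem, page_file, issue_files) are hypothetical, and numbering pages from 0001 is an assumption taken from the example listing, not a stated requirement.

```python
from datetime import date

# Per-page file types named in the structure above.
PAGE_EXTENSIONS = ("jp2", "pdf", "xml")


def issue_dir(acronym: str, issued: date) -> str:
    """Return the <acronym>/ccyy/mm/dd directory for an issue."""
    return f"{acronym.lower()}/{issued.year:04d}/{issued.month:02d}/{issued.day:02d}"


def file_stem(acronym: str, issued: date) -> str:
    """Return the <acronym>_<date> prefix, with <date> as CCYYMMDD."""
    return f"{acronym.lower()}_{issued.year:04d}{issued.month:02d}{issued.day:02d}"


def page_file(acronym: str, issued: date, number: int, ext: str) -> str:
    """Return the path of one per-page file (JPEG2000, ALTO, or page PDF)."""
    if not 0 <= number <= 9999:
        raise ValueError("sequence number must fit in four digits")
    if ext not in PAGE_EXTENSIONS:
        raise ValueError(f"unexpected extension: {ext}")
    return f"{issue_dir(acronym, issued)}/{file_stem(acronym, issued)}_{number:04d}.{ext}"


def issue_files(acronym: str, issued: date, page_count: int) -> list:
    """List every file expected for an issue, pages first, then issue-level files."""
    # Assumption: page sequence numbers start at 1, as in the example listing.
    files = [
        page_file(acronym, issued, number, ext)
        for number in range(1, page_count + 1)
        for ext in PAGE_EXTENSIONS
    ]
    prefix = f"{issue_dir(acronym, issued)}/{file_stem(acronym, issued)}"
    files.append(f"{prefix}_mets.xml")  # issue METS file
    files.append(f"{prefix}.pdf")       # issue PDF file
    return files
```

Calling issue_files("princetonherald", date(1874, 1, 1), 2) yields eight paths, which can be compared against a delivered directory listing to spot missing or misnamed files.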


METS XML file profile

METS files shall be created and delivered for each newspaper issue.

  1. Metadata for each newspaper issue shall be encoded in one METS file conforming to METS version 1.11 (cf. http://www.loc.gov/standards/mets/version111/mets.xsd). For additional information, see also http://www.loc.gov/standards/mets/.
  2. The following publication data shall be encoded as MODS XML descriptive metadata in the METS files:
    1. Publisher name
    2. Publication title
    3. Issue language
    4. Issue volume and number
    5. Page sequence number and printed page number
    6. Corrected newspaper article titles
    7. Article language
    8. Article continuation links
  3. Publication title, publisher name (if available), issue language, issue volume and number (if applicable), and page sequence and printed page number shall all be encoded to 99.95% accuracy (5 or fewer errors in 10,000 characters).
  4. Article titles (headlines) shall be correct to 99.95% accuracy (5 or fewer errors in 10,000 characters).
  5. If an article has continuations, all continuations shall be linked. Linking shall be done with 95% accuracy.
  6. Information about missing pages and printing anomalies shall be encoded in the METS files.
  7. Article type information shall be encoded in the METS files. Articles shall have one of the following types.
    1. Article : Covers all types of articles, including local, national, or international news stories, opinion or commentary articles, obituaries, as well as editorials, usually, but not exclusively, written by the editor.
    2. Illustration: Photographs, illustrations, and graphics of all types, including their accompanying captions. Included here are illustrations that do not have an accompanying article. Includes maps, charts, drawings, and info-graphics. Political and social commentary cartoons that reflect a current situation are to be included here; these usually (but not always) appear on the same page as the editorials.
  8. The following will be grouped together in a single METS div element labeled "section" (as in newspaper "section"):
    1. Advertisements: Ranges from items for sale to situations vacant and public notices. Format is usually different from news articles. Includes classified advertisements, shipping notices and schedules, and notices of all types other than death notices. Covers all advertisements, including those under the header "Notices", "Appointments".

ALTO File Profile

ALTO files shall be created and delivered for each newspaper page.

  1. The OCR text from each newspaper page image shall be encoded using ALTO schema version 3.1 (https://www.loc.gov/standards/alto/v3/alto-3-1.xsd). For additional information see http://www.loc.gov/standards/alto.
  2. The OCR text shall be in the natural reading order for the language that has been OCR'd.
  3. Point size and font data to at least the word level shall be included in the ALTO files.
  4. ALTO files shall have bounding box coordinates to at least the word level.
  5. Non-rectangular blocks shall not be used; some illustrations may be formatted as "tight" in the document.

JPEG2000 Image Profile

  1. JPEG2000 images shall be produced in general accordance with the JPEG 2000 Profile for the US Library of Congress's National Digital Newspaper Program (NDNP) (http://www.loc.gov/ndnp/pdf/NDNP_JP2HistNewsProfile.pdf) and Appendix B of the US Library of Congress's NDNP Technical Guidelines (http://www.loc.gov/ndnp/pdf/NDNP_200709TechNotes.pdf).
  2. The JPEG 2000 files will conform to the JP2 file format as specified in ISO/IEC 15444-1:2000 (i.e., JPEG 2000, Part 1).
  3. The JPEG 2000 files shall be prepared from the source images after any image processing or clean-up is performed. The JPEG 2000 files shall correspond with the image that is used for OCR.
  4. The JPEG 2000 files shall have a ".jp2" extension.
  5. The JPEG 2000 files' image X origin, image Y origin, tile X origin, and tile Y origin shall be 0.
  6. The JPEG 2000 files shall contain only one component. The bit depth of that component shall be the same as the source image file: 1 bit for black-and-white source images, 8 bits for grayscale source images, and 24 bits for 24-bit color source images.
  7. The JPEG 2000 files' tile headers shall not contain coding style default, coding style component, quantization default, and quantization component marker segments.
  8. The JPEG 2000 file progression order shall be RLCP (resolution, layer, component, position) or RLPC.
  9. The JPEG 2000 files shall have 6 decomposition levels.
  10. The JPEG 2000 files shall have 25 quality layers. The bits per pixel for each quality layer will be: 1, 0.84, 0.7, 0.6, 0.5, 0.4, 0.35, 0.3, 0.25, 0.21, 0.18, 0.15, 0.125, 0.1, 0.088, 0.07, 0.0625, 0.05, 0.04419, 0.03716, 0.03125, 0.025, 0.0221, 0.018, 0.015625.
  11. The JPEG 2000 file code-block sizes will be 64x64. The JPEG 2000 file code-block styles shall be bypass.
  12. Two compression schemes shall be used for the JPEG 2000 files. For 1-bit source image files, CCITT Group 4 compression (lossless) shall be used. For all other bit depths, the 9-7 irreversible filter shall be used.
  13. The JPEG 2000 files shall use 1024x1024 tiles.
  14. The JPEG 2000 files' color specification must be either the monochrome (grayscale) enumerated color space or the Monochrome Input restricted ICC profile.
  15. The JPEG 2000 files shall not contain regions of interest or precincts.
  16. The JPEG 2000 files shall not contain intellectual property rights information.
  17. The JPEG 2000 file will contain an XML Box that conforms with the following:



<?xml version="1.0"?><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="urn:[princetonherald]:newspaper:page://#Date of publication in CCYY-MM-DD#/#Edition order#/#Page sequence number#" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>image/jp2</dc:format>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="en">#Newspaper title#. #Date of publication in CCYY-MM-DD# [p #page label#].
</rdf:li>
</rdf:Alt>
</dc:title>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="en">Page from #Newspaper title# (newspaper). Prepared on behalf of the Princeton Area Online Newspaper Project.
</rdf:li>
</rdf:Alt>
</dc:description>
<dc:date>
<rdf:Seq>
<rdf:li xml:lang="x-default">#Date of publication in CCYY-MM-DD#</rdf:li>
</rdf:Seq>
</dc:date>
<dc:type>
<rdf:Bag>
<rdf:li xml:lang="en">text</rdf:li><rdf:li xml:lang="en">newspaper</rdf:li>
</rdf:Bag>
</dc:type>
</rdf:Description>
</rdf:RDF>
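
The #...# placeholders in the XML Box above have to be replaced with per-page values. The following is a minimal sketch of one way to do that for the dc:title entry, assuming Python's standard library; the helper names (fill, title_element) and the sample values are hypothetical and not part of this specification. Building the element with ElementTree rather than by string concatenation means characters such as "&" in a title are escaped automatically.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# A placeholder is any run of text between two '#' characters on one line.
PLACEHOLDER = re.compile(r"#([^#\n]+)#")

# Text content of the dc:title entry, copied from the template above.
TITLE_TEMPLATE = "#Newspaper title#. #Date of publication in CCYY-MM-DD# [p #page label#]."


def fill(template: str, values: dict) -> str:
    """Replace every #name# placeholder; fail loudly if a value is missing."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder #{name}#")
        return values[name]
    return PLACEHOLDER.sub(substitute, template)


def title_element(values: dict) -> ET.Element:
    """Build the <rdf:li xml:lang="en"> element that holds the page title."""
    li = ET.Element(f"{{{RDF_NS}}}li")
    li.set(f"{{{XML_NS}}}lang", "en")
    li.text = fill(TITLE_TEMPLATE, values)
    return li


# Hypothetical sample values for a single page.
sample = {
    "Newspaper title": "Princeton Herald",
    "Date of publication in CCYY-MM-DD": "1874-01-01",
    "page label": "1",
}
print(fill(TITLE_TEMPLATE, sample))
```

Raising an error on a missing value keeps unfilled placeholders from silently reaching the delivered files.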


PDF Specifications

PDF files will be delivered for each issue.

  1. PDF files shall have a ".pdf" extension.
  2. Each PDF file shall contain only one image of one newspaper page, with text "behind" the image.
  3. The images contained in the PDF files shall be 1-bit bitonal images derived from the source images, down-sampled to 150 dpi and JPEG encoded, using a medium (or 40) quality setting.
  4. Only the 14 standard Type 1 fonts may be used. These fonts will not be embedded.
  5. The page may have a page label. The page label shall be the page number as it appears in the image.
  6. The PDF text streams will be Flate encoded.
  7. The PDF files shall not contain any links, named destinations, comments, forms, JavaScript actions, external cross references, alternate images, embedded thumbnails, annotations, or private data.
  8. The PDF files shall not be tagged. (Note: PDFs of newspapers tagged with Adobe Acrobat's automated tagging are generally inaccurate and not useful for Read Aloud functionality.)
  9. The PDF files shall open to Fit Page sizing.
  10. The PDF files shall open to single page layout.
  11. The PDF files shall open with neither document outline nor thumbnail images available.
  12. The PDF files shall open with the tool bar, menu bar, and user interface elements visible.
  13. The PDF files shall not open centered in the screen.
  14. The PDF files shall not be encrypted, digitally signed, or have any security.
  15. It is recommended that the PDF files be linearized (also known as "Fast Web View").
  16. The PDF files shall be compatible with Acrobat 5.0 or later (PDF version 1.4).
  17. Insofar as this does not conflict with any of the other above requirements and is appropriate, the PDF files shall conform to PDF/A (ISO 19005-1).
  18. The PDF files' XMP metadata shall conform to the following:



<rdf:Description rdf:about="#The appropriate uuid#" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>application/pdf</dc:format>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="en">#Newspaper title#. (Princeton). #Date of publication in CCYY-MM-DD# [p #page label#].</rdf:li>
</rdf:Alt>
</dc:title>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="en">Page from #Newspaper title# (newspaper). Prepared on behalf of the Princeton Area Online Newspaper Project.</rdf:li>
</rdf:Alt>
</dc:description>
<dc:date>
<rdf:Seq>
<rdf:li xml:lang="x-default">#Date of publication in CCYY-MM-DD#</rdf:li>
</rdf:Seq>
</dc:date>
<dc:type>
<rdf:Bag>
<rdf:li xml:lang="en">text</rdf:li>
<rdf:li xml:lang="en">newspaper</rdf:li>
</rdf:Bag>
</dc:type>
</rdf:Description>


Quality Control Guidelines

Definitions

Article Segmentation

The division of an article or other text across columns or pages; in digitization, the process of analyzing these divisions and encoding them in metadata.

Batch

A segment of the total run of newspapers processed by VENDOR as a unit. Batches are identified and enumerated on tracking sheets shared by VENDOR and Princeton.

Entry Operator

VENDOR staff member responsible for converting raw image files into required output files using docWorks.

Issue Metadata

The title, volume, issue, date, and page number order of an issue.

Quality Criteria

The critical requirements, metrics, and standards as described here and in the contract, including accuracy rates and criteria for acceptance or rejection of work produced.

Quality Control (QC)

The procedures used by VENDOR to ensure compliance with the established quality standards.

Quality Evaluation and Acceptance (QEA)

The procedures used by Princeton University to apply acceptance criteria to work produced by VENDOR.

QC Operator

A member of VENDOR's quality control team.

Significant Error

A discrepancy between the source copy as represented on a page image and the computer-readable text output produced by VENDOR that is counted when calculating error rates for compliance with quality standards.

Quality Standards

  1. Headlines and bylines will be 99.95% free of significant errors as defined above.
  2. Issue metadata will be 99.95% free of significant errors.
  3. Body text (defined as text occurring in articles, editorials and other printed material but not advertisements) will be 90% free of significant errors.
  4. Article segmentation will be 99% free of significant errors.
  5. The XML files produced during digitization will be well-formed and valid against the following schemas where appropriate:
    1. http://schema.ccs-gmbh.com/docworks/alto-1-2.xsd
    2. http://schema.ccs-gmbh.com/docworks/mets-docWORKS.xsd
    3. http://www.w3.org/TR/xlink
    4. http://www.loc.gov/mix
    5. http://www.w3.org/1999/02/22-rdf-syntax-ns#
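
As a rough illustration of the "well-formed" half of this requirement, the sketch below uses only Python's standard library. It does not validate against any of the schemas listed above; schema validation would need a separate tool. The function name well_formedness_error is hypothetical.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text: str):
    """Return None if xml_text parses as XML, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# A properly nested document passes; an unclosed element is reported.
print(well_formedness_error("<mets><div/></mets>"))
print(well_formedness_error("<mets><div></mets>"))
```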

Significant and Non-Significant Errors

Within guidelines of the Quality Standards above:
The following discrepancies will be considered significant:

  • incorrect characters (e.g., c for e)
  • transposed characters (e.g., teh for the)
  • missing characters (e.g., tht for that)
  • inserted characters (e.g., c a t for cat)

The following discrepancies will not be considered significant:

  • differences in capitalization (gOlD is equivalent to Gold)
  • padded spacing (more than one whitespace character when original has only one)
  • line breaks
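
One possible way to count significant errors under these rules is sketched below. It rests on two assumptions not stated in this document: non-significant differences are removed by lowercasing and collapsing whitespace before comparison, and a transposed pair of characters counts as a single error (the optimal string alignment variant of edit distance). The function names are hypothetical.

```python
def normalize(text: str) -> str:
    """Remove non-significant differences: case, padded spacing, line breaks."""
    return " ".join(text.lower().split())


def significant_errors(source: str, output: str) -> int:
    """Count substitutions, insertions, deletions, and adjacent transpositions."""
    a, b = normalize(source), normalize(output)
    # dist[i][j] holds the number of edits needed to turn a[:i] into b[:j].
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i
    for j in range(len(b) + 1):
        dist[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # missing character
                dist[i][j - 1] + 1,         # inserted character
                dist[i - 1][j - 1] + cost,  # incorrect character (or match)
            )
            # Assumption: two swapped neighbours count as one error.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[len(a)][len(b)]
```

With this sketch, "teh" for "the" and "tht" for "that" each count as one error, "c a t" for "cat" counts as two (two inserted spaces), and "gOlD" for "Gold" counts as none.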

Quality Evaluation Guidelines

When VENDOR has completed a batch, it shall submit output to the Princeton Area Online Newspaper Project for ingestion into Princeton's Veridian installation.
Upon receipt of a batch, PACON will have 30 days to accept or reject it. PACON will employ the following procedures to evaluate a batch.

Evaluation of Zoning and Segmentation

PACON will examine five (5) issues for zoning and segmentation errors. A zoning or segmentation error shall be identified using the following criteria:

  1. If an article is missing one or more zones, that shall count as an error.
  2. If an article contains one or more zones that are not part of the article in the original, it shall be counted as an error.
  3. If the sequence of zoned regions is not in the proper order, it shall count as an error.

PACON will record the results of these examinations in its report. If significant errors are discovered, Princeton and VENDOR will confer.

Evaluation of Headlines

PACON will draw a random sample of 350 headlines from the batch and measure its accuracy by comparing each headline with its original and recording all significant errors. The accuracy rate of the sample should meet or exceed 99.95%.
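
The arithmetic behind the 99.95% threshold can be sketched as follows, assuming accuracy is measured per character across the whole sample (consistent with the "errors in 10,000 characters" wording in the METS profile). The function names and the 14,000-character sample size are hypothetical. Exact fractions are used so that a sample sitting exactly on the threshold is not misjudged by floating-point rounding.

```python
from fractions import Fraction

REQUIRED_ACCURACY = Fraction(9995, 10000)  # 99.95%


def accuracy(total_characters: int, errors: int) -> Fraction:
    """Share of characters in the sample that are free of significant errors."""
    if total_characters <= 0:
        raise ValueError("sample must contain at least one character")
    return 1 - Fraction(errors, total_characters)


def max_allowed_errors(total_characters: int, required: Fraction = REQUIRED_ACCURACY) -> int:
    """Largest error count that still meets the required accuracy."""
    return int((1 - required) * total_characters)


def sample_passes(total_characters: int, errors: int, required: Fraction = REQUIRED_ACCURACY) -> bool:
    """True when the sample meets or exceeds the required accuracy."""
    return accuracy(total_characters, errors) >= required


# Hypothetical sample: 350 headlines totalling 14,000 characters.
print(max_allowed_errors(14000))
```

For the hypothetical 14,000-character sample this allows at most 7 significant errors; the same calculation for a 2,000-character sample allows 1.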

Evaluation of Body Text

PACON will examine the fifth article in every sample (as above) for errors in the body text, counting the number of significant errors that occur, where significance is as defined above.
We shall assume an average article length of 500 words. For an accuracy rate of 90% or better, an article should therefore contain no more than 50 significant errors (10% of 500).
PACON and VENDOR shall agree that if more than 5% of the samples from a given batch fail, the batch is not acceptable.
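
The body-text rule above reduces to two integer comparisons. The sketch below assumes the 500-word article length and the 90% and 5% thresholds given in this section; the function names and the sample error counts are hypothetical.

```python
def article_passes(errors: int, words: int = 500, required_accuracy_percent: int = 90) -> bool:
    """An article passes if its significant errors stay within the allowance."""
    allowed = words * (100 - required_accuracy_percent) // 100  # 50 for 500 words
    return errors <= allowed


def batch_acceptable(error_counts: list, max_failure_percent: int = 5) -> bool:
    """A batch is not acceptable if more than 5% of sampled articles fail."""
    if not error_counts:
        raise ValueError("at least one sampled article is required")
    failures = sum(1 for errors in error_counts if not article_passes(errors))
    # Integer comparison avoids rounding issues right at the threshold.
    return failures * 100 <= max_failure_percent * len(error_counts)


# Hypothetical batch: 20 sampled articles, one of which has 60 errors.
print(batch_acceptable([10] * 19 + [60]))
```

In the hypothetical batch, one failure in 20 samples is exactly 5%, which is not more than 5%, so the batch is still acceptable; a second failing article would tip it over.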

Evaluation of Issue Metadata

PACON will evaluate one random 2,000-character sample of issue metadata. If the sample contains more than 1 error (i.e., it falls below 99.95%), the issue metadata fails and must be rechecked and resubmitted by VENDOR.

Evaluation of XML

PACON will not verify XML validity, as docWorks automatically validates the files it produces against the appropriate schemas.