Chinese Rare Books Project

The purpose of this project is to add and improve the records for Princeton's Chinese rare books, especially those published from 1796-1911 (a similar project from a few years ago focused on earlier works). The project has several phases. Phases 1 and 2 are based on print versions of the Gest collection catalog compiled by Qu Wanli and Chang Bide (which have been scanned and OCRed). Phase 1 enhances records from these sources that are already in Alma. Phase 2 involves adding those records that were not previously in Alma. Each of these phases consists of about 800 records. Future phases of the project will involve Chinese rare books in Cotsen and Marquand, as well as others in the EAL that were not originally listed in the print catalogs.

In both Phases 1 and 2, records are generated or enhanced offline using automated scripts created by Tom Ventimiglia. These scripts combine the current catalog record (if one exists) with outside information sources, such as the print sources mentioned above and other notes from CHRB project staff. They also add standard fields such as 33x fields, adjust punctuation and capitalization, add romanization, and make other changes so that the records conform to current standards. It is understood that not everything can be automated and the automation itself may introduce some errors. Thus, after the scripts are run, the records are uploaded to Alma, where Shuwen Cao and her student assistants manually review each record and make additional adjustments as needed. The outside source data for the records (and any notes from Tom) are stored in fields 997, 998, and 999 for reference. Records are uploaded in batches of 100 each, and the next batch is prepared when the previous batch is close to complete.
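As a rough illustration of the kind of automated enhancement described above, the following sketch adds any missing 336/337/338 fields to a record. It is not the project's actual script: the in-memory record shape (a list of dicts) and the function name are hypothetical, and the values are the usual RDA terms for a print volume, which is an assumption about what the project applies.

```python
# Hypothetical record shape: a list of fields, each a dict with "tag",
# "ind1", "ind2", and "subfields" (a list of (code, value) pairs).
# Assumption: every record describes a print volume.
RDA_33X = {
    "336": [("a", "text"), ("b", "txt"), ("2", "rdacontent")],
    "337": [("a", "unmediated"), ("b", "n"), ("2", "rdamedia")],
    "338": [("a", "volume"), ("b", "nc"), ("2", "rdacarrier")],
}


def add_33x_fields(fields):
    """Return a new field list with any missing 336/337/338 fields added."""
    present = {f["tag"] for f in fields}
    result = list(fields)
    for tag, subfields in RDA_33X.items():
        if tag not in present:
            result.append(
                {"tag": tag, "ind1": " ", "ind2": " ", "subfields": list(subfields)}
            )
    # sorted() is stable, so fields sharing a tag keep their relative order.
    return sorted(result, key=lambda f: f["tag"])
```

Fields that already exist are left alone, so running the function twice on the same record gives the same result as running it once.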

For Phase 1, most of the cataloging work is done in the sandbox, and the records are synced with production once complete. This ensures that incomplete records are not visible in production. The specific workflow for Phase 1 is:

Tom exports a batch of records from Alma production.
Tom runs the offline enhancement scripts on the records.
Tom uploads the records to the Alma sandbox (using the MarcEdit Alma integration), storing them in a set named with the phase and batch number.
Shuwen reviews each record in the set. If a record requires additional review, she makes notes in field 995. Once a record is ready for production, she runs a normalization process to add field 040$e with the code "cgcrb", which indicates that the record was cataloged according to RLG's "Cataloging guidelines for creating Chinese rare book records in machine-readable form" (http://www.eastasianlib.org/ctp/webinars/ChineseRareBook/CRBP_guidelines.pdf).
Once Shuwen is done with the batch, Tom filters out records requiring additional review using indication rules that look at field 995.
For the records that are complete, Tom exports them from the sandbox, removes the 99x fields related to this project, then uploads them to production using the MarcEdit Alma integration. Since the MMS IDs remain the same throughout the workflow, these records are overlaid on the existing records in production.
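The last two steps can be sketched offline as follows. This is only an illustration: in practice the filtering is done with Alma indication rules, and the record shape, function names, and the exact list of project 99x tags here are assumptions. The sketch also assumes that the mere presence of a 995 field means a record still needs review.

```python
# Assumption: these are the project-specific 99x fields to strip.
PROJECT_TAGS = {"995", "997", "998", "999"}


def needs_review(fields):
    """Assume any 995 field means the record is flagged for further review."""
    return any(f["tag"] == "995" for f in fields)


def prepare_for_production(records):
    """Split records (MMS ID -> field list) into ready and held-back groups.

    Ready records are returned with the project 99x fields removed.
    Held-back records are returned as a list of MMS IDs and left untouched.
    """
    ready, held_back = {}, []
    for mms_id, fields in records.items():
        if needs_review(fields):
            held_back.append(mms_id)
        else:
            ready[mms_id] = [f for f in fields if f["tag"] not in PROJECT_TAGS]
    return ready, held_back
```

Because the MMS IDs are used as keys and never changed, the cleaned records can still be matched to their production counterparts.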

For Phase 2, it is not necessary to use the sandbox, since there are no previous versions of the records in production. Records are first loaded into WorldCat and revised there, then the final version is exported to Alma. The specific workflow is:

Tom generates a batch of records offline. The source data and any notes for Shuwen are stored in fields 997, 998, and 999.
Tom sends the batch of records to Shuwen, who imports them to a local save file in OCLC Connexion (a separate save file is created specifically for this batch). Shuwen makes the needed revisions in this file and publishes them to WorldCat. She then exports the completed records to an external file and sends them back to Tom. Some records may not be published to WorldCat right away because further research is needed. Shuwen lets Tom know about such records so he can delete them from the file before importing to Alma.
Tom imports the records to production using a repository import profile named "Chinese Rare Books Project". This import creates the holdings records as well.
These records are stored in a set named with the phase and batch number. Tom runs a “Change holdings information” job to apply the normalization process “Add 852$2 to gestsk records”. The process sets indicator 1 of holdings field 852 to 7, indicating that an alternate call number system is being used (as specified in $2), and adds the subfield $2 itself with the value “gestsk”.
Tom runs another “Change holdings information” job and checks the box “Update call number from the bibliographic record”. This updates the call number using a mapping that copies field 084$a to holdings 852$h. (These call numbers are not LC, but rather a special system specifically for Chinese rare books.) This must be done as a separate job from the previous step, because indicator 1 of the 852 must already be set for the job to know which call number mapping to apply.
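The reason the two holdings jobs must run in order can be shown with a small offline model. This is a sketch of the logic only, not of how Alma implements it: the field shape and function names are hypothetical, and the mapping table contains just the one entry described above (indicator 1 value 7 draws on 084$a).

```python
# Assumption: only the mapping described in the text is modeled here.
CALL_NUMBER_SOURCE = {"7": ("084", "a")}


def set_subfield(subfields, code, value, before=None):
    """Replace the first subfield with this code, or insert a new one."""
    codes = [c for c, _ in subfields]
    if code in codes:
        i = codes.index(code)
        return subfields[:i] + [(code, value)] + subfields[i + 1:]
    if before in codes:
        i = codes.index(before)
        return subfields[:i] + [(code, value)] + subfields[i:]
    return subfields + [(code, value)]


def add_852_source(field_852, scheme="gestsk"):
    """Step 1: set indicator 1 to 7 and record the scheme in $2."""
    updated = dict(field_852, ind1="7")
    updated["subfields"] = set_subfield(field_852["subfields"], "2", scheme)
    return updated


def update_call_number(field_852, bib_fields):
    """Step 2: copy the call number from the bib field chosen by indicator 1."""
    source = CALL_NUMBER_SOURCE.get(field_852["ind1"])
    if source is None:
        return field_852  # no mapping for this indicator, so nothing changes
    tag, code = source
    value = next(
        (v for f in bib_fields if f["tag"] == tag
         for c, v in f["subfields"] if c == code),
        None,
    )
    if value is None:
        return field_852
    updated = dict(field_852)
    updated["subfields"] = set_subfield(
        field_852["subfields"], "h", value, before="2"
    )
    return updated
```

Calling update_call_number on a holdings field whose indicator 1 is still blank returns it unchanged, which mirrors why the indicator has to be set by the first job.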

Sandbox Refresh

Since the Phase 1 work is being done in the sandbox, this work must be backed up and restored in connection with each sandbox refresh in February and August. The normalization and indication rules related to the project are in the alma-config GitHub repository, and these are automatically added to the sandbox after each refresh. The same is true for the normalization process “Add 040 $e cgcrb”. However, this process is inactive by default and must be manually activated after the refresh is complete. Also, any record sets related to the project must be backed up prior to the refresh and then restored after. This can be done by exporting each set before the refresh begins. After the refresh, the sets can be recreated by extracting the list of MMS IDs for each set to a text file, then creating a set corresponding to each one. The records themselves can be updated using the MarcEdit Alma integration.
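One way to extract the MMS IDs is sketched below. It assumes each set was exported as MARCXML and that the MMS ID is in control field 001. The function names are hypothetical, and the optional header line is included only because some set-loading tools expect one.

```python
import xml.etree.ElementTree as ET


def extract_mms_ids(marcxml_text):
    """Return the 001 value of every record in a MARCXML export, in order."""
    root = ET.fromstring(marcxml_text)
    ids = []
    for elem in root.iter():
        # Compare local names so this works with or without the MARC21 namespace.
        if elem.tag.split("}")[-1] == "controlfield" and elem.get("tag") == "001":
            if elem.text and elem.text.strip():
                ids.append(elem.text.strip())
    return ids


def write_id_list(ids, path, header=None):
    """Write one MMS ID per line, with an optional header line first."""
    lines = ([header] if header else []) + list(ids)
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")
```

Running this once per exported set yields one text file per set, each of which can be used to recreate the corresponding set after the refresh.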
