Chinese Rare Books Project

The purpose of this project is to add and improve the records for Princeton's Chinese rare books, especially those published from 1796 to 1911 (a similar project a few years ago focused on earlier works). The project has several phases. Phases 1 and 2 are based on print versions of the Gest collection catalog compiled by Qu Wanli and Chang Bide (which have been scanned and OCRed). Phase 1 enhances records from these sources that are already in Alma. Phase 2 adds those records that were not previously in Alma. Each of these phases covers about 800 records. Future phases of the project will involve Chinese rare books in Cotsen and Marquand, as well as others in the EAL that were not originally listed in the print catalogs.

In both Phases 1 and 2, records are generated or enhanced offline using automated scripts created by Tom Ventimiglia. These scripts combine the current catalog record (if one exists) with outside information sources, such as the print sources mentioned above and other notes from CHRB project staff. They also add standard fields such as 33x fields, adjust punctuation and capitalization, add romanization, and make other changes so that the records conform to current standards. It is understood that not everything can be automated and the automation itself may introduce some errors. Thus, after the scripts are run, the records are uploaded to Alma, where Shuwen Cao and her student assistants manually review each record and make additional adjustments as needed. The outside source data for the records (and any notes from Tom) are stored in fields 997, 998, and 999 for reference. Records are uploaded in batches of 100 each, and the next batch is prepared when the previous batch is close to complete.

For Phase 1, most of the cataloging work is done in the sandbox, and the records are synced with production once complete. This ensures that incomplete records are not visible in production. The specific workflow for Phase 1 is:

Tom exports a batch of records from Alma production.
Tom runs the offline enhancement scripts on the records.
Tom uploads the records to the Alma sandbox (using the MarcEdit Alma integration), storing them in a set named with the phase and batch number.
Shuwen reviews each record in the set. If a record requires additional review, she makes notes in field 995. Once a record is ready for production, she runs a normalization process to add field 040$e with the code "cgcrb", which indicates that the record was cataloged according to RLG's "Cataloging guidelines for creating Chinese rare book records in machine-readable form" (http://www.eastasianlib.org/ctp/webinars/ChineseRareBook/CRBP_guidelines.pdf).
Once Shuwen is done with the batch, Tom filters out records requiring additional review using indication rules that look at field 995.
Tom exports the completed records from the sandbox, removes the 99x fields related to this project, then uploads them to production using the MarcEdit Alma integration. Since the MMS IDs remain the same throughout the workflow, these records are overlaid on the existing records in production.
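
The last two steps of the Phase 1 workflow can be illustrated with a short Python sketch. It is not the method used in the project, which relies on Alma indication rules and MarcEdit. It only shows the same logic applied to a MARCXML export: records with a 995 field are held back, and the project's 99x fields are removed from the rest. The function names and the exact list of tags to strip (995, 997, 998, 999) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
NS = {"marc": MARC_NS}

# Assumption: these are the only 99x fields that belong to this project.
PROJECT_TAGS = {"995", "997", "998", "999"}


def needs_review(record):
    """Return True if the record carries a 995 review note."""
    return record.find("marc:datafield[@tag='995']", NS) is not None


def strip_project_fields(record):
    """Remove the project's 99x fields from a record, in place."""
    for field in record.findall("marc:datafield", NS):
        if field.get("tag") in PROJECT_TAGS:
            record.remove(field)


def split_batch(collection):
    """Split a MARCXML collection into (ready, held_back) record lists.

    Ready records have their project 99x fields removed; records that
    still need review are returned untouched.
    """
    ready, held_back = [], []
    for record in collection.findall("marc:record", NS):
        if needs_review(record):
            held_back.append(record)
        else:
            strip_project_fields(record)
            ready.append(record)
    return ready, held_back
```

Given a collection parsed with `ET.parse(path).getroot()`, `split_batch` returns the records that could go to production and those that stay in the sandbox for further review.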

For Phase 2, it is not necessary to use the sandbox, since there are no previous versions of the records in production. Records are uploaded as suppressed, then unsuppressed once complete. The specific workflow is:

Tom generates a batch of records offline. The source data and any notes for Shuwen are stored in fields 997, 998, and 999. Since these are brief records, field 040$e is not added.

Tom imports the records to production using a repository import profile named "Chinese Rare Books Project". This import creates the holdings records as well, using a call number mapping that copies field 084$a to holdings 852$h. (The call numbers are not LC, but rather a special system used specifically for Chinese rare books. Such call numbers are indicated with the code "gestsk" in subfield $2.)

These records are stored in a set named with the phase and batch number. All records are initially suppressed. Tom runs a holdings normalization process "Add 852x to gestsk records", which adds a non-public note to holdings field 852$x indicating that the record was generated by a batch process. The process also sets indicator 1 of holdings field 852 to 7, indicating that an alternate call number system is being used (as specified in $2).

Shuwen reviews each record in the set and makes any needed adjustments. Once a record is complete, she removes the 99x fields related to this project and unsuppresses it, making it visible in production.
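
To make the holdings mapping in the Phase 2 workflow concrete, the following Python sketch builds the 852 field that results from the import profile and the "Add 852x to gestsk records" process: 084$a copied to 852$h, indicator 1 set to 7, the code "gestsk" in $2, and a non-public note in $x. In practice this is done inside Alma, not by a script. The function name and the note wording are hypothetical, and location subfields that a real holdings record would carry are omitted.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
NS = {"marc": MARC_NS}

# Hypothetical wording; the actual text of the 852$x note is not given here.
BATCH_NOTE = "Generated by batch process"


def build_852(bib_record, note=BATCH_NOTE):
    """Build a holdings 852 field from bib 084$a.

    Returns None when the bib record has no 084$a to copy.
    """
    call_number = bib_record.findtext(
        "marc:datafield[@tag='084']/marc:subfield[@code='a']",
        namespaces=NS,
    )
    if not call_number:
        return None

    # Indicator 1 = 7 means the classification scheme is named in $2.
    field = ET.Element(
        f"{{{MARC_NS}}}datafield",
        {"tag": "852", "ind1": "7", "ind2": " "},
    )
    for code, value in (("h", call_number), ("x", note), ("2", "gestsk")):
        subfield = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": code})
        subfield.text = value
    return field
```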

Sandbox Refresh

Since the Phase 1 work is being done in the sandbox, this work must be backed up and restored in connection with each sandbox refresh in February and August. The normalization and indication rules related to the project are in the alma-config GitHub repository, and these are automatically added to the sandbox after each refresh. The same is true for the normalization process “Add 040 $e cgcrb”. However, this process is inactive by default and must be manually activated after the refresh is complete. Also, any record sets related to the project must be backed up prior to the refresh and then restored afterward. This can be done by exporting each set before the refresh begins. After the refresh, the sets can be recreated by extracting the list of MMS IDs from each exported set into a text file, then creating a set from each file. The records themselves can be updated using the MarcEdit Alma integration.
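
The MMS ID extraction step could be scripted along the following lines. This is a minimal sketch, assuming each set was exported as MARCXML with the MMS ID in control field 001. The function names are hypothetical, and the optional header line is included only in case the set-creation upload expects one.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
NS = {"marc": MARC_NS}


def mms_ids(marcxml_path):
    """Return the 001 value (assumed to be the MMS ID) of each record."""
    root = ET.parse(marcxml_path).getroot()
    ids = []
    for record in root.iter(f"{{{MARC_NS}}}record"):
        mms = record.findtext("marc:controlfield[@tag='001']", namespaces=NS)
        if mms and mms.strip():
            ids.append(mms.strip())
    return ids


def write_id_list(marcxml_path, out_path, header=None):
    """Write one MMS ID per line to out_path and return the number of IDs.

    header is an optional first line (an assumption, for uploads that
    require a column heading).
    """
    ids = mms_ids(marcxml_path)
    lines = ([header] if header else []) + ids
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")
    return len(ids)
```

Running `write_id_list` once per exported set before the refresh yields one text file per set, which can then be used to recreate the sets afterward.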