Harvesting Data

One of the key functions of the Primo system is harvesting data from multiple data sources into Primo’s local repository and normalizing the records to the Primo Normalized XML (PNX) record format. The Primo Publishing Platform performs this process using pipes, which harvest, normalize, and enrich the records. For additional information on the publishing platform, refer to Overview of the Publishing Process.

Primo supports several harvesting methods, including FTP, Copy, and OAI-PMH.

Because the initial harvest generally processes all records in the database and may be very large, we recommend using the Copy method for it, since Copy is faster than the FTP or OAI-PMH methods. Subsequent harvests include only new, updated, and deleted records.

When setting up your data for harvesting, all the information that you want to include in the PNX record should be included in a single record that is identified by a unique and persistent ID. This means that any information that is related to the main record (for example, holdings information related to the bibliographic record) must be appended to the main record.
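The following is a minimal sketch of how a source system's export step might fold holdings information into the bibliographic record before harvesting, assuming the export is MARCXML. The function name append_holdings, the 949 tag, and the subfield codes are placeholders chosen for illustration only; they are not a Primo requirement, and the field you use depends on your own normalization rules.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("", MARC_NS)


def append_holdings(bib_record, holdings):
    """Append one MARC datafield per holding to a MARCXML <record> element.

    The 949 tag and the subfield codes are assumptions made for this sketch.
    """
    for holding in holdings:
        field = ET.SubElement(
            bib_record,
            f"{{{MARC_NS}}}datafield",
            {"tag": "949", "ind1": " ", "ind2": " "},
        )
        for code, value in holding.items():
            sub = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": code})
            sub.text = value
    return bib_record


# A tiny bibliographic record; controlfield 001 carries the unique, persistent ID.
BIB_XML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">000012345</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Example title</subfield>
  </datafield>
</record>"""

record = ET.fromstring(BIB_XML)
append_holdings(record, [{"b": "MAIN", "c": "Stacks", "d": "QA76.9 .E93"}])
merged_xml = ET.tostring(record, encoding="unicode")
```

After this step, merged_xml holds a single record that carries both the bibliographic and the holdings data under one ID.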

For ongoing harvesting, the data source should support incremental harvesting, that is, supplying only the new and updated records. If this is not possible, Primo provides a Delete Data Source and Reload pipe type, which first deletes all of the harvested data source's records from the database.
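For a data source that exposes OAI-PMH, incremental harvesting corresponds to a ListRecords request with a from argument. The sketch below only builds such a request URL; it does not show how Primo itself issues requests. The base URL, the function name build_incremental_request, and the marc21 metadata prefix are assumptions, and the timestamp format assumes the repository supports seconds granularity.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def build_incremental_request(base_url, last_harvest,
                              metadata_prefix="marc21", set_spec=None):
    """Build an OAI-PMH ListRecords URL for records changed since last_harvest."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        # OAI-PMH datestamps are expressed in UTC.
        "from": last_harvest.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository address and last successful harvest time.
last_run = datetime(2024, 1, 15, 2, 0, 0, tzinfo=timezone.utc)
url = build_incremental_request("https://repository.example.org/oai", last_run)
```

In OAI-PMH, a repository that tracks deletions reports them in the same response through a header with status="deleted", which is how deleted records can reach an incremental harvest.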

In principle, Primo can harvest and normalize any XML record. MARCXML (http://www.loc.gov/standards/marcxml/) and Dublin Core XML (http://dublincore.org/documents/dc-xml-guidelines/) are examples of suitable physical record structures. Note that the record can include non-standard MARC or Dublin Core fields. If your XML records use a different layout, consult the Primo implementation team.

The XML records should be in the UTF-8 character set.
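A quick pre-harvest check can catch encoding problems before the records reach a pipe. The sketch below is one possible way to verify that an exported file's bytes are valid UTF-8 and well-formed XML; the function name check_utf8_xml is hypothetical and the check is not part of Primo.

```python
import xml.etree.ElementTree as ET


def check_utf8_xml(data: bytes):
    """Return (ok, message) for a byte string expected to hold UTF-8 XML."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return False, f"not valid UTF-8 at byte {exc.start}"
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        return False, f"not well-formed XML: {exc}"
    return True, "ok"


# The same record encoded correctly (UTF-8) and incorrectly (Latin-1).
sample = "<record><title>Café</title></record>"
good_result = check_utf8_xml(sample.encode("utf-8"))
bad_result = check_utf8_xml(sample.encode("latin-1"))
```

For an export file, the same function can be applied to the bytes read with open(path, "rb").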

In addition to XML, Primo can harvest MARC exchange records (ISO 2709). These records can be in either the UTF-8 or the MARC-8 character set. Note that any related holdings information must be embedded in the bibliographic record; separate holdings records linked to the bibliographic record cannot be harvested.
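To find out which character set an exchange file declares, you can inspect the record leader. The sketch below assumes MARC 21 records, where leader position 09 is "a" for Unicode (UTF-8) and blank for MARC-8; other MARC flavors may use the leader differently. The function names are hypothetical, and the check reads only the declared value; it does not verify that the data actually matches it.

```python
RECORD_TERMINATOR = b"\x1d"  # ISO 2709 record terminator


def split_records(data: bytes):
    """Split an ISO 2709 byte stream into individual records."""
    return [chunk + RECORD_TERMINATOR
            for chunk in data.split(RECORD_TERMINATOR) if chunk.strip()]


def leader_encoding(record: bytes) -> str:
    """Report the character set declared in a MARC 21 leader (position 09)."""
    leader = record[:24]
    if len(leader) < 24:
        raise ValueError("record is shorter than the 24-byte leader")
    flag = leader[9:10]
    if flag == b"a":
        return "UTF-8"
    if flag == b" ":
        return "MARC-8"
    raise ValueError(f"unexpected character coding scheme: {flag!r}")


# Two skeleton records (leader only) that differ in leader position 09.
utf8_record = b"00714cam a2200205 a 4500" + b"\x1e\x1d"
marc8_record = b"00714cam  2200205 a 4500" + b"\x1e\x1d"
declared = [leader_encoding(r) for r in split_records(utf8_record + marc8_record)]
```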

Primo provides customizable template normalization rules for MARC, MAB, and Dublin Core, as well as system-specific rules for Aleph, Voyager, Unicorn, SFX, MetaLib, and DigiTool, among others. Primo can also harvest and process records that are already in the PNX format.

This section includes: