- Article Type: Q&A
- Product: Rosetta
- Product Version: 5.3
What are the OAI-PMH Harvester best practices, limitations and known issues?
Before setting up the OAI-PMH Harvester to migrate a legacy repository, see the article on the Developer Network: https://developers.exlibrisgroup.com/blog/Migrating-from-Your-Legacy-Digital-Repository-to-Rosetta
Make sure you understand what you intend to use the OAI-PMH Harvester for. You can use it:
- To synchronize metadata of IEs already ingested in Rosetta with an external system:
- Objects originating from Rosetta.
- Objects created in an external system that still manages the metadata.
- From version 5.3, the OAI-PMH Harvester can match records by any DC or DCTERMS identifier, not only by the external origin identifier (OAI Header ID) or the Rosetta origin (dc:identifier).
- Schedule an Update Metadata Job to perform the ongoing updates.
- To load new objects (data streams + metadata) into Rosetta for the first time:
- The OAI-PMH Harvester will create SIPs (Dublin Core or METS XML with metadata).
- Set the Match parameter to "Do not match (duplicate)" in the OAI-PMH Harvester configuration.
- When the user name and password are left empty in the OAI-PMH Harvester configuration, no authentication is performed.
- The submission job will upload the data streams referenced in these SIPs into Rosetta.
- File references in the SIP XML must point to an actual file (not a resolver such as DigiTool's Delivery Manager) and, if a URL is used, it must contain a legitimate filename. Note that the "Migrating Your Digital Repository to Rosetta" article includes XSL transformation examples for the dc:identifier (e.g. DSpace, DigiTool, DigitalCommons, ContentDM).
- In the Content structure, define the stream source origin (typically dc:identifier).
- Use the Test area of the OAI-PMH Harvester configuration to verify connectivity and your XSL transformation.
- Select the 'ignore last run time' checkbox when repeating test ingests.
- Test the submission job on larger sets to verify that the source streams can be downloaded.
- Downloading the stream files via HTTP can take time. For larger migrations, the file references in the SIP XML should point to an NFS location mounted on Rosetta.
- Rosetta will access all URLs provided in the SIP XML stream origin (dc:identifier). Make sure that these URLs resolve to valid objects, and consider how the source system handles versioning or deletion of files.
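The file-reference requirement above can be checked outside Rosetta before a harvest is configured. The following is a minimal sketch, not part of Rosetta: it assumes you have saved an OAI-PMH ListRecords response in oai_dc format from the source repository, and it flags records whose dc:identifier values do not look like direct file URLs. The function names, the sample host repo.example.org, and the "last path segment has a file extension" heuristic are all assumptions made for illustration; adapt the rule to how your source system exposes its files.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Namespace URIs defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def stream_references(list_records_xml: str) -> dict:
    """Map each OAI header identifier to the dc:identifier values of its record."""
    root = ET.fromstring(list_records_xml)
    refs = {}
    for record in root.iterfind(".//oai:record", NS):
        header_id = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        refs[header_id] = [
            (el.text or "").strip() for el in record.iterfind(".//dc:identifier", NS)
        ]
    return refs


def looks_like_file_url(reference: str) -> bool:
    """Heuristic (assumption): http(s) URL whose last path segment has an extension."""
    parsed = urlparse(reference)
    if parsed.scheme not in ("http", "https"):
        return False
    last_segment = PurePosixPath(parsed.path).name
    return bool(PurePosixPath(last_segment).suffix)


def records_without_file_url(list_records_xml: str) -> list:
    """Return header identifiers of records with no file-like dc:identifier."""
    return [
        header_id
        for header_id, refs in stream_references(list_records_xml).items()
        if not any(looks_like_file_url(ref) for ref in refs)
    ]


# Hypothetical sample response; repo.example.org is a placeholder host.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:identifier>https://repo.example.org/files/1/report.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:repo.example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:identifier>https://repo.example.org/webclient/DeliveryManager?pid=2</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    for header_id in records_without_file_url(SAMPLE):
        print("No direct file URL found for record:", header_id)
```

In this sample, the second record is reported because its only dc:identifier points to a resolver-style URL without a filename. Records flagged this way are candidates for an XSL transformation that rewrites the dc:identifier to a direct file location. The check is purely syntactic; it does not confirm that a file actually exists at the URL.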
- Article last edited: 06-Nov-2017