  • Ex Libris Knowledge Center

    OAI-PMH Harvester best practices, limitations and known issues

    • Article Type: Q&A
    • Product: Rosetta
    • Product Version: 5.3


    What are the OAI-PMH Harvester best practices, limitations and known issues?


    Before setting up the OAI-PMH Harvester to migrate a legacy repository, see the article on the developer network.

    Make sure you understand what you intend to use the OAI-PMH Harvester for. You can use it:

    1. To synchronize metadata of IEs already ingested in Rosetta with an external system:
      • Objects originating from Rosetta.
      • Objects created in an external system that still manages the metadata.
      • From version 5.3, the OAI-PMH Harvester can match records by any DC or DCTERMS identifier, not only by the external origin identifier (OAI header ID) or the Rosetta origin (dc:identifier).
      • Schedule an Update Metadata Job to perform the on-going updates.
    2. To load new objects (data streams + metadata) into Rosetta for the first time:  
      • The OAI-PMH Harvester will create SIPs (Dublin Core or METS XML with metadata).
      • Set the Match parameter to Do not match (duplicate) in the OAI-PMH Harvester configuration.
      • When the user name and password are left empty in the OAI-PMH Harvester configuration, no authentication is performed.
      • A submission job will upload the data streams referenced in these SIPs into Rosetta.
      • File references in the SIP XML must point to an actual file (not a resolver, e.g. DigiTool's Delivery Manager) and, if a URL is used, it must contain a legitimate filename. Note that the "Migrating Your Digital Repository to Rosetta" article includes XSL transformation examples for the dc:identifier (e.g. DSpace, DigiTool, DigitalCommons, ContentDM).
      • In Content structure, define the stream source origin (typically dc:identifier).
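
    The filename requirement for file references can be pre-checked outside Rosetta before a harvest. The following is a minimal sketch, not part of Rosetta: the function name filename_from_reference and the example URLs are hypothetical, and it assumes that a "legitimate filename" means a last path segment with a non-empty name and an extension.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import unquote, urlparse


def filename_from_reference(ref: str) -> Optional[str]:
    """Return the filename a file reference ends in, or None if it has none."""
    parsed = urlparse(ref)
    # For URLs, look only at the path: a resolver-style link usually carries
    # the object ID in the query string and has no filename in the path.
    if parsed.scheme in ("http", "https"):
        path = unquote(parsed.path)
    else:
        path = ref  # treated as a plain file system path (e.g. an NFS mount)
    name = PurePosixPath(path).name
    # Assumption: a usable filename has a non-empty stem and an extension.
    stem, dot, ext = name.rpartition(".")
    return name if (stem and dot and ext) else None


# Hypothetical references: a direct file URL, a resolver URL, a mounted path.
print(filename_from_reference("https://repo.example.org/files/report_2017.pdf"))
print(filename_from_reference("https://repo.example.org/webclient/DeliveryManager?pid=12345"))
print(filename_from_reference("/mnt/legacy/streams/img001.tif"))
```

    References for which the function returns None are candidates for an XSL transformation that rewrites the dc:identifier into a direct file reference.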

    Testing phase:

    • Use the Test area of the OAI-PMH Harvester configuration to verify connectivity and your XSL transformation.
    • Use the 'ignore last run time' checkbox when repeating test ingests.
    • Test the submission job on larger sets to see whether the source streams can be downloaded.
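
    Alongside the Test area, it can help to look at what the source repository actually returns. The sketch below is an illustration under assumptions, not Rosetta functionality: it builds a standard OAI-PMH ListRecords request URL and lists the dc:identifier values per record from an oai_dc response. The base URL, the sample record, and the helper names list_records_url and identifiers_by_record are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build a ListRecords request URL; 'set' and 'from' are optional."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


def identifiers_by_record(xml_text):
    """Map each OAI header identifier to the record's dc:identifier values."""
    root = ET.fromstring(xml_text)
    result = {}
    for record in root.iterfind(".//oai:record", NS):
        header_id = record.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        result[header_id] = [
            el.text for el in record.iterfind(".//dc:identifier", NS) if el.text
        ]
    return result


# Hypothetical response fragment, used here instead of a live request.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:123</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample</dc:title>
          <dc:identifier>https://repo.example.org/files/report.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://repo.example.org/oai"))
print(identifiers_by_record(SAMPLE))
```

    The from argument is standard OAI-PMH selective harvesting. How Rosetta maps the 'ignore last run time' checkbox onto it is not described in this article, so treat any such correspondence as an assumption.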

    Other recommendations:

    • Downloading the stream files via HTTP can take time. For larger migrations, the file references in the SIP XML should point to an NFS location mounted to Rosetta.
    • Rosetta will access all URLs provided in the SIP XML stream origin (dc:identifier). Be sure that there are valid objects at these URLs. Think about how the source system handles versioning or deletion of files.
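
    Whether the stream origins are valid can be checked before a submission job runs. This is a minimal sketch under assumptions, not a Rosetta tool: the function name stream_is_available is hypothetical, HTTP origins are probed with a HEAD request (which some servers reject), and anything else is treated as a path on a mounted file system such as NFS.

```python
import os
import urllib.error
import urllib.request


def stream_is_available(origin: str, timeout: float = 10.0) -> bool:
    """Return True if the stream origin appears to point at an existing file."""
    if origin.startswith(("http://", "https://")):
        # HEAD avoids downloading the stream; a server that does not support
        # HEAD will be reported as unavailable by this sketch.
        request = urllib.request.Request(origin, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            return False
    # Otherwise assume a local or NFS-mounted path.
    return os.path.isfile(origin)
```

    Running such a check over all harvested dc:identifier values shortly before the submission job gives an early list of objects that were deleted or moved in the source system.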

    Note: The OAI transformation XSL currently supports XSLT 1.0.

    • Article last edited: 29-Mar-2018