Harvesting source records and creating PNX records are managed by the Publishing Platform. The publishing platform supports scheduled and unattended harvesting and processing of various data formats, allowing interactive monitoring and control over the entire set of activities.
Within the publishing platform, PNX records are created by publishing pipes. Every data source has its own pipe. Each data source may have its own set of normalization rules, or several data sources may be linked to one set of normalization rules.
This section covers defining, editing, and deleting pipes.
The Define Pipe page allows you to add and update pipes. After you have created or updated a pipe, you will need to execute the pipe to create or update the PNX records. For information on executing and monitoring pipes, see Monitoring Pipe Status.
To create an effective pipe for your system, first create your data sources, normalization mapping sets, and enrichment sets.
Define Pipe Page
To define a new pipe:
- Click Pipe Configuration Wizard on the Ongoing Configuration Wizard page. The Pipe Configuration Wizard page opens. You can also access the Define Pipe page by clicking Create new pipe on the Primo Home > Monitor Primo Status > Pipe Monitoring page.
- Click Pipes Configuration. The Pipes Configuration page opens.
Pipe Configuration Page
- Click Define Pipe. The Define Pipe page opens (see Define Pipe Page).
- Select the name of the institution from the Owner drop-down list. For institution-level staff users, your institution will already be selected. For installation-level users, you must select an institution before the associated values appear in the drop-down lists that display the Select Institution value.
- In the Pipe Name field, enter the name of the new pipe. The pipe name can contain letters, numbers, and the underscore character.
- In the Pipe Description field, enter a description for the new pipe.
- Enter the remaining fields as described in the following table.
Define Pipe Details (each field name is followed by its description)

Pipe Type
Indicates the type of pipe. The following types are valid:
- Regular – This type of pipe uses records harvested from the data source to create, update, and delete PNX records. For more information on the stages of pipe execution, see Configuring the Publishing Platform Pipe Flow.
- Delete Data Source – This type of pipe is used to delete a data source from the Primo database, including data from dedup and FRBR groups. It removes all previously harvested records from the P_PNX and P_SOURCE_RECORD tables for the specified data source. In addition, it removes all tags and reviews.
- No Harvesting – Update Data Source – This pipe is similar to a Regular pipe, but records are not harvested from the data source. Instead, it uses all of the previously harvested source records from the P_SOURCE_RECORD table. This type of pipe is typically used when it is necessary to re-normalize and/or enrich all records from a specific data source (for example, due to a change in normalization rules).
- Delete Data Source and Reload – This pipe is similar to the Regular pipe, but it first removes all harvested records from the P_PNX and P_SOURCE_RECORD tables before reloading the PNX records from the data source. This option is intended for data sources (such as MetaLib) that have to harvest the entire database each time. This ensures that deleted records from the data source are removed from Primo.
The default value is Regular.
When running pipes that add or change a large amount of data (such as pipes set to No Harvesting – Update Data Source), it is recommended that you stop Oracle archiving, because archiving slows down the process and fills up the disk. Immediately after the process is complete, perform a full cold backup and then turn archiving back on.
Records that are deleted and re-inserted using the Delete Data Source and Reload option may be included in the tally of updated records (instead of the deleted and inserted records) in the pipe's log.

Data Source
The data source of the pipe.

Normalization Mapping Set
The normalization set used to map the source records to the PNX.

Priority
Defines the priority of the pipe: Low, Medium, High, or Critical. Pipes with the highest priority run first. The default setting is Medium.

Maximum error threshold
The maximum percentage of errors allowed before the system stops running the pipe.

Harvesting method
The method used to harvest the source information. The following methods can be selected: FTP, Copy, OAI, and SFTP.
If Copy is selected, the user must have read permission for the directory.

Enrichment Set
The enrichment set used to enrich the records.

Harvested File Format
Indicates the format of the harvested file. The following values are valid: *.tar.gz, *.tar, *.gz, *.warc, *.warc.gz, and *.zip.
This field is not available with all types of pipes, such as Delete Data Source.
The *.gz, *.warc, *.warc.gz, and *.zip formats require the data source to use the WARC file splitter.

Start harvesting files/records from
The date from which to harvest the records.
- For FTP/Copy, this is the date and time of the file to harvest. Following harvesting, this date is updated with the date of the latest harvest file.
- For OAI, this is the date and time on which the file is to be updated. Following harvesting, this is updated with the date of the request.
This date is updated after each successful run of the pipe to ensure that all harvested files have been processed completely.

Start time
The time from which to harvest the records.

System Last Stage
This field allows you to change the last stage that is run during the execution of a pipe. By default, this field is set to FRBR, the last stage of pipe execution. The following values are valid:
- PERSISTENCE – This option stops the execution of the pipe after loading records to the database. Note that the Dedup and FRBRization stages are not executed.
- DEDUP – This option stops the execution of the pipe after the Dedup stage. Note that the FRBRization stage is not executed.
- FRBR – This default option stops the execution of the pipe after the FRBRization process completes.
- FRBR WITHOUT DEDUP – This option skips the Dedup stage and stops the execution of the pipe after the FRBRization process completes.
This field does not display when the Parallel Processing of Pipes mode is set to Harvesting, NEP on the General Configuration page.

Include DEDUP
Indicates whether the Dedup stage will be executed when the Parallel Processing of Pipes mode is set to Harvesting, NEP on the General Configuration page.

Include FRBR
Indicates whether the FRBR stage will be executed when the Parallel Processing of Pipes mode is set to Harvesting, NEP on the General Configuration page.

Force DEDUP
Indicates whether Dedup processing is performed on PNX records that have no changes to the dedup section. This allows you to apply changes made to the Dedup rules.
If the pipe is not configured to run the Dedup stage, Dedup processing will not be forced regardless of this setting.

Force FRBR
Indicates whether FRBR processing is performed on PNX records that have no changes to the frbr section. This allows you to apply changes made to the FRBR rules.
If the pipe is not configured to run the FRBR stage, FRBR processing will not be forced regardless of this setting.

Server
The IP used to access the server.
This field appears only if the harvesting method is OAI, FTP, or SFTP.
For OAI, the system supports the HTTPS protocol for harvesting.

Username
The user name used to access the server.
This field appears only if the harvesting method is FTP or SFTP.

Password
The password used to access the server.
This field appears only if the harvesting method is FTP or SFTP.

Metadata format (OAI only)
All OAI-PMH compliant repositories can return records in Dublin Core format. The Dublin Core format is usually expressed as oai_dc, but some repositories use a different code. Enter the term used by your repository.
This field appears only if the harvesting method is OAI.

Set (OAI only)
OAI repositories may organize items into sets, allowing you to selectively harvest information. Specify the name of the set if you want to harvest only a specific part of the OAI repository.
This field appears only if the harvesting method is OAI.

Source directory
The directory of the source record. This is used for copy only.
This field appears only if the harvesting method is Copy, FTP, or SFTP.

Delete after copy
Indicates whether the system should delete the source files after the harvest. If selected, the files are deleted as follows, per harvesting method:
- Copy – The files are removed from the directory on the Primo server.
- FTP/SFTP – The files are removed from the directory on the source server. If the staff user does not have write permissions to the source files, the system will stop the pipe and log the following error: stop harvest error
If this check box is not selected, the source files are not removed from their respective directories after harvesting.
After the harvest, the system stores a copy of the source files in the harvest directory. To view the harvested files, enter the following commands:
- cd <pipe_name>/<data_source>/<timestamp-of-the-pipe_run>/harvest

Configure Server Locale
When this check box is selected, the page displays the Server Locale field.
This field appears only if the harvesting method is FTP.

Server Locale
Select a locale from the drop-down list.
This field appears only if the harvesting method is FTP and the Configure Server Locale check box is selected.
By default, the harvester assumes the locale of the server is English. If the locale of your server is different, you must select the relevant locale.

- For FTP, OAI, and SFTP harvesting methods, click Test Connection to verify the connection to the server.
- Click Save.
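
The pipe name rule and the System Last Stage values described for the Define Pipe page can be summarized in a short sketch. This is only an illustration, not Primo code: the names `is_valid_pipe_name`, `post_load_stages`, and `STAGES_BY_LAST_STAGE` are hypothetical, the restriction to ASCII letters is an assumption, and the FRBR entry assumes that the Dedup stage runs before FRBRization (implied by the separate FRBR WITHOUT DEDUP option).

```python
import re

# Hypothetical helpers for illustration only; none of these names exist in Primo.

# Assumption: "letters" means ASCII letters. The documentation only says
# letters, numbers, and the underscore character.
PIPE_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")

# Stages that run after records are loaded to the database, per
# System Last Stage value. The FRBR entry assumes Dedup runs first.
STAGES_BY_LAST_STAGE = {
    "PERSISTENCE": [],
    "DEDUP": ["DEDUP"],
    "FRBR": ["DEDUP", "FRBR"],
    "FRBR WITHOUT DEDUP": ["FRBR"],
}


def is_valid_pipe_name(name):
    """Return True if the name uses only letters, digits, and underscores."""
    return PIPE_NAME_PATTERN.fullmatch(name) is not None


def post_load_stages(last_stage="FRBR"):
    """Return the stages executed after loading; FRBR is the default value."""
    if last_stage not in STAGES_BY_LAST_STAGE:
        raise ValueError("Unknown System Last Stage value: " + repr(last_stage))
    return list(STAGES_BY_LAST_STAGE[last_stage])
```

For example, `post_load_stages("PERSISTENCE")` returns an empty list because neither the Dedup nor the FRBRization stage is executed for that value.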
Editing a Pipe
You can edit the pipe details if the pipe is not running.
To edit a pipe:
- On the Primo Home > Monitor Primo Status > Pipe Monitoring page, click Edit next to the pipe that you want to update. The Define Pipe page opens, showing the details of the specified pipe (see Define Pipe Page).
- Edit the fields according to Define Pipe Details.
- Click Save to update the pipe's settings.
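
When you change the harvesting method while editing a pipe, the connection fields that appear on the page change with it. The sketch below restates the "This field appears only if..." rules from Define Pipe Details as a lookup. The names `FIELDS_BY_METHOD`, `visible_fields`, and `supports_test_connection` are hypothetical, and the lookup covers only the fields whose visibility the documentation states explicitly.

```python
# Hypothetical lookup for illustration only; it is not part of Primo.
HARVESTING_METHODS = {"FTP", "Copy", "OAI", "SFTP"}

# Field name -> harvesting methods for which the field appears.
FIELDS_BY_METHOD = {
    "Server": {"OAI", "FTP", "SFTP"},
    "Username": {"FTP", "SFTP"},
    "Password": {"FTP", "SFTP"},
    "Metadata format": {"OAI"},
    "Set": {"OAI"},
    "Source directory": {"Copy", "FTP", "SFTP"},
    "Configure Server Locale": {"FTP"},
}


def visible_fields(method):
    """Return the method-specific fields shown for a harvesting method."""
    if method not in HARVESTING_METHODS:
        raise ValueError("Unknown harvesting method: " + repr(method))
    return sorted(
        field for field, methods in FIELDS_BY_METHOD.items() if method in methods
    )


def supports_test_connection(method):
    """Test Connection applies to the FTP, OAI, and SFTP methods."""
    if method not in HARVESTING_METHODS:
        raise ValueError("Unknown harvesting method: " + repr(method))
    return method in {"FTP", "OAI", "SFTP"}
```

With these rules, `visible_fields("Copy")` yields only the Source directory field, while `visible_fields("OAI")` yields Metadata format, Server, and Set.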
Deleting a Pipe
You can delete a pipe that has not been executed. After it has been executed, you must open a Support ticket to have it deleted.
When a pipe is deleted, the system will also delete any schedules created for the pipe.
To delete a pipe:
- On the Primo Home > Monitor Primo Status > Pipe Monitoring page, click Edit next to the pipe that you want to delete. The Define Pipe page displays the specified pipe's details (see Define Pipe Page).
- Click Delete Pipe to delete the pipe.
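
The conditions for editing and deleting a pipe can be expressed as two simple checks. This is a minimal sketch under stated assumptions: `PipeState`, `can_edit`, and `can_delete` are hypothetical names, and the two flags stand in for whatever status information the Pipe Monitoring page reports.

```python
from dataclasses import dataclass


# Hypothetical model for illustration only; it is not part of Primo.
@dataclass
class PipeState:
    is_running: bool = False         # the pipe is currently executing
    has_been_executed: bool = False  # the pipe has run at least once


def can_edit(pipe):
    """A pipe's details can be edited only while the pipe is not running."""
    return not pipe.is_running


def can_delete(pipe):
    """A pipe can be deleted only if it has never been executed.

    After a pipe has been executed, deletion requires a Support ticket.
    """
    return not pipe.has_been_executed
```

A newly defined pipe passes both checks. A pipe that has finished running can still be edited, but it can no longer be deleted from the Define Pipe page.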