Why is the Discovery Import Profile record I deleted from my repository still appearing in Primo VE?
Question
Why is the Discovery Import Profile record I deleted from my repository still appearing in Primo VE?
Answer
When working with "external" records in Primo VE, many customers ask, "Why is the Discovery Import Profile record I deleted from my repository still appearing in Primo?"
It is common for Discovery Import Profiles (DIPs) to harvest records using the OAI-PMH method from a customer's own repository or from a repository the customer subscribes to.
The most likely answer to this question is that the repository is not configured to handle deleted records.
In your browser, open the repository's OAI URL with the parameter 'verb=Identify' to see the repository's properties.
For example: <repository OAI URL>?verb=Identify
such as https://oai.datacite.org/oai?verb=Identify
The response is usually presented in XML format.
In the snippet below from oai.datacite.org, notice the value of the <deletedRecord> element in the <Identify> section.
<?xml version="1.0" ?><?xml-stylesheet type="text/xsl" href="/static/oaitohtml.xsl"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2023-09-20T15:04:28Z</responseDate>
<request verb="Identify">https://oai.datacite.org/oai</request>
<Identify>
<repositoryName>DataCite</repositoryName>
<baseURL>https://oai.datacite.org/oai</baseURL>
<protocolVersion>2.0</protocolVersion>
<adminEmail>support@datacite.org</adminEmail>
<earliestDatestamp>2011-01-01T00:00:00Z</earliestDatestamp>
<deletedRecord>persistent</deletedRecord> <!-- LOOK HERE -->
<granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
<compression>gzip</compression>
[...etc]
In this example, the value of <deletedRecord> is "persistent". For most customers asking this question, however, the value of their repository's <deletedRecord> is "no".
The meanings of all fields in the Identify section are described by the Open Archives Initiative here:
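The same check can be scripted instead of done in a browser. The following is a minimal sketch, not an Ex Libris tool: the function names identify_url and deleted_record_policy are hypothetical, and the sample text is a trimmed copy of the DataCite response shown above. It parses a saved Identify response and reports the <deletedRecord> value.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by every element in an OAI-PMH 2.0 response.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def identify_url(base_url: str) -> str:
    """Build the Identify request URL for an OAI-PMH base URL."""
    return f"{base_url}?{urlencode({'verb': 'Identify'})}"


def deleted_record_policy(identify_xml: str) -> str:
    """Return the text of <deletedRecord> from an Identify response."""
    root = ET.fromstring(identify_xml)
    element = root.find("oai:Identify/oai:deletedRecord", OAI_NS)
    if element is None or element.text is None:
        raise ValueError("no <deletedRecord> element found in the response")
    return element.text.strip()


# Trimmed copy of the DataCite Identify response (illustration only).
SAMPLE = """<?xml version="1.0" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2023-09-20T15:04:28Z</responseDate>
  <request verb="Identify">https://oai.datacite.org/oai</request>
  <Identify>
    <repositoryName>DataCite</repositoryName>
    <baseURL>https://oai.datacite.org/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <deletedRecord>persistent</deletedRecord>
  </Identify>
</OAI-PMH>"""

print(identify_url("https://oai.datacite.org/oai"))
print(deleted_record_policy(SAMPLE))
```

A result of "no" points to the situation described in this article. A result of "transient" or "persistent" means the repository is able to report deletions.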
https://www.openarchives.org/OAI/openarchivesprotocol.html#Identify
The <deletedRecord> field is described as: "the manner in which the repository supports the notion of deleted records. Legitimate values are no ; transient ; persistent with meanings defined in the section on deletion."
https://www.openarchives.org/OAI/openarchivesprotocol.html#DeletedRecords
The value "no" is described as, "the repository does not maintain information about deletions. A repository that indicates this level of support must not reveal a deleted status in any response."
"If a repository does not keep track of deletions then such records will simply vanish from responses and there will be no way for a harvester to discover deletions through continued incremental harvesting."
In this context, Primo VE is a "harvester", and has no way to know a record is deleted because when <deletedRecord> is "no", it is not possible for the repository to tell Primo that the record has been deleted.
This means that records deleted in the source repository are simply left out of the OAI-PMH response to Primo, so when using incremental updates, Primo does not know it should delete these records.
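The effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: the identifiers, the dictionary-based index, and the incremental_harvest function are hypothetical and do not represent how Primo VE stores or processes records.

```python
def incremental_harvest(index: dict, changed_records: dict) -> dict:
    """Merge records returned by an incremental harvest into a local index.

    Only records present in the response can be added or updated. A record
    that is absent from the response is left untouched.
    """
    updated = dict(index)
    updated.update(changed_records)
    return updated


# Hypothetical state: three records were harvested earlier.
index = {"oai:example:1": "v1", "oai:example:2": "v1", "oai:example:3": "v1"}

# Since the last harvest, record 2 was edited and record 3 was deleted in a
# repository that reports <deletedRecord>no</deletedRecord>. The response
# carries only the edited record; nothing in it mentions record 3.
response = {"oai:example:2": "v2"}

index = incremental_harvest(index, response)
print(sorted(index))
```

Record 3 is still in the index after the harvest, because the response gave the harvester no signal that it should be removed.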
The OAI Spec further states, "If a repository does keep track of deletions then the datestamp of the deleted record must be the date and time that it was deleted. Responses to 'GetRecord' request for a deleted record must then include a header with the attribute status="deleted", and must not include metadata or about parts."
When Primo VE harvests a record whose header carries the attribute status="deleted", Primo deletes that record if it exists in Primo.
If a customer's repository is set to <deletedRecord>no</deletedRecord>, the customer can check with the repository administrator to determine whether the configuration can be changed.
With a value of "transient" or "persistent", the repository can include deleted records in its OAI-PMH responses.
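To see what such a header looks like in practice, the sketch below scans a response for headers marked status="deleted". The sample response, its identifiers, and the deleted_identifiers function are hypothetical, and the code only illustrates the OAI-PMH response format; it is not Primo VE's harvesting code.

```python
import xml.etree.ElementTree as ET

# Namespace used by every element in an OAI-PMH 2.0 response.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical ListRecords response: one live record and one deleted record.
SAMPLE_LIST_RECORDS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2023-09-01T00:00:00Z</datestamp>
      </header>
      <metadata><!-- record metadata would appear here --></metadata>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:example.org:2</identifier>
        <datestamp>2023-09-15T00:00:00Z</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""


def deleted_identifiers(list_records_xml: str) -> list:
    """Return identifiers of records whose header has status="deleted"."""
    root = ET.fromstring(list_records_xml)
    deleted = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        if header.get("status") == "deleted":
            deleted.append(header.findtext("oai:identifier", namespaces=OAI_NS))
    return deleted


print(deleted_identifiers(SAMPLE_LIST_RECORDS))
```

Note that the deleted record has a header but no metadata part, as the specification requires.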
Alternatively, as described in the Primo VE Guide*, the DIP can be run occasionally as "Reload and Delete". This option reloads the entire DIP from the repository and deletes any records from that DIP that already exist in Primo but were not included in the reload (because they were deleted in the source repository).
Also, as mentioned in the Guide, the "Harvest Start Date" field should be set appropriately, or cleared, before using this option.
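Conceptually, a full reload makes deletions visible by comparison: anything held locally that is missing from the complete reload must have been removed at the source. The sketch below shows that idea only. The reload_and_delete function, the identifiers, and the dictionary-based index are hypothetical and are not a description of how Primo VE implements the option.

```python
def reload_and_delete(index: dict, full_reload: dict):
    """Replace the local index with a full reload and report dropped records.

    Returns the new index and the sorted identifiers that were in the old
    index but absent from the reload.
    """
    removed = sorted(set(index) - set(full_reload))
    return dict(full_reload), removed


# Hypothetical state: record 3 was deleted in the source repository, but an
# earlier incremental harvest could not detect that.
index = {"oai:example:1": "v1", "oai:example:2": "v2", "oai:example:3": "v1"}

# A full reload returns every record the repository currently holds.
full_reload = {"oai:example:1": "v1", "oai:example:2": "v2"}

index, removed = reload_and_delete(index, full_reload)
print(removed)
```

This comparison only works when the reload covers the whole repository, which is why the "Harvest Start Date" field matters: a start date that limits the harvest would make older records look as if they had been deleted.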