Duplicate Detection Process
This information is not applicable to Primo VE environments. For more details on the Primo VE configuration, see Dedup and FRBR for Primo VE.
The duplicate detection, or matching, process creates a dedup vector for every PNX record in the Primo database. The vector includes all the data that the Duplication Detection algorithm requires to determine whether two records are equivalent.
Duplicate records are assigned the same MatchID. For every group of duplicate records, the dedup process creates a merged record based on a preferred record (which is selected by the Preferred Record algorithm). In the current version, the preferred record is selected based on its delivery category and on which record has the most fields in the Display section. The default priority is:
- SFX resource
- Online resource
- Metalib resource
- Physical item
- Microform
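The selection rule above can be illustrated with a minimal sketch. It is not the actual Preferred Record algorithm: it assumes that the delivery category is compared first and that the number of populated Display fields only breaks ties, which the text does not state explicitly. The record layout (a dict with `delivery_category` and `display` keys) and the names `DELIVERY_PRIORITY` and `select_preferred` are hypothetical.

```python
# Hypothetical priority table mirroring the default order listed above
# (lower number = higher priority). The labels are illustrative only.
DELIVERY_PRIORITY = {
    "SFX resource": 0,
    "Online resource": 1,
    "Metalib resource": 2,
    "Physical item": 3,
    "Microform": 4,
}


def select_preferred(records):
    """Return the record with the best delivery category.

    Assumption: ties within a category are broken by the number of
    populated fields in the record's Display section.
    """
    def sort_key(record):
        # Unknown categories sort after every known category.
        rank = DELIVERY_PRIORITY.get(
            record["delivery_category"], len(DELIVERY_PRIORITY)
        )
        display_count = sum(1 for value in record["display"].values() if value)
        # Negate the count so that more Display fields sorts first.
        return (rank, -display_count)

    return min(records, key=sort_key)
```

With this ordering, an online record with two Display fields would be preferred over an online record with one, and both would be preferred over a physical item regardless of how many Display fields the physical item has.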
In addition to the preferred record, fields from the Control, Links, Search, Facets, Additional Data, and Delivery sections are merged from all records and are included in the dedup merged record. All local fields are included. The availability and source fields from the Display section are also retained for every record.
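As a rough illustration of the merge described above, the following sketch models each record as a dict of sections, where each section maps field names to lists of values. This data model, the lowercase section keys, and the function name `build_merged_record` are assumptions made for the example, not Primo's internal representation. Dropping repeated values is also an assumption, and the handling of local fields is omitted.

```python
# Sections whose fields are merged from every record in the group.
# The lowercase keys are hypothetical stand-ins for the section names.
MERGED_SECTIONS = ("control", "links", "search", "facets", "addata", "delivery")

# Display fields retained for every record in the group.
RETAINED_DISPLAY_FIELDS = ("availability", "source")


def _append_unique(target, values):
    """Append values to target, skipping ones already present."""
    for value in values:
        if value not in target:
            target.append(value)


def build_merged_record(preferred, group):
    """Build a merged record from the preferred record and its group."""
    # Start from a copy of the preferred record so it is not mutated.
    merged = {
        section: {field: list(values) for field, values in fields.items()}
        for section, fields in preferred.items()
    }
    for record in group:
        if record is preferred:
            continue
        # Merge whole sections from the non-preferred records.
        for section in MERGED_SECTIONS:
            for field, values in record.get(section, {}).items():
                target = merged.setdefault(section, {}).setdefault(field, [])
                _append_unique(target, values)
        # Keep availability and source from each record's Display section.
        for field in RETAINED_DISPLAY_FIELDS:
            values = record.get("display", {}).get(field, [])
            target = merged.setdefault("display", {}).setdefault(field, [])
            _append_unique(target, values)
    return merged
```

In this sketch, other Display fields (such as a title) come only from the preferred record, while the listed sections accumulate values from the whole group.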
Only the merged record is retrieved and indexed by the Search Engine.
The dedup vector is created during the normalization process. Duplicate record detection is a separate process carried out after the records have been loaded to the Primo database.
When a record is loaded into the Primo database, the system first checks whether the record is new by searching for an existing record with the same record_id. If no such record is found, the record is considered new, and the system attempts to find a matching record based on the vector. If a matching record is found, the new record is assigned the MatchID of the record that it matched, and the matching process stops. If no matching record is found, the record is assigned a new MatchID.
If there is a match on the RecordID, the system compares the vector in the incoming record with the vector stored for that record in the Primo database. If the vectors are the same, the record keeps its current MatchID. If they differ, the record is treated as a new record (its MatchID is removed), and the system attempts to find a matching record for it, as described above.
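The loading and matching flow described above can be sketched as follows. This is an illustration only: the class `DedupStore`, its in-memory dictionary, and the integer MatchIDs are hypothetical, and the `vectors_match` callback is a placeholder for the actual Duplication Detection algorithm and its thresholds, which are not reproduced here.

```python
import itertools


class DedupStore:
    """Toy in-memory stand-in for the matching step (hypothetical)."""

    def __init__(self, vectors_match):
        # record_id -> {"vector": ..., "match_id": ...}
        self.records = {}
        self._next_match_id = itertools.count(1)
        # Placeholder for the real vector comparison.
        self._vectors_match = vectors_match

    def _find_match_id(self, record_id, vector):
        """Return the MatchID of the first matching record, else a new one."""
        for other_id, stored in self.records.items():
            if other_id == record_id:
                continue  # never match a record against itself
            if self._vectors_match(vector, stored["vector"]):
                return stored["match_id"]  # stop at the first match
        return next(self._next_match_id)

    def load(self, record_id, vector):
        """Load a record and return the MatchID assigned to it."""
        existing = self.records.get(record_id)
        if existing is not None and existing["vector"] == vector:
            # Known record with an unchanged vector keeps its MatchID.
            return existing["match_id"]
        # New record, or known record whose vector changed: the old
        # MatchID (if any) is discarded and matching starts over.
        match_id = self._find_match_id(record_id, vector)
        self.records[record_id] = {"vector": vector, "match_id": match_id}
        return match_id
```

For example, with `store = DedupStore(lambda a, b: a == b)`, loading two records with equal vectors gives them the same MatchID, and reloading one of them with a different vector sends it back through matching as if it were new.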
Once all matching records are located, the system creates a merged record.
For information on the Dedup algorithm or changing the thresholds, see Files Used by the Dedup Algorithm.
For more information about the Dedup process, see the following sections: