How to – Deduplicate records between two local data sources in Primo Back Office
Use case
We often have a physical-format record for our University of Queensland theses catalogued in Alma, and an electronic-format record for the same work stored in our institutional repository, UQ eSpace. Only our Alma record has data such as the Local Call Number, and only our UQ eSpace record has data such as the DOI. The out-of-the-box (OTB) template rules for Primo Back Office only include Dedup for the Alma pipe, not for the Generic Dublin Core template used for our UQ eSpace pipe, so two records for the same work appear for our users in our live Primo environment.
Goal
Reduce record duplication, while also improving search: both the physical and electronic formats of a work can be found via metadata that exists in only one of the two records.
Outcome
20% duplication reduction for University of Queensland Theses in our UQ eSpace Institutional Repository (4,200 of 21,000).
Implementation Summary
Configuration time: A few hours (but additional time for Sandbox testing)
Configuration difficulty: Medium
Consistency of outcome: Dependent on good and matching data
Changes live in Primo UI: Dependent on the number of records to index and hotswap, and on the twice-daily hotswap timing, but typically within 12 hours
System specs
Primo New UI, with a single local and remote blended search scope
Alma using MARC21, not using Alma-D
Institutional repository using Dublin Core, for example UQ eSpace
Primo Multi-tenant Back Office, with pipes for Alma and Institutional Repository
Summary of Key Data for Dedup
Records must reach 875 points to dedup, as a Full Match
When there is no match or partial match, there will be fewer points, zero points, or points deducted
Short title: initial quick-match score of -450 for no match, or 0 for a match
- Note: Even though it is already known that a Full Match is required (key data giving many points, such as ISBN, is not present in both data sources' records), the Short title rules MUST still be configured. There is a bug here: the Dedup System Test will show a match without these rules, and yet the records will not match in the UI. This has been reported
- Update 25.1.2021: Ex Libris have advised the following and I've requested that this be documented: "The test tool compares between 2 records which is a simpler method than the pipe itself. The pipe takes into consideration other records and data already loaded, so it might be possible that the mismatch was due to other data in the pipe which caused the records not to be deduped at the end. It is possible that some other data interfere in the pipe with the dedup mechanism."
If the following 4 elements match exactly, then there will be 965 points and the records will dedup
- Year of Publication: 008 pos 7-10 in Alma and dc:date in eSpace
- Full title: 245 a,b,n,p in Alma and dc:title in eSpace
- Country of Publication: qea in Alma and write constant qea for eSpace
- Main Entry aka Author: 100 a,b,c,d,q in Alma and dc:creator in eSpace
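To make the scoring concrete, here is a minimal Python sketch of the decision above. This is an illustration only, not the Back Office implementation: per-field point values are not broken out, only the stated totals (965 for an exact match on all four elements, against the 875 threshold), and `normalize` is a rough stand-in for the actual normalization rules.

```python
import re

DEDUP_THRESHOLD = 875    # points required for a Full Match, as stated above
FULL_MATCH_POINTS = 965  # combined score when all four key elements match exactly

def normalize(value: str) -> str:
    """Rough stand-in for the Back Office normalization routines:
    lowercase, strip a leading article, drop punctuation and whitespace."""
    value = value.lower().strip()
    value = re.sub(r"^(the|an|a|'a)\s", "", value)
    return re.sub(r"[^a-z0-9]", "", value)

def will_dedup(alma: dict, espace: dict) -> bool:
    """True when Year, Full Title, Country, and Main Entry all match
    after normalization (partial scoring is omitted in this sketch)."""
    keys = ["year", "full_title", "country", "main_entry"]
    all_match = all(normalize(alma[k]) == normalize(espace[k]) for k in keys)
    score = FULL_MATCH_POINTS if all_match else 0
    return score >= DEDUP_THRESHOLD
```

In practice the Back Office compares normalized PNX dedup fields across the whole pipe; this sketch only shows why an exact match on all four elements clears the 875-point bar.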
Extras - a failsafe rule was added for difficult matches to force a merge via C5 / F20 using field 992; this was the only change to the Alma dedup rules
- 61UQ_eSpace recordid added to the Alma 992 $a
Note: The dedup rules will need to be adjusted based on your site's data; the rules which follow were developed for our site and our data. Standard disclaimer: I take no responsibility for results at any other site, and strongly recommend extensive Sandbox testing.
Configuration Summary Steps
- Mapping Table - Publishing > Preferred Record-Delivery Category Priority
- Only necessary if you need to adjust your preference
- Normalization Rules - Alma > Dedup
- Only necessary if you want to add a failsafe for difficult matches to force a merge
- Normalization Rules - UQ eSpace > Dedup
- Normalization Rules and Code Table - UQ eSpace > Display, Search, Addata
- Only necessary for display fields duplicated in both local data sources, as the non-preferred format will be hidden
- Not detailed here with screenshots, but a summary of the adjustments for the display field Relation in both Alma and UQ eSpace:
- Display rules: Moved dc:relation from the OTB Relation to lds46 Related with templates for DOI:, ISSN:, and ORCID ID: by string exists
- Search rules: Left dc:relation in addtitle for boosting, but also added search_orcidid, search_issn, search_doi by string exists
- Addata rules: Added addata_orcidid and addata_doi by string exists (additional unrelated improvement)
- Code Tables: Add lds46 to FrontEnd Display Fields and Full Display Labels
- Views: Add Related
- Deploy All - technically only the Mapping Tables and Normalization Rules are required
- Run Update Pipe with Force Dedup ticked for the UQ eSpace pipe, i.e. the pipe being dedup'd into existing records
- Note: Make sure hotswap is not running and do this out of hours, as Force Dedup can cause issues in the UI such as failures to pin items to My Favourites
Step 1 - Adjust the Mapping Table - Publishing > Preferred Record-Delivery Category Priority
- As part of this work, the preference was changed to Alma-P, for the more consistent data and standard subject headings in Alma, rather than our Online Resource category records in UQ eSpace. This was also desirable for our Alma records in general, between Alma-P and Alma-E. While reviewing the existing entries in the mapping table, it was noted that it was OTB and not configured correctly for our Alma environment, so a full review was done; irrelevant entries were retained but placed lower in the hierarchy.
OTB entries
Updated entries
Step 2 - Normalization Rules - Alma > Dedup
Dedup_c5 (image) and same for Dedup_f20
Summary: Failsafe for difficult matches by Record ID (not limited by Thesis as merge rules will extend over time)
Step 3 - Normalization Rules - eSpace > Dedup
Note: First tick 'Display Empty PNX fields' to show the Dedup rules, as all are empty in the OTB Generic Dublin Core Template
Text for copying for Normalization Rules
- ^(The|An|A|'A)\s
- <<.*>>
- ˆ.*‰
- !@#$%^&*()_+-={}[]:";<>?,./~`@@^
- ^.{20}(.*).{10}$
- T00:00:00Z
- T00:00:00.000Z
- (^.{0,30})
- !@#$%^&*()_+-={}[]:";<>?,./~`@@^
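The strings above feed the Back Office transformation routines (remove strings, remove characters, regular expressions). As a rough Python illustration of the combined effect of a few of them — an assumption-laden sketch, not the Back Office logic itself (the function names and the interpretation of the punctuation string as a character set are mine):

```python
import re

# Characters from the punctuation string above, treated here simply as a set to strip.
CHARS_TO_REMOVE = set("!@#$%^&*()_+-={}[]:\";<>?,./~`^")

def short_title(title: str) -> str:
    """Sketch: drop a leading article, remove <<non-filing>> markers,
    strip the listed punctuation, then keep the first 30 characters
    (the (^.{0,30}) capture)."""
    title = re.sub(r"^(The|An|A|'A)\s", "", title)
    title = re.sub(r"<<.*>>", "", title)
    title = "".join(c for c in title if c not in CHARS_TO_REMOVE)
    return re.match(r"(^.{0,30})", title).group(1)

def pub_year(dc_date: str) -> str:
    """Sketch: drop the timestamp suffixes so only the date part remains."""
    return dc_date.replace("T00:00:00.000Z", "").replace("T00:00:00Z", "")
```

In the Back Office these transformations are chained inside the Dedup rule for each PNX field rather than written as code.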
Dedup_t
Summary: Add a rule to allow dedup by dc:type only for Thesis: set 1 for Thesis, and 99 if not Thesis
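A minimal sketch of the Dedup_t mapping as stated above (1 for Thesis, 99 otherwise); the function name is hypothetical, since in the Back Office this is a normalization rule, not code:

```python
def dedup_type(dc_type: str) -> str:
    """Sketch of Dedup_t: emit "1" for Thesis records and "99" for
    everything else, per the summary above."""
    return "1" if dc_type.strip().lower() == "thesis" else "99"
```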
Dedup_c3
Summary: Normalize dc:title for Short Title
Dedup_c4
Summary: Normalize dc:date for Year of Publication
Dedup_c5 (image) and same for Dedup_f20
Summary: Failsafe for difficult matches by Record ID (not limited by Thesis as merge rules will extend over time)
Dedup_f5
Summary: Normalize dc:title for Short Title
Dedup_f6
Summary: Normalize dc:date for Year of Publication
Dedup_f7
Summary: Normalize dc:title for Full Title
Dedup_f8
Summary: For Country of Publication, add constant qea to restrict more definitively to desired work of local theses and add more points
Dedup_f11
Summary: Normalize dc:creator for Main Entry
Ex Libris Documentation
- https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/Primo/Technical_Guide/030Duplicate_Detection_Process/030Duplicate_Detection_Vector
- https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/Primo/Technical_Guide/020Working_with_Normalization_Rules/050Transformation_Routines
- https://knowledge.exlibrisgroup.com/Primo/Knowledge_Articles/Using_Primo_Force_Dedup_and_Eshelf
- https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/020Primo_VE/040Dedup_and_FRBR_for_Primo_VE/010Understanding_the_Dedup_and_FRBR_Processes_(Primo_VE)
- https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/Primo/Technical_Guide/030Duplicate_Detection_Process/020Dedup_and_Transitivity
- https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/Primo/Technical_Guide/050Matching_Records_in_the_Serials_and_Non-Serials_Dedup_Algorithm
- https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/Primo/Technical_Guide/180Files_Used_by_the_Dedup_Algorithm
- https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/Primo/Technical_Guide/050Matching_Records_in_the_Serials_and_Non-Serials_Dedup_Algorithm/040Matching_Programs
- https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/Primo/Technical_Guide/050Matching_Records_in_the_Serials_and_Non-Serials_Dedup_Algorithm/030Structure_of_the_XML_File
- https://knowledge.exlibrisgroup.com/Primo/Product_Documentation/Primo/Technical_Guide/030Duplicate_Detection_Process
Feedback or suggestions
- Stacey van Groll
- Discovery and Access Coordinator
- University of Queensland
- s.vangroll@library.uq.edu.au