
    How to – Deduplicate records between two local data sources in Primo Back Office

    Created By: Stacey van Groll
    Created on: 1/22/2021

    Use case

    We often have a physical-format record for a University of Queensland thesis catalogued in Alma, and an electronic-format record for the same work stored in our institutional repository, UQ eSpace.  Only the Alma record has data such as the Local Call Number, and only the UQ eSpace record has data such as the DOI.  The out-of-the-box (OTB) template rules for Primo Back Office include Dedup only for the Alma pipe, not for the Generic Dublin Core template used for our UQ eSpace pipe, so two records for the same work appear for our users in our live Primo environment.



    Reduce duplication of records, while also improving search: both the physical and electronic formats of a work can be found by metadata that exists in only one of the two records.



    20% duplication reduction for University of Queensland Theses in our UQ eSpace Institutional Repository (4,200 of 21,000).  Many of the remaining records are born digital from 2008, i.e. they have no physical counterpart.

    • Note: Primo Analytics only reports numbers for records that have dedup'd 2 or more records within the same data source, not for records that have dedup'd across data sources, i.e. 1 record in one data source with 1 record in another data source


    Implementation Summary

    Configuration time: A few hours (but additional time for Sandbox testing)

    Configuration difficulty: Medium

    Consistency of outcome: Dependent on good and matching data

    Changes live in Primo UI: Dependent on the number of records to index and hotswap, and on the twice-daily hotswap timing, but typically within 12 hours


    System specs

    Primo New UI, with a single local and remote blended search scope

    Alma using MARC21, not using Alma-D

    Institutional repository using Dublin Core, for example UQ eSpace

    Primo Multi-tenant Back Office, with pipes for Alma and Institutional Repository


    Summary of Key Data for Dedup

    Records must reach 875 points to dedup, as a Full Match

    When there is no match or partial match, there will be fewer points, zero points, or points deducted

    Short title (Initial score, quick match): 450 points for a match or 0 for no match

    • Note: The Short title rules MUST be configured, even though it is already known that a full match is required because key data that gives many points, such as ISBN, is not present in both data source records.  Without these rules, the Dedup System Test will show a match and yet the records will not match in the UI.  This has been reported as a bug
      • Update 25.1.2021: Ex Libris have advised the following and I've requested that this be documented: "The test tool compares between 2 records which is a simpler method than the pipe itself. The pipe takes into consideration other records and data already loaded, so it might be possible that the mismatch was due to other data in the pipe which caused the records not to be deduped at the end. It is possible that some other data interfere in the pipe with the dedup mechanism."
      • Documentation note added 30.4.2021: "In some cases, the result of the test tool and the actual pipe may not have the same result. This is because the test tool compares only the two specified records, while a pipe also considers other records and data that are already loaded to determine whether records are duplicates."

    If the following 4 elements match exactly, then there will be 965 points and the records will dedup

    • Year of Publication: 008 pos 7-10 in Alma and dc:date in eSpace
    • Full title: 245 a,b,n,p in Alma and dc:title in eSpace
    • Country of Publication: qea in Alma and write constant qea for eSpace
    • Main Entry aka Author: 100 a,b,c,d,q in Alma and dc:creator in eSpace

    Extras - failsafe rules were added to empower staff to step in to the automated process if needed, such as to merge difficult matches or prevent undesirable matches. These are the only changes to Alma dedup rules

    • 61UQ_eSpace recordid added to the Alma 992 $a - to force a merge via C5 / F20 by 992
    • DONOTDEDUP added to the Alma 992 $a - to prevent a merge by addition of 99 in T
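    The interaction between the 875-point threshold and the two failsafes can be sketched as follows.  This is a minimal illustration only, not Primo's actual algorithm: the record structure, the function name, the example record ID, and the order in which the failsafes are checked are all assumptions made for the sketch.

```python
from dataclasses import dataclass

FULL_MATCH_THRESHOLD = 875   # points needed for a Full Match (from the summary above)
DO_NOT_DEDUP = "DONOTDEDUP"  # failsafe value stored in the Alma 992 $a


@dataclass
class AlmaRecord:
    """Hypothetical stand-in for an Alma record; only the 992 $a values matter here."""
    field_992a: tuple = ()


def should_merge(alma: AlmaRecord, espace_record_id: str, score: int) -> bool:
    """Decide whether an Alma record and an eSpace record would dedup.

    Assumption: the 'prevent' failsafe is checked before the 'force' failsafe.
    """
    if DO_NOT_DEDUP in alma.field_992a:
        return False  # staff have blocked any merge for this record
    if espace_record_id in alma.field_992a:
        return True   # staff have forced a merge by record ID
    return score >= FULL_MATCH_THRESHOLD


# "61UQ_eSpace_EXAMPLE" is a made-up record ID used only for illustration.
print(should_merge(AlmaRecord(), "61UQ_eSpace_EXAMPLE", 965))
```

    With no failsafe present, the four matching elements (965 points) clear the threshold and the records merge; a record carrying DONOTDEDUP never merges, whatever its score.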

    Note: The dedup rules will need to be adjusted based on your site's data, and the rules which follow are developed for our site and our data.  Standard disclaimer that I take no responsibility for results at any other site, and strongly recommend extensive Sandbox testing



    Configuration Summary Steps

    • Mapping Table - Publishing > Preferred Record-Delivery Category Priority
      • Only necessary if you need to adjust your preference
    • Normalization Rules - Alma > Dedup
      • Only necessary if you want to add failsafes for difficult matches, to force a merge or to prevent undesirable matches
    • Normalization Rules - UQ eSpace > Dedup
    • Normalization Rules and Code Table - UQ eSpace > Display, Search, Addata
      • Only necessary for display fields duplicated in both local data sources, as the non-preferred format will be hidden
      • Not detailed here by screenshots, but a summary of the adjustments for the display field Relation in both Alma and UQ eSpace follows
        • Display rules: Moved dc:relation from the OTB Relation to lds46 Related with templates for DOI:, ISSN:, and ORCID ID: by string exists
        • Search rules: Left dc:relation in addtitle for boosting, but also added search_orcidid, search_issn, search_doi by string exists
        • Addata rules: Added addata_orcidid and addata_doi by string exists (additional unrelated improvement)
        • Code Tables: Add lds46 to FrontEnd Display Fields and Full Display Labels
        • Views: Add Related
    • Deploy All - technically only Mapping Tables and Normalization Rules required
    • Run Update Pipe with Force Dedup ticked for the UQ eSpace pipe, i.e. the pipe being dedup'd into existing records
      • Note: Make sure hotswap is not running and do this out of hours, as Force Dedup can cause issues in the UI such as failures to pin items to My Favourites


    Step 1 - Adjust the Mapping Table - Publishing > Preferred Record-Delivery Category Priority

    • As part of this work, the preference was changed to Alma-P, for more consistent data and standard subject headings in Alma, rather than our Online Resource category records in UQ eSpace.  This was also desirable for our Alma records in general between Alma-P and Alma-E.  While reviewing the existing entries in the mapping table, it was noted that the table was still OTB and not configured correctly for our Alma environment, so a full review was done, and irrelevant entries were retained but placed lower in the hierarchy

    OTB entries

    Updated entries


    Step 2 - Normalization Rules - Alma > Dedup


    Dedup_c5 (image) and same for Dedup_f20

    Summary: Failsafe for difficult matches by Record ID (not limited by Thesis as merge rules will extend over time)



    Summary: Failsafe to prevent undesirable matches by the record having a 992 $a of DONOTDEDUP. Note: 992 is also configured by norm rule as searchable but not displayed in Primo, for ease of finding records with this failsafe in place.


    Step 3 - Normalization Rules - eSpace > Dedup

    Note: First tick 'Display Empty PNX fields' to show the Dedup rules, as all are empty in the OTB Generic Dublin Core Template


    Text for copying for Normalization Rules

    • ^(The|An|A|'A)\s
    • <<.*>>
    • &#x0088;.*&#x0089;
    • !@#$%^&*()_+-={}[]:";<>?,./~`@@^
    • ^.{20}(.*).{10}$
    • T00:00:00Z
    • T00:00:00.000Z
    • (^.{0,30})
    • !@#$%^&*()_+-={}[]:";<>?,./~`@@^
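    To see what some of the patterns above match before pasting them into Back Office, they can be tried outside Primo.  The following is a rough Python sketch only: the helper names are hypothetical, the order of the steps is an assumption, and the way each pattern is wired into a normalization rule (shown in the screenshots) is not reproduced here.  The non-sorting character pattern and the punctuation list are left out.

```python
import re

# Patterns copied from the list above.
LEADING_ARTICLE = re.compile(r"^(The|An|A|'A)\s")
ANGLE_BRACKETS = re.compile(r"<<.*>>")
FIRST_20_LAST_10 = re.compile(r"^.{20}(.*).{10}$")
FIRST_30 = re.compile(r"(^.{0,30})")
MIDNIGHT_SUFFIXES = ("T00:00:00.000Z", "T00:00:00Z")


def strip_title_noise(title: str) -> str:
    """Remove <<...>> spans and a leading article (step order is an assumption)."""
    title = ANGLE_BRACKETS.sub("", title)
    title = LEADING_ARTICLE.sub("", title)
    return title.strip()


def strip_midnight_timestamp(date: str) -> str:
    """Drop the midnight timestamp that a repository may append to dc:date."""
    for suffix in MIDNIGHT_SUFFIXES:
        date = date.replace(suffix, "")
    return date


def middle_capture(text: str):
    """Return what the group in ^.{20}(.*).{10}$ captures, or None if too short."""
    match = FIRST_20_LAST_10.match(text)
    return match.group(1) if match else None


def first_30(text: str) -> str:
    """Return what (^.{0,30}) captures: up to the first 30 characters."""
    return FIRST_30.match(text).group(1)


print(strip_title_noise("The ecology of coral reefs"))
print(strip_midnight_timestamp("2008-01-01T00:00:00Z"))
```

    The group in ^.{20}(.*).{10}$ captures everything between the first 20 and the last 10 characters, so a normalized title shorter than 30 characters does not match that pattern at all.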



    Summary: Add rule to allow dedup by dc:type only for Thesis by 1 and if not Thesis then 99
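    As a rough sketch of the logic this rule expresses (the function name and the case-insensitive comparison are assumptions for illustration, not part of the Back Office configuration):

```python
def dedup_type(dc_type: str) -> int:
    """Return the dedup type: 1 for a Thesis, 99 (excluded from dedup) otherwise.

    Assumption: the comparison ignores case and surrounding whitespace.
    """
    return 1 if dc_type.strip().lower() == "thesis" else 99


print(dedup_type("Thesis"))
print(dedup_type("Journal Article"))
```

    Only eSpace records that come out as type 1 are candidates for matching against Alma records; everything else in the repository is left alone.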



    Summary: Normalize dc:title for Short Title




    Summary: Normalize dc:date for Year of Publication



    Dedup_c5 (image) and same for Dedup_f20

    Summary: Failsafe for difficult matches by Record ID (not limited by Thesis as merge rules will extend over time)



    Summary: Normalize dc:title for Short Title




    Summary: Normalize dc:date for Year of Publication



    Summary: Normalize dc:title for Full Title



    Summary: For Country of Publication, add constant qea to restrict more definitively to desired work of local theses and add more points



    Summary: Normalize dc:creator for Main Entry


    Ex Libris Documentation


    Feedback or suggestions