
    How to – Deduplicate records between two local data sources in Primo Back Office

    Created By: Stacey van Groll
    Created on: 1/22/2021



    Use case

    We often have a physical-format record for our University of Queensland Theses catalogued in Alma, and an electronic-format record for the same work stored in our Institutional Repository, UQ eSpace.  Only our Alma record has data such as the Local Call Number, and only our UQ eSpace record has data such as the DOI.  The OTB template rules for Primo Back Office include Dedup only for the Alma pipe, not for the Generic Dublin Core template used for our UQ eSpace pipe, and therefore two records for the same work appear for our users in our live Primo environment.

     

    Goal

    Reduce duplication of records, while also improving search: both the physical and electronic formats of a work are found even when the matching metadata exists in only one of the records.

     

    Outcome

    20% duplication reduction for University of Queensland Theses in our UQ eSpace Institutional Repository (4,200 of 21,000).

     

    Implementation Summary

    Configuration time: A few hours (but additional time for Sandbox testing)

    Configuration difficulty: Medium

    Consistency of outcome: Dependent on good and matching data

    Changes live in Primo UI: Dependent on number of records to index and hotswap, and twice daily hotswap timing, but typically within 12 hours

     

    System specs

    Primo New UI, with a single local and remote blended search scope

    Alma using MARC21, not using Alma-D

    Institutional repository using Dublin Core, for example UQ eSpace

    Primo Multi-tenant Back Office, with pipes for Alma and Institutional Repository

     

    Summary of Key Data for Dedup

    Records must reach 875 points to dedup, as a Full Match

    When there is no match or partial match, there will be fewer points, zero points, or points deducted

    Short title: initial-score quick match, -450 (match) or 0 (no match)

    • Note: Even though it is already known that a full match is required (because key data giving many points, such as ISBN, is not present in both data source records), the Short Title rules MUST be configured.  There is a bug here: the Dedup System Test will show a match without these rules, and yet the records will not match in the UI - this has been reported
      • Update 25.1.2021: Ex Libris have advised the following and I've requested that this be documented: "The test tool compares between 2 records which is a simpler method than the pipe itself. The pipe takes into consideration other records and data already loaded, so it might be possible that the mismatch was due to other data in the pipe which caused the records not to be deduped at the end. It is possible that some other data interfere in the pipe with the dedup mechanism."

    If the following four elements match exactly, the records will score 965 points and dedup

    • Year of Publication: 008 pos 7-10 in Alma and dc:date in eSpace
    • Full title: 245 a,b,n,p in Alma and dc:title in eSpace
    • Country of Publication: qea in Alma, and the constant qea written for eSpace
    • Main Entry aka Author: 100 a,b,c,d,q in Alma and dc:creator in eSpace
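As a hedged sketch of the scoring logic described above: the per-field point values below are hypothetical, chosen only so that a full match of the four elements totals 965 and clears the 875 threshold stated in this article; Primo's actual per-field weights are set in the dedup rules themselves.

```python
# Hypothetical sketch of Primo-style dedup scoring. The individual field
# weights below are assumptions for illustration only; the 875 threshold
# and the 965 full-match total come from the article.
FULL_MATCH_THRESHOLD = 875

# Assumed per-field points (sum to 965 on a full match of all four elements)
WEIGHTS = {
    "year": 195,        # 008 pos 7-10 in Alma / dc:date in eSpace
    "full_title": 300,  # 245 a,b,n,p in Alma / dc:title in eSpace
    "country": 175,     # constant 'qea' in both
    "main_entry": 295,  # 100 a,b,c,d,q in Alma / dc:creator in eSpace
}

def dedup_score(alma_record: dict, espace_record: dict) -> int:
    """Sum points for each field that matches exactly after normalization."""
    return sum(points for field, points in WEIGHTS.items()
               if alma_record.get(field) is not None
               and alma_record.get(field) == espace_record.get(field))

# Invented sample records with identical normalized values -> full match
alma = {"year": "2019", "full_title": "ecology of queensland rainforests",
        "country": "qea", "main_entry": "smith jane"}
espace = dict(alma)

assert dedup_score(alma, espace) == 965
assert dedup_score(alma, espace) >= FULL_MATCH_THRESHOLD
```

Note that this ignores the negative scores Primo can apply on mismatches; it is only meant to make the "965 points clears the 875 bar" arithmetic concrete.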

    Extras - a failsafe rule was added for difficult matches to force a merge via C5 / F20 by 992, which was the only change to Alma dedup rules

    • 61UQ_eSpace recordid added to the Alma 992 $a

    Note: The dedup rules will need to be adjusted based on your site's data; the rules which follow were developed for our site and our data.  Standard disclaimer: I take no responsibility for results at any other site, and strongly recommend extensive Sandbox testing.

     

     

    Configuration Summary Steps

    • Mapping Table - Publishing > Preferred Record-Delivery Category Priority
      • Only necessary if you need to adjust your preference
    • Normalization Rules - Alma > Dedup
      • Only necessary if you want to add a failsafe for difficult matches to force a merge
    • Normalization Rules - UQ eSpace > Dedup
    • Normalization Rules and Code Table - UQ eSpace > Display, Search, Addata
      • Only necessary for display fields duplicated in both local data sources, as the non-preferred format will be hidden
      • Not detailed here by screenshots, but a summary of adjustments for the display field Relation in both Alma and UQ eSpace:
        • Display rules: Moved dc:relation from the OTB Relation to lds46 Related with templates for DOI:, ISSN:, and ORCID ID: by string exists
        • Search rules: Left dc:relation in addtitle for boosting, but also added search_orcidid, search_issn, search_doi by string exists
        • Addata rules: Added addata_orcidid and addata_doi by string exists (an additional, unrelated improvement)
        • Code Tables: Add lds46 to FrontEnd Display Fields and Full Display Labels
        • Views: Add Related
    • Deploy All - technically only Mapping Tables and Normalization Rules required
    • Run Update Pipe with Force Dedup ticked for the UQ eSpace pipe, i.e. the pipe being deduped into the existing records
      • Note: Make sure hotswap is not running and do this out of hours, as Force Dedup can cause issues in the UI, such as failures to pin items to My Favourites

     

    Step 1 - Adjust the Mapping Table - Publishing > Preferred Record-Delivery Category Priority

    • As part of this work, the preference was changed to Alma-P, for more consistent data and standard subject headings in Alma, rather than our Online Resource category records in UQ eSpace.  This was also desirable for our Alma records in general, between Alma-P and Alma-E.  While reviewing the existing entries in the mapping table, it was noted that the table was still OTB and not configured correctly for our Alma environment, so a full review was done; irrelevant entries were retained but placed lower in the hierarchy

    OTB entries

    Updated entries

     

    Step 2 - Normalization Rules - Alma > Dedup

     

    Dedup_c5 (image) and same for Dedup_f20

    Summary: Failsafe for difficult matches by Record ID (not limited by Thesis as merge rules will extend over time)

     

    Step 3 - Normalization Rules - eSpace > Dedup

    Note: First tick 'Display Empty PNX Fields' to show the Dedup rules, as all are empty in the OTB Generic Dublin Core Template

     

    Text for copying for Normalization Rules

    • ^(The|An|A|'A)\s
    • <<.*>>
    • &#x0088;.*&#x0089;
    • !@#$%^&*()_+-={}[]:";<>?,./~`@@^
    • ^.{20}(.*).{10}$
    • T00:00:00Z
    • T00:00:00.000Z
    • (^.{0,30})
    • !@#$%^&*()_+-={}[]:";<>?,./~`@@^
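As a hedged illustration of what some of these normalization strings do: the Python sketch below applies a few of them as regular expressions to invented sample values. Primo applies these strings through its own normalization routines (some as literal character lists rather than regexes), so this only approximates the effect.

```python
import re

# Invented sample values for illustration
title = "The Ecology of Queensland <<a study>>"
date = "2019-06-01T00:00:00Z"

# Strip a leading English article: ^(The|An|A|'A)\s
t = re.sub(r"^(The|An|A|'A)\s", "", title)

# Remove non-filing text wrapped in << >>: <<.*>>
t = re.sub(r"<<.*>>", "", t).strip()

# Strip punctuation, approximating the !@#$%^&*()... character list
t = re.sub(r"[!@#$%^&*()_+\-={}\[\]:\";<>?,./~`]", "", t)

# Drop the time portion from a Dublin Core timestamp: T00:00:00Z
d = re.sub(r"T00:00:00Z", "", date)

# Keep only the first 30 characters for the short title: (^.{0,30})
short_title = re.match(r"^.{0,30}", t).group(0)

print(short_title)  # "Ecology of Queensland"
print(d)            # "2019-06-01"
```

The same kind of normalization on both the Alma side and the eSpace side is what makes an exact-match comparison of the resulting strings feasible.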

     

    Dedup_t

    Summary: Add a rule to allow dedup by dc:type: 1 only for Thesis, and 99 if not Thesis
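The dc:type logic above can be sketched as a simple mapping (a hedged illustration of the rule's effect, not the actual normalization rule syntax): Thesis records get dedup type 1 and participate in this dedup; everything else gets 99 and stays out of the thesis match.

```python
def dedup_type(dc_type: str) -> str:
    """Mirror the Dedup_t rule's effect: Thesis records get type '1';
    anything else gets '99' (excluded from this thesis dedup)."""
    return "1" if dc_type.strip().lower() == "thesis" else "99"

assert dedup_type("Thesis") == "1"
assert dedup_type("Journal Article") == "99"
```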

     

    Dedup_c3

    Summary: Normalize dc:title for Short Title

     

     

    Dedup_c4

    Summary: Normalize dc:date for Year of Publication

     

     

    Dedup_c5 (image) and same for Dedup_f20

    Summary: Failsafe for difficult matches by Record ID (not limited by Thesis as merge rules will extend over time)

     

    Dedup_f5

    Summary: Normalize dc:title for Short Title

     

     

    Dedup_f6

    Summary: Normalize dc:date for Year of Publication

     

    Dedup_f7

    Summary: Normalize dc:title for Full Title

     

    Dedup_f8

    Summary: For Country of Publication, add the constant qea to restrict matching more definitively to the desired works (local theses) and to add more points

     

    Dedup_f11

    Summary: Normalize dc:creator for Main Entry

     

    Ex Libris Documentation

     
