
    PCI to CDI - A comparison of FRBR grouping and Match and Merge

    Created By: Stacey van Groll
    Created on: 6/12/2020



    Background

    The following is a comparison table that I created as part of our CDI transition project, to help understand the key differences between the Primo Central Index (PCI) FRBR grouping and the Central Discovery Index (CDI) Match and merge process, for non-unique content in the respective indexes.

    It is shared in the hope it is useful to others.

     

    Non-unique record handling | Primo Central Index (PCI) | Central Discovery Index (CDI)
    Methods

    FRBR group

    • A single record is presented to users, called the 'preferred record'
    • The record is chosen dynamically from the grouped records based on factors such as strong metadata. It is then enhanced with metadata from the other grouped records, creating the Virtual Super Record (VSR). This can be seen in the PNX of the preferred record in data such as <lsr30>VSR-Enriched:[subject, pqid, pages]</lsr30>
    • There is full visibility of the original data of all the grouped records, via 'See all'
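The VSR marker quoted above can be checked programmatically when reviewing raw PNX. The following is only a minimal sketch: it assumes the marker always has the `VSR-Enriched:[...]` shape shown in the example, and the function name `vsr_enriched_fields` is hypothetical rather than part of any Ex Libris tooling.

```python
import re


def vsr_enriched_fields(pnx_fragment):
    """Return the field names listed in a VSR-Enriched marker, or [] if absent."""
    # Assumption: the marker looks like VSR-Enriched:[field, field, ...]
    match = re.search(r"VSR-Enriched:\[([^\]]*)\]", pnx_fragment)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


sample = "<lsr30>VSR-Enriched:[subject, pqid, pages]</lsr30>"
print(vsr_enriched_fields(sample))
```

An empty list suggests the record was not enriched from other grouped records, at least as far as this marker indicates.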

    Match and merge

    • A single record is presented to users, called the 'logical record', which is also sometimes called the 'master' or 'super' record
    • The record is not a single chosen record from a particular vendor collection, but instead is a new record which is a synthesised combination of all available metadata points from two or more source records
    • There is no visibility of the original source records, which are called the 'physical records' or 'participant records'
    Data sources

    PCI external records do not group with local records, such as those from Alma

    No change: CDI external records do not group with local records, such as those from Alma
    Types of records

    Not fully documented for PCI, but grouping is known for at least 7 of 23 types:

    Articles, Books, Conference Proceedings, Journals, Newspaper Articles, Reviews, Text Resources

    June 2020: The CDI documentation has now switched to listing the types which will not merge, rather than the types which will merge. Notably, the content types listed are a mix of expected CDI types such as 'Report' and original Summon types such as 'Technical Report' (which is merged into 'Report' for CDI):

    Excluded types: Archival Material, Image, Microform, Music Recording, Patent, Report, Technical Report, Standard, Video Recording

    Documentation currently advises that the following 9 of 31 types will be merged, but it has been advised in SalesForce that this list is not complete and that content metadata is the key factor for merge (see Conditions below):

    Articles, Book, Book Reviews (previously Reviews), Conference Proceedings, Dissertations, Journals, Newspapers (new type), Newspaper Articles, Reference Entries

    Record Presentation

    1. TN_elsevier_sdoi_10_1016_j_forsciint_2011_01_009 - PREFERRED RECORD
    2. TN_scopus2-s2.0-79952443942
    3. TN_wos000288200400063
    4. TN_proquest864989457
    5. TN_medline21288672
    6. TN_proquest857131906
    7. TN_crossref10.1016/j.forsciint.2011.01.009
    8. TN_proquest1033482587
    9. TN_proquest1671334784

    20200614_PCI_Example_Preferred_Record_SvG - raw PNX

    20200614_PCI_Example_Full_Record_SvG - screenshot of full record display

    TN_cdi_proquest_journals_1033482587

    Source details lists 3 collections in the full record display, but the PNX contains 47 DBIDs (also known as Provider Package Codes)

    • Research Library
    • MEDLINE (Ovid)
    • Proquest

    20200614_CDI_Example_Logical_Record_SvG - raw PNX

    20200614_CDI_Example_Full_Recod_SvG - screenshot of full record display

    Conditions

    • FRBR grouping for PCI is based on the local FRBRization process, but PCI FRBR is also a proprietary (IP) process by Ex Libris, and is therefore not wholly documented
      • Local - Primo will assign keys to important bits of data within a record, creating a vector for that record based on one or more keys
        • These keys are based purely on Title and Author fields
          • Title only match: If there is a 130 Main Entry Uniform Title
            • This is sufficient to match only on the Title as the K2 ID
          • Title + Author match: If there is no 130 Main Entry Uniform Title
            • The match points will be a combination of 100, 110, 111, 700, 710, 711 as the K1 ID and 240, 245, 246, 247, 740, 245 as the K3 ID
            • If a title and author match is required, the record must have an instance of an author field (1xx, 7xx) in order to create the required key hash and FRBR correctly
              • For example: Macquarie dictionary with a 245 but no 700 will never become part of a FRBR group
    • PCI - The process is based on unique keys, which are varying combinations of metadata elements received from vendors (for example, DOI+Publication_Year)

    • The Grouping process sets a unique group ID for each Primo Central record based on the PNX metadata.

    • In some cases where a record's PNX contains poor metadata, e.g. lack of identifiers (ISSN/ISBN), the record might not get grouped as expected.

    • According to defined Grouping rules (FRBR), records will receive the same group ID if they describe the same item

    • It also incorporates matching checks on additional data such as identifier (eg ISBN / ISSN)

    • Resource type in the adddata PNX is also a FRBR key, to prevent different types from grouping together
    • In order to avoid grouping of records with different types (book with article for example), one of the FRBR keys is based on the record type from the Additional Data section in the PNX

    • These keys are built in order to cover the possible indexing differences between the vendors and group similar records together

    • These keys are calculated by default for all Primo Central records (including non-FRBRed records), to ensure that each new record that matches an existing one in the index will be grouped properly

    • In sum, even records that are not in a FRBR group will have a FRBR ID and form their own group of one
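The key-based grouping described above can be illustrated with a small sketch. It models only the local title and author rules (130 alone, otherwise author plus title) and the idea that every record receives a group ID. The normalization, the key labels, and the function names are assumptions made for illustration. The actual Primo key generation and hashing are proprietary, and the resource type key is not modelled here.

```python
AUTHOR_TAGS = ("100", "110", "111", "700", "710", "711")
TITLE_TAGS = ("240", "245", "246", "247", "740")


def normalize(value):
    # Assumed normalization: lowercase, letters and digits only
    return "".join(ch for ch in value.lower() if ch.isalnum())


def frbr_keys(record):
    """Build match keys for a record given as a dict of MARC tag -> value."""
    if "130" in record:
        # Uniform title present: the title alone is sufficient
        return {("K2", normalize(record["130"]))}
    authors = {normalize(record[t]) for t in AUTHOR_TAGS if t in record}
    titles = {normalize(record[t]) for t in TITLE_TAGS if t in record}
    # Without an author field this set is empty, so the record never groups
    return {("K1+K3", a, t) for a in authors for t in titles}


def assign_group_ids(records):
    """Give every record a group ID; unmatched records form a group of one."""
    key_to_group = {}
    group_ids = []
    next_id = 0
    for record in records:
        keys = frbr_keys(record)
        existing = [key_to_group[k] for k in keys if k in key_to_group]
        if existing:
            group = min(existing)
        else:
            group = next_id
            next_id += 1
        for k in keys:
            key_to_group.setdefault(k, group)
        group_ids.append(group)
    return group_ids


records = [
    {"100": "Smith, J", "245": "Forensic Science"},
    {"700": "Smith, J", "245": "Forensic science"},
    {"245": "Macquarie dictionary"},  # no author field: never grouped
    {"130": "Bible", "245": "Holy Bible"},
    {"130": "Bible"},
]
print(assign_group_ids(records))
```

The first two records share an author and title key and land in one group, the record with only a 245 stays in a group of one, and the two records with a 130 group on the uniform title alone.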

    Match and merge is a rule-based process. A series of criteria governs which participant records may merge into a logical record and under what conditions, while filters determine which records may not merge. The input is the participant records and the output is the logical record. The process is a balance between overly strict rules, which would leave many duplicates in discovery, and overly loose rules, which would merge records incorrectly. The following is distilled from the Match and Merge in CDI documentation (originally made available in May 2020 and updated in June 2020) and The Match & Merge Story webinar (January 2020).

    Note: The June 2020 update references some unexpected content types, namely original Summon types such as 'Magazine Article' and 'Trade Publication Article', rather than only the CDI type of 'Article':

    • Matching Rules with key identifiers
      • Identifiers are a key factor for the match and merge process
      • There are two types of identifiers: External or Internal
      • External identifiers include ISBN, EISSN, DOI, PMID etc
      • Internal identifiers are assigned to records by Ex Libris, to represent a title in the central index
    • Internal Identifiers
      • Records with an ISSN, EISSN, ISBN, or EISBN are assigned an internal title level identifier
      • These internal identifiers are used differently for the match and merge process, depending on if the match under consideration is at the article level or publication level
      • Article level
        • Relevant for types: Journal Article, Magazine Article, Newspaper Article, Trade Publication Article, Book Review, and Conference Proceeding
        • Match requirements: Match to the title level identifier, as well as to the DocumentTitle by fuzzy title match, year of publication, Volume, Issue, and StartPage
        • June 2020 detail removed: The most common match and merge scenario in use for CDI
      • Publication level
        • Relevant for types: Newspaper, Magazine, Journal, and Book
        • Match requirements: The only requirement is match to the title level identifier
    • External Identifiers
      • DOI – Match requirements: DOI to DOI, as well as fuzzy title match. Exclusions: Does not apply to Journal or eJournal
      • PMID (PubMed ID) – Match requirements: PMID to PMID, as well as fuzzy title match. Exclusions: Does not apply to Journal or eJournal
      • ISBN/EISBN – ISBN/EISBN to ISBN/EISBN, as well as fuzzy title match and year of publication must be within one year of each other
      • ISSN/EISSN – ISSN/EISSN to ISSN/EISSN, as well as a match to either PublicationPlace or publication year. Also: Applies only to Journal or eJournal
      • LCCN (Library of Congress control number) – LCCN to LCCN. Variations: If Journal or eJournal, either PublicationPlace or publication year must also match. If Book, eBook, Dissertation, or Government Document, fuzzy title match and year of publication match are also required
      • OCLC – OCLC to OCLC. Variations: If Journal or eJournal, either PublicationPlace or publication year must also match. If Book, eBook, Dissertation, or Government Document, fuzzy title match and year of publication match are also required
    • Additional factors and notes
      • "Fuzzy title match" is a comparison of two records' combined DocumentTitle and DocumentSubtitle fields, disregarding case, whitespace, punctuation, diacritics, and other special characters. A common variation is '&' aka ampersand and 'and'
      • Dissertation type records may merge by URL alone, if they have the same URL (direct link)
      • Reference Entry type records may merge only by fuzzy title match (June 2020 detail removed: if they are published within one year of each other)
      • Publication date year - June 2020 detail removed: Exception: The rule for eBooks is to prevent merge if the difference in publication dates is greater than three years
      • Transitive Merge
        • This occurs when several records are merged, where both records A and C match with record B as a result of metadata in record B that unites records A and C, which would otherwise not be matched
        • These are known to lead to undesirable results, for example when eBooks carrying multiple ISBNs from multiple editions are merged incorrectly
        • 'Transitive merges' allow for leveraging of metadata from 'Enrichment records', which usually include supplemental metadata such as the full text of articles used for search relevancy, and citation counts from Scopus or Web of Science. These 'Enrichment records' are only merged with their 'Parent record' or 'Metadata record', which can then merge with other records for the same citation
    • Filters to block matching
      • Most common
        • Mismatched DOI or PMIDs
        • Same URL, if the records are from the same content provider and received at the same time
        • Language - by language in the metadata directly or the language detected by the system
      • Content types
        • See 'Types of records' section above, for a list of types always excluded from matching
        • The key for merging is metadata, not content type, and other than types always excluded from matching and other identifier & type mixes detailed in this guide, there are no specific restrictions to stop records of different types from merging
        • For example, there is no rule to stop a Text Resource from merging with an Article, or an Article from merging with a Review
      • Additional
        • June 2020 detail removed: Government Document type is not merged with Book type
        • June 2020 detail removed: The URLs are not the same
        • June 2020 detail added: Newspaper Articles with a date of publication prior to January 1, 2000 will never be merged
        • The source of the content is an Institutional Repository
        • "Bad Titles" - a list maintained by Ex Libris of very simple, short, or generic titles, due to high probability of false positives
        • "Overmatch" filter - titles which occur more than 4,000 times in the index
      • Also, some content providers specifically request that their records be prevented from merging with other records in the index. These have a 'Record exclusion flag' set, which excludes the records from the process. Ex Libris may also choose to set this flag
        • Artstor
        • CABI Direct content
        • CAIRN International Journals
    • Data surfaced in UI
      • The metadata is surfaced from all the available records ie those returned by the query, giving the user the most complete metadata for the citation that is available for them, taking into account subscriptions and customizations
      • However, if there are multiple possible values for a given data field, only one version is selected for display, based on a predefined hierarchy of record types.  For example, if there are two records in the logical record that have the same author value but they capitalize the author's name slightly differently, only one of those values will be displayed, rather than both 
      • Participant records that the user does not have access to ie which are not active, will not display, even if they are included in the logical record
      • The process uses the logical record and if the logical record is not available, it uses the physical record
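Three of the ideas above (fuzzy title match, an external identifier rule, and transitive merge) can be sketched together. This is an illustration under stated assumptions and not the CDI implementation: records are plain dicts with hypothetical keys `title`, `subtitle`, `type`, `doi` and `pmid`, the '&' to 'and' substitution is one assumed way of handling that variation, and none of the blocking filters are modelled.

```python
import unicodedata


def fuzzy_title(title, subtitle=""):
    """Normalize a combined title and subtitle for comparison."""
    text = f"{title} {subtitle}".replace("&", " and ")
    # Decompose accented characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Disregard case, whitespace, punctuation and other special characters
    return "".join(ch for ch in text.casefold() if ch.isalnum())


def identifier_match(a, b):
    """DOI or PMID match plus fuzzy title match; not applied to journals."""
    if a.get("type") in ("Journal", "eJournal") or b.get("type") in ("Journal", "eJournal"):
        return False
    same_title = fuzzy_title(a["title"], a.get("subtitle", "")) == fuzzy_title(
        b["title"], b.get("subtitle", "")
    )
    for key in ("doi", "pmid"):
        if a.get(key) and b.get(key) and a[key].lower() == b[key].lower():
            return same_title
    return False


def merge_groups(records, matches=identifier_match):
    """Group record indexes so that pairwise matches chain transitively."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if matches(records[i], records[j]):
                parent[find(j)] = find(i)
    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


participants = [
    {"title": "Crime & Punishment", "doi": "10.1000/xyz"},
    {"title": "Crime and punishment.", "doi": "10.1000/XYZ", "pmid": "123"},
    {"title": "Crime and Punishment", "pmid": "123"},
    {"title": "Crime and Punishment", "doi": "10.1000/other"},
]
print(merge_groups(participants))
```

In the sample data the first and third records share no identifier, yet they end up in the same group because the second record carries both the DOI and the PMID. This is the transitive behaviour described above, and it also shows how one record with mixed identifiers could pull unrelated records together.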
    Record IDs

    All PCI record IDs begin with a TN_ prefix (it is unknown and undocumented what TN stands for)

    eg TN_wos3452455

    Can be searched without TN_ and with the multiple character wildcard to find all the content eg wos*

    For CDI, the TN_ prefix remains the same, but cdi_ has also been added as a prefix

    eg TN_cdi_wos3452455

    From May 2020, can be searched without TN_ and with the multiple character wildcard to find all the content eg cdi_wos*
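The prefix conventions above are easy to handle in a script, for example when comparing exported record IDs between the two indexes. A minimal sketch follows. The helper names are hypothetical, and the source code (such as wos) is passed in explicitly rather than guessed from the ID.

```python
def search_form(record_id):
    """Drop the leading TN_ so the ID can be used as a search term."""
    return record_id[3:] if record_id.startswith("TN_") else record_id


def is_cdi_id(record_id):
    """True when the ID carries the cdi_ prefix added for CDI."""
    return search_form(record_id).startswith("cdi_")


def wildcard_query(source, cdi=False):
    """Build the multiple character wildcard search for one source."""
    return f"cdi_{source}*" if cdi else f"{source}*"


print(search_form("TN_wos3452455"))
print(is_cdi_id("TN_cdi_wos3452455"))
print(wildcard_query("wos", cdi=True))
```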

    Permalinks

    Essentially the same as for PCI, but the record id presented via the PNX bookmarklet may change more dynamically when the content is not unique, as Primo will choose one of the source recordids from the participant records at random

    Linking scenarios
    • The preferred record is Link Resolver: All matched Alma service/s links are presented
    • The preferred record is Link in Record: The static link in the PNX is presented
    • Opening the 'See all' group and the full record display of each of the source records may display varying links, dependent on the success of the OpenURL matching to the source data for Link Resolver collections, and showing the single static link for Link in Record collections

     

    When a user clicks on a matched and merged record, the API selects what it determines to be the most reliable link available, based on availability rights by subscriptions, and the available record metadata

    The 'most reliable link' is a preference for using a URL directly within the record or a URL derived from the record's metadata. If this is not possible, or if the availability rights by subscriptions do not indicate access is possible, then an OpenURL query is sent by Alma Link Resolver

    • All participant records are from Link Resolver collections (estimated at 60%):
      • The logical record will use the Link Resolver, and the behaviour is unchanged
    • All participant records are from Link in Record collections (estimated at 40%):
      • Only one link will be displayed, with priority to the most reliable link by collections which use a unique document ID from the source platform
    • The linking type for the participant records is a mix of active Link Resolver and Link in Record collections (estimated as rare):
      • Link in Record takes precedence over Link Resolver (which use OpenURL)
      • Link in Record links which use a unique document ID will be preferred overall
      • This will change in July 2020 so that
        • If a Link in Record collection is Open Access: Link Resolver is used
        • If a Link in Record collection is not Open Access: Link in Record is used
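The CDI link precedence described above can be written out as a small decision function. This is a sketch of one reading of the rules and not the actual API logic. The dict keys (`collection`, `linking`, `unique_doc_id`, `open_access`) are hypothetical. It also assumes that, under the July 2020 rules, only Open Access Link in Record collections in a mixed record defer to the Link Resolver.

```python
def choose_link(participants, july_2020_rules=False):
    """Return (method, collection) for the link shown on a logical record."""
    in_record = [p for p in participants if p["linking"] == "link_in_record"]
    resolver = [p for p in participants if p["linking"] == "link_resolver"]
    if july_2020_rules and resolver:
        # Mixed case: Open Access Link in Record collections defer to the Link Resolver
        in_record = [p for p in in_record if not p.get("open_access", False)]
    if not in_record:
        return ("link_resolver", None)
    # Prefer collections that link by a unique document ID (stable sort keeps order otherwise)
    in_record.sort(key=lambda p: not p.get("unique_doc_id", False))
    return ("link_in_record", in_record[0]["collection"])


mixed = [
    {"collection": "A", "linking": "link_resolver"},
    {"collection": "B", "linking": "link_in_record", "open_access": True},
]
print(choose_link(mixed))
print(choose_link(mixed, july_2020_rules=True))
```

With the sample mixed record, the Link in Record collection wins under the original behaviour, while the July 2020 rules hand the Open Access case to the Link Resolver.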
    Display variations  

    Overall: Documentation indicates that identical values in the same field will be deduped for display, search, "and so forth", so that only one instance of the data is displayed

    The index consolidation work also included decisions per collection, where content was aligned between PCI and Summon, as to which collection to use going forward. Unfortunately, much supplementary data in PCI collections, such as Source information and backlinks, was not present in Summon collections, and was lost in order to benefit from other metadata considered to be of higher quality, such as Subjects

    Practical outcome noted:

    • Subject field for display: Subject headings are not deduplicated for capitalization variations, for example medical procedure, Medical procedure, Medical Procedure and MEDICAL PROCEDURE will all be listed, if present in the source physical records. The list is also in no discernible order ie not alphabetical, and is a mix of broad and narrow terms
    • Source field for display, search, facets, etc: This data is deliberately stripped out, or is present in the PNX but not made usable in UI, when the source is only active for search and is not active for full text rights
    • Links section for display: This data is missing for many records, even when full text rights are present
    • Identifiers used for linking: Issues have been noted with incorrect matching. The root cause advised was that the correct identifier was not present in the record, which is contrary to the documented behaviour that all metadata is included in the logical record
    PNX indications

    FRBR Grouping information is also available in the PNX to show if the record is part of a FRBR group or not

    • If frbrType =5, then the record is part of a FRBR group
    • If frbrType =6, then the record is not part of a FRBR group
    • If frbrType =7, the field cannot indicate whether the record is part of a FRBR group
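When checking many PNX records, the three frbrType values above can be mapped to their meaning with a simple lookup. A minimal sketch, assuming the frbrType value has already been extracted from the PNX (the extraction itself is not shown, and the function name is hypothetical):

```python
FRBR_TYPE_MEANING = {
    "5": "part of a FRBR group",
    "6": "not part of a FRBR group",
    "7": "cannot indicate whether the record is part of a FRBR group",
}


def describe_frbr_type(value):
    """Translate a frbrType value into the meaning listed above."""
    return FRBR_TYPE_MEANING.get(str(value).strip(), "unknown frbrType value")


print(describe_frbr_type(5))
print(describe_frbr_type(" 6 "))
```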

    FRBR Grouping information: It appears that all CDI records may have a frbrType of 5, whether they are merged or not

    • Additional:
      • There is a significant apparent change with the use of PNX data on CDI
      • In PCI, bar the occasional data issue, customers can rely on the PNX as a source of truth, for example, what is in the display section is displayed in UI Brief Results and Full Record Display, what is in the facet section is presented in UI as a selectable value in the Facet list, what is in the search section is searchable to return results in UI (except for expected full text, toc, abstract extension search, shown by snippets), etc.
        • This understanding is longstanding and well-documented by Ex Libris: The PNX record (and child pages) - "The data in the PNX record is organized in sections, each section containing information for a specific purpose."
      • It is also expected that this data is married together for flow-on logical behaviour in UI ie what is searchable is also what is displayed to users, so that a user may clearly know by term highlighting why a result is returned, what is displayed in a facet such as Subject will also be shown as a Subject for display so that a user may further utilise features such as lateral links, and what is exported via addata to functions such as Export RIS will be the same as what is displayed so that users may accurately preserve records of their research, etc
      • In CDI, this source of truth is found to be no longer stable, with examples such as the above no longer holding, plus many more, such as source data being deliberately hidden in expanded search despite being found in the PNX, so that Source facet entries are not presented in the UI
      • This change is described as 'apparent' because Ex Libris has failed to document or communicate this change transparently, in terms of either what is now expected with CDI or the reasons why this stability has been undermined. Undocumented behaviour will always be experienced as unexpected. Therefore, given that CDI's unexpected behaviour both fails to support Library staff in activities such as troubleshooting issues and user queries, and results in negative outcomes directly for users in the UI, this change is interpreted as regressive and a defect

     
