Ex Libris Knowledge Center

    Understanding the Dedup and FRBR Processes (Primo VE)

    This topic describes the Dedup and FRBR processes, which are used by Primo VE to combine or group redundant records in the brief results.

    The Dedup and FRBR Test Utility page (Configuration Menu > Discovery > Other > Dedup and FRBR Test Utility) allows you to compare and analyze records to see whether they meet the criteria (described in the following sections) to be considered duplicate records. For more information, see Dedup and FRBR Analysis Tool for Primo VE.

    Duplication Detection (Dedup)

    The duplication detection or matching process is based on creating a dedup vector for every Alma record. The vector includes all the data required by the Duplication Detection algorithm to determine if two records are equivalent.  The vectors include one or more keys that identify the record. Keys are based on the metadata fields in the source record (MARC record).

    In the search results, records that are marked as duplicates are displayed as a single record. The metadata is displayed from the first record in the result set, and the delivery-related information is created from all the records in the group.

    FRBR

    Primo VE records are grouped using the principles in the Functional Requirements for Bibliographic Records published by the IFLA Study Group on the Functional Requirements for Bibliographic Records. The grouping process creates a vector for every record. This vector includes one or more keys that identify the group it represents. Keys are based on the metadata fields in the source record (MARC record).

    Records that have a matching key (Primo VE attempts to match all keys in the record) are added to a FRBR group and assigned a FRBR ID. Because records can belong to one FRBR group only, Primo VE terminates the grouping process for a record as soon as it is matched with a group.

    The group IDs are indexed and used during searches to group search results by the same group ID. A record that belongs to a FRBR group displays the Multiple versions exist. See all versions link in the brief results, which allows users to see all records that have been grouped together. The record that displays in the brief results can be configured to display either information for a preferred record or generic information that pertains to all records in the group. After a search is performed, results are grouped by FRBR ID. After the user clicks the Multiple versions exist. See all versions link, the results are grouped by Dedup ID.
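
    The two levels of grouping described above can be illustrated with a minimal sketch. The hit structure and the field names mms_id, frbr_id, and dedup_id are hypothetical and chosen only for illustration; they are not Primo VE identifiers.

```python
from collections import defaultdict

def group_by(hits, id_field):
    """Collect record identifiers under the value of the given ID field."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit[id_field]].append(hit["mms_id"])
    return dict(groups)

# Hypothetical search hits; all field names are assumptions for illustration.
hits = [
    {"mms_id": "991", "frbr_id": "F1", "dedup_id": "D1"},
    {"mms_id": "992", "frbr_id": "F1", "dedup_id": "D1"},
    {"mms_id": "993", "frbr_id": "F1", "dedup_id": "D2"},
    {"mms_id": "994", "frbr_id": "F2", "dedup_id": "D3"},
]

# Brief results: one entry per FRBR group.
brief_results = group_by(hits, "frbr_id")
# "See all versions" for group F1: one entry per Dedup group.
versions = group_by([h for h in hits if h["frbr_id"] == "F1"], "dedup_id")

print(brief_results)  # {'F1': ['991', '992', '993'], 'F2': ['994']}
print(versions)       # {'D1': ['991', '992'], 'D2': ['993']}
```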

    How Dedup and FRBR Work Together

    The Dedup and FRBR matching processes are very similar, but the content of the keys is different. The system first looks for duplicate records and then looks for FRBR groups. Because the Dedup ID is used as a key for the FRBR process, a record that is Deduped will also be FRBRized.

    Calculating the Keys

    The Dedup and FRBR keys determine whether records match. They are generated by extracting specific data elements from the bibliographic record and combining them into keys. The sections below first describe how the key information in the MARC21 fields is mapped to the database in Primo VE and then describe how the elements are combined to create the Dedup and FRBR keys.

    Mapping Dedup Key Fields

    The following table shows how the fields from each MARC21 source record are mapped and stored in Primo VE for Dedup keys.

    | Key Field | Content | Type | MARC 21 Field | DC Field | UNIMARC Field |
    | C1 | LCCN | | 010 a,z | dcterms:identifier dcterms:LCCN | |
    | C2 | ISBN | 1 | 020 a,e,z | dcterms:identifier dcterms:ISBN | 010 a,z |
    | C2 | ISSN | 2 | 022 a,e,y,z | dcterms:identifier dcterms:ISSN; dcterms:identifier dc:ISSN | 011 a,y,z |
    | C3 | Title | | 245 a,b,n,p | dc.title; dcterms.title | 200 a,e,h,i |
    | C4 | Date | 1 | "LDR".MaterialType does not equal "CR" AND MARC.control is "008" | dc.date; dcterms.date; dcterms.created | LDR position 7 = m or c AND MARC.control is "100" a position 9-12 |
    | C4 | Place of publication | 2 | MARC.control."LDR".MaterialType equals "CR" AND MARC is "260"["0"]."a"["0"] | N/A | LDR position 7 = a (component part) or i (integrating resource) or s (serial) AND MARC is "210"."a" |
    | C5 | MMS ID | 1+2 | 035 a,z; MMS ID (Alma data) | Not applicable for external data | 035 a,z |
    | F1 | LCCN | 1+2 | 010 a | dcterms:identifier dcterms:LCCN | N/A |
    | F2 | LCCN_invalid | 1+2 | 010 z | | N/A |
    | F3 | ISBN (13 or 10 digits). In cases in which a record has more than one ISBN, an F3 key will be generated for each ISBN instead of appending them together. | 1 | 020 a,e | dcterms:identifier dcterms:ISBN | 010 a |
    | F3 | ISSN | 2 | 022 a,e; 776 x | dcterms:identifier dcterms:ISSN; dcterms:identifier dc:ISSN | 011 a |
    | F4 | ISBN_invalid | 1 | 020 z | N/A | 010 z |
    | F4 | ISSN_invalid | 2 | 022 y | N/A | 011 y |
    | F5 | Brief title | 1 | 245 a,b,n,p | dc.title; dcterms.title | 200 a |
    | F5 | ISSN_cancelled | 2 | 022 z | N/A | 011 z |
    | F6 | Start publication year | 1+2 | 260 c; 008 (position 7-10) | dc.date; dcterms.date; dcterms.created | 210 d; 100 a (position 9-16) |
    | F7 | Full title | 1+2 | 245 a,b,n,p | dc.title; dcterms.title | 200 a,e,d,h,i |
    | F8 | Country of publication | 1 | 008 (position 15-17) | N/A | LDR position 7 = m or c AND MARC is "102"."a" |
    | F8 | Brief title | 2 | 245 a | dc.title; dcterms.title | LDR position 7 = a (component part) or i (integrating resource) or s (serial) AND MARC is "200"."a" |
    | F9 | Pagination | 1 | 300 a | N/A | LDR position 7 = m or c AND MARC is "215"."a" |
    | F9 | Country of publication | 2 | 008 (position 15-17) | N/A | LDR position 7 = a (component part) or i (integrating resource) or s (serial) AND MARC.control is "102" "a" |
    | F10 | Publisher | 1 | 260 b | dcterms.publisher; dc.publisher | LDR position 7 = m or c AND MARC is "210"."c" |
    | F10 | Place of publication | 2 | 260 a | N/A | LDR position 7 = a (component part) or i (integrating resource) or s (serial) AND MARC is "200"."a" |
    | F11 | Main entry (author, corporate body, meeting) | 1 | 100 a,b,c,d,q; 111 a,c,d,e,n,q | dc.creator; dcterms.creator; dcterms.contributor; dc.contributor | LDR position 7 = m or c AND 700 a,b,c,d,f; 710 1st ind. = 1 a-h |
    | F11 | Main entry (author, corporate body, meeting) | 2 | 110 a,b,c,d,e,n; 111 a,c,d,e,n,q; 130 a,d,l,m,n,o,p,r,s,t | dc.creator; dcterms.creator; dcterms.contributor; dc.contributor | LDR position 7 = a (component part) or i (integrating resource) or s (serial) AND 710 1st ind. = 0 a,b,c,g,h; 710 1st ind. = 1 a-h; 500 a,b,h,i,k,l,m |
    | F12 | Control nr | | 035 a,z | dcterms:identifier; dcterms:URI where URI starts with the following: http://hdl.handle.net/ (for more information about control numbers in DC, see http://www.ukoln.ac.uk/metadata/dcmi-ieee/identifiers/) | 035 a,z |
    | F13 | | | MARC.control is "001" | N/A | MARC.control is "001" |
    | F14 | Other standard | | 024 a | N/A | 013 a; 015 a; 016 a; 017 a |
    | F15 | Other standard | | 024 z | N/A | 013 z; 015 z; 016 z; 017 z |

    Mapping FRBR Key Fields

    The following table shows how the fields from each MARC21 source record are mapped and stored in Primo VE for FRBR keys.

    | Field ID | Field Content | MARC21 Field | DC Field | UNIMARC Field |
    | T | Always 1 | t=1 | t=1 | t=1 |
    | K1 | Author | 100 a,b,c,q; 110 a,b,c,q; 111 a,b,c,n,q; OR (in case no 100, 110 and 111 exist): 700 a,b,c,q; 710 a,b,c,q; 711 a,b,c,n,q | dc.creator; dcterms.creator; dcterms.contributor; dc.contributor | 700 a-d,f,p; 701 a-d,f,p; 710 a-h,p; 711 a-h,p; 720 a,f; 721 a,f; 702 a-d,f,g; 712 a-h,p; 722 a,f |
    | K2 | Uniform Title | 130 a,d,m,n,p,r | dcterms.alternative | 500 a,h-n,q-s |
    | K3 | Title | 240 a,d,m,n,p,r; 245 a,b,e,f,g,n,p; 242 a,b,f,g,n,p; 246 a,b,f,g,n,p; 247 a,b,f,g,n,p; 740 a,b,f,g,n,p | dc.title; dcterms.title | 200 a,c,e,h,i; 510 a,e; 512 a,e; 513 a,e,h,i; 514 a,e; 515 a,e; 516 a,e; 517 a,e; 520 a,e,h,i; 541 a,e,h,i |

    General Key Creation Information

    This section describes how the database fields are normalized and combined to create the Dedup and FRBR keys.

    The path for each Dedup and FRBR field in Primo VE uses the following format:

    <Primo VE section>/<Primo VE field name>

    For example: match/f1

    Concatenating Fields

    A key can be created by concatenating several fields, which is indicated by placing a plus sign (+) between the fields. For example: match/f1 + match/f7

    In case there are multiple entries for the same field, the system will create multiple keys. If the key is generated from more than one field, the system will create all of the possible combinations of the field's entries. For example:

    • k1: Includes the following entries: a, b

    • k2: Includes the following entries: c, d

    • Key definition: match/k1 + match/k2

    • Keys created: ac, ad, bc, and bd
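
    The combination rule in the example above amounts to a Cartesian product of the fields' entries. The following is a minimal sketch of that rule, not the Primo VE implementation; the function name combine_fields is hypothetical, and plain string concatenation without a separator is assumed because that is how the example above writes the keys.

```python
from itertools import product

def combine_fields(*field_entries):
    """Return every key formed by taking one entry from each field, in order."""
    # Assumption: entries are joined with no separator, as in "ac", "ad".
    return ["".join(combo) for combo in product(*field_entries)]

k1 = ["a", "b"]
k2 = ["c", "d"]
keys = combine_fields(k1, k2)
print(keys)  # ['ac', 'ad', 'bc', 'bd']
```

    In this sketch, a field with no entries yields no keys at all, because no complete combination can be formed.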

    Normalization Methods

    As keys are calculated, any of the following normalization methods may be applied to the field's values using the <method>(<field name>) format:

    | Normalization Method | Description |
    | FUZZY_STRING | Uses the first five words of the field's value. |
    | ROUND_NUMBER | Rounds the last digit of the field's value down to 0. Examples: 11 rounds down to 10; 199 rounds down to 190. |
    | REMOVE_COMMON_WORDS | Removes the following common words from the field's value: advertiser, anales, annual budget, annual financial report, annual report, annual reports, annual report for, annual report for the fiscal year ended, biennial report, boletin, budget, bulletin, calendar, catalogue, circular, comprehensive annual financial report, compilation of selected acts, compilation of selected acts within the jurisdiction of the committee on commerce, conference proceedings, conference record, directory, fact sheet, final budget, financial report, general catalog, jahresbericht, journal, language sciences, laws etc, legislative calendar, legislative summary, membership directory, memoires, memoria, minutes, mitteilungen, monograph, monthly bulletin, news release, newsletter, occasional paper, occasional papers, proceedings, proceedings of the annual meeting, progress report, proposed budget, publicaciones, publication, publications, rapport, rapport annuel, report, report and accounts, research report, revista, semi annual report to the congress, semiannual report to the congress, session laws, statistical report, technical bulletin, technical report, transactions, travaux, trudy, update, veroeffentlichungen, veroffentlichungen, works, year book, yearbook, telephone directory |
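
    The following sketch shows one way the three normalization methods could behave. It is based only on the descriptions above and makes several assumptions: words are split on whitespace, the value passed to round_number is a string of digits, common-word matching is case-insensitive and limited to whole words, and COMMON_WORDS holds only a short excerpt of the full list. The function names are hypothetical.

```python
import re

def fuzzy_string(value):
    """Keep only the first five whitespace-separated words."""
    return " ".join(value.split()[:5])

def round_number(value):
    """Round the last digit down to 0 (assumes a string of digits)."""
    number = int(value)
    return str(number - number % 10)

# Excerpt only; the full list appears in the table above.
COMMON_WORDS = ("annual report", "bulletin", "journal", "proceedings")

def remove_common_words(value, common=COMMON_WORDS):
    """Strip common words or phrases, longest first, then tidy whitespace."""
    result = value
    for phrase in sorted(common, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        result = re.sub(pattern, " ", result, flags=re.IGNORECASE)
    return " ".join(result.split())

print(fuzzy_string("the quick brown fox jumps over the lazy dog"))
# the quick brown fox jumps
print(round_number("11"), round_number("199"))  # 10 190
print(remove_common_words("Annual report of the society"))  # of the society
```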

    Optional Fields

    When creating a key from several fields, some fields do not have to exist to create the key and are considered optional. Optional fields are enclosed by square brackets ([]). For example: match/f1 + [match/f7]
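
    As a rough illustration of optional fields, the sketch below builds a key from single-valued fields and skips an optional field when it is absent. The function build_key and the sample value are hypothetical. The source does not state how a key built with an optional field compares against one built without it, so the sketch covers key construction only.

```python
def build_key(record, required, optional=()):
    """Concatenate field values; return None when a required field is missing."""
    parts = []
    for path in required:
        value = record.get(path)
        if not value:
            return None  # a required field is absent, so no key is created
        parts.append(value)
    for path in optional:
        value = record.get(path)
        if value:
            parts.append(value)  # optional fields contribute only when present
    return "".join(parts)

# Key definition: match/f1 + [match/f7]
record = {"match/f1": "2001012345"}  # sample value, for illustration only
print(build_key(record, ["match/f1"], ["match/f7"]))  # 2001012345
print(build_key({}, ["match/f1"], ["match/f7"]))      # None
```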

    Key Priority

    Every key has a priority, which may be one of the following: high, medium or low. Refer to the key definitions below to see how it affects the matching algorithm.

    Key Type

    Every key has a type that indicates to the system which key definitions to use for the record.
     
    For Dedup, the type is defined in the match/t field, which can contain one of the following values:

    • 1 – Indicates that the record is not a serial record.

    • 2 – Indicates that the record is a serial record.

    For FRBR, the type is defined in the frbr/t field, which can contain one of the following values:

    • 99 – Indicates that the record has been suppressed using the Prevent FRBR and/or Dedup in Discovery job on the Run a Job page (Admin > Manage Jobs and Sets > Run a Job).

    • 1 – Indicates that the record has not been suppressed and should be FRBRized.

    Dedup Key Definitions

    The following table shows how each field in Primo VE is normalized and concatenated to form each Dedup key based on the record type.

    | Complete key | Type | Description |
    | match/c5 | 1 | External system number |
    | match/f1 + match/f5 + match/f6 | 1 | LCCN + brief title + year |
    | match/f1 + FUZZY_STRING(match/f7) + match/f6 | 1 | LCCN + fuzzy title + year |
    | match/f1 + match/f7 + match/f6 | 1 | LCCN + full title + year |
    | search/ocolc_unique_normalize_number + match/f5 + match/f6 | 1 | OCLC + brief title + year |
    | search/ocolc_unique_normalize_number + FUZZY_STRING(match/f7) + match/f6 | 1 | OCLC + fuzzy title + year |
    | search/ocolc_unique_normalize_number + match/f7 + match/f9 | 1 | OCLC + full title + pagination |
    | match/f3 + match/f5 + match/f6 | 1 | ISBN + brief title + date |
    | match/f3 + FUZZY_STRING(match/f7) + match/f6 | 1 | ISBN + fuzzy title + date |
    | match/f3 + match/f7 + match/f9 | 1 | ISBN + full title + pagination |
    | match/f4 + match/f7 + match/f6 | 1 | Incorrect ISBN + full title + date |
    | match/f4 + match/f7 + match/f9 | 1 | Incorrect ISBN + full title + pagination |
    | match/f7 + match/f11 + match/f6 + match/f9 | 1 | Full title + main entry + date + pagination |
    | match/f7 + match/f11 + match/f6 + ROUND_NUMBER(match/f9) | 1 | Full title + main entry + date + fuzzy pagination |
    | match/f7 + match/f6 + match/f10 + match/f9 + [match/f11] | 1 | Full title + date + publisher + pagination + [main entry] |
    | match/f7 + match/f6 + match/f10 + ROUND_NUMBER(match/f9) + [match/f11] | 1 | Full title + date + publisher + fuzzy pagination + [main entry] |
    | match/f7 + match/f6 + match/f9 + [match/f11] | 1 | Full title + date + pagination + [main entry] |
    | match/f7 + match/f6 + ROUND_NUMBER(match/f9) + [match/f11] | 1 | Full title + date + fuzzy pagination + [main entry] |
    | match/f7 + match/f6 + match/f10 + [match/f11] | 1 | Full title + date + publisher + [main entry] |
    | match/c5 | 2 | MMS ID |
    | match/f1 + match/f8 | 2 | LCCN + brief title |
    | search/ocolc_unique_normalize_number + match/f8 | 2 | OCLC + brief title |
    | match/f3 + match/f8 | 2 | ISSN + brief title |
    | REMOVE_COMMON_WORDS(match/f7) + match/f10 + match/f9 + [match/f3] | 2 | Full title (not in common serials list) + place of publication + country of publication + ISSN (conditional) |
    | REMOVE_COMMON_WORDS(match/f7) + match/f10 + match/f9 + [match/f11] | 2 | Full title (not in common serials list) + place of publication + country of publication + main entry (conditional) |
    | REMOVE_COMMON_WORDS(match/f7) + match/f10 + match/f9 + [match/f11] + [match/f3] | 2 | Full title (not in common serials list) + date + place of publication + main heading (conditional) + ISSN (conditional) |
    | REMOVE_COMMON_WORDS(match/f7) + match/f6 + match/f11 | 2 | Full title (not in common serials list) + date + main entry |
    | match/f7 + match/f6 + match/f11 + match/f10 | 2 | Full title (can be in list of common titles) + date + main entry + place of publication |
    | REMOVE_COMMON_WORDS(match/f7) + match/f6 + match/f11 + match/f10 | 2 | Fuzzy title (not in list of common titles) + date + main entry + place of publication |
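
    The key definitions above follow a small, regular notation: terms separated by plus signs, an optional normalization method wrapped around a field path, and square brackets around optional fields. The sketch below parses that notation into its parts. It is an illustration of the notation only; the function parse_definition and the dictionary layout are hypothetical and say nothing about how Primo VE stores its definitions.

```python
def parse_definition(definition):
    """Split a key definition into terms with path, method, and optional flag."""
    terms = []
    for raw in definition.split("+"):
        token = raw.strip()
        # Square brackets mark an optional field, for example [match/f11].
        optional = token.startswith("[") and token.endswith("]")
        if optional:
            token = token[1:-1].strip()
        # METHOD(path) marks a normalization method applied to the field.
        method = None
        if token.endswith(")") and "(" in token:
            method, token = token[:-1].split("(", 1)
            method = method.strip()
        terms.append({"path": token.strip(), "method": method, "optional": optional})
    return terms

definition = "match/f7 + match/f6 + ROUND_NUMBER(match/f9) + [match/f11]"
for term in parse_definition(definition):
    print(term)
```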

    FRBR Key Definitions

    The following table shows how each field in Primo VE is normalized and concatenated to form each FRBR key. The match/group key is not generated from any MARC21 fields. Instead, it is built dynamically from the result of the Dedup calculation, which ensures that every record that is deduped is also FRBRized (in particular, any dedup group is entirely contained within a FRBR group).

    | Complete key | Type | Description | Priority |
    | match/group | 1 | Dedup key | HIGH |
    | frbr/k2 | 1 | Uniform Title only | MEDIUM |
    | frbr/k1 + frbr/k3 | 1 | Author + title | NONE |
    | match/group | 99 | Dedup key | HIGH |
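
    To make the effect of the frbr/t type concrete, the sketch below stores the FRBR key definitions listed above as plain data and selects the ones that apply to a given type. The tuple layout and the function frbr_definitions are hypothetical; the definitions, types, and priorities are copied from the table.

```python
# (complete key, type, priority), copied from the FRBR key definitions above.
FRBR_KEY_DEFINITIONS = [
    ("match/group", 1, "HIGH"),
    ("frbr/k2", 1, "MEDIUM"),
    ("frbr/k1 + frbr/k3", 1, "NONE"),
    ("match/group", 99, "HIGH"),
]

def frbr_definitions(frbr_type):
    """Return (complete key, priority) pairs that apply to the given frbr/t value."""
    return [(key, priority) for key, t, priority in FRBR_KEY_DEFINITIONS if t == frbr_type]

print(frbr_definitions(1))
# [('match/group', 'HIGH'), ('frbr/k2', 'MEDIUM'), ('frbr/k1 + frbr/k3', 'NONE')]
print(frbr_definitions(99))
# [('match/group', 'HIGH')]
```

    For a suppressed record (type 99), only the match/group key remains, so the uniform title and author + title keys are never generated for it.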

    Calculating the Group ID

    The system hashes each key into a number and then adds a coefficient, which is based on the key's priority, to ensure that keys with higher priorities have larger numbers. After a group ID is chosen for a specific record, all of the keys that were calculated from that record are saved with the group ID so that all of the keys point to that group ID. As a result, records that have identical keys receive the same group ID.
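
    One way to picture the hash-plus-coefficient idea is sketched below. The hash function (SHA-256 truncated to 32 bits), the rank values, and the function name key_value are all assumptions made for illustration; the source does not say which hash or coefficients Primo VE uses. The sketch only shows how a priority-based offset guarantees that a higher-priority key always produces a larger number.

```python
import hashlib

HASH_BITS = 32  # assumption: the hash is reduced to a 32-bit number
PRIORITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}  # assumed ranks

def key_value(key, priority):
    """Hash the key to a number and add a priority-based coefficient."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    base = int.from_bytes(digest[:4], "big")  # 0 .. 2**32 - 1
    # The coefficient exceeds any possible hash value, so priority dominates.
    return PRIORITY_RANK[priority] * (1 << HASH_BITS) + base

high = key_value("some dedup key", "HIGH")
medium = key_value("some uniform title", "MEDIUM")
low = key_value("some author and title", "LOW")
print(high > medium > low)  # True
```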

    Dedup/FRBR Process Flow

    The system performs the following steps for both Dedup and FRBR processes, starting with the Dedup process:

    1. Calculate all of the keys from the source record.

    2. Find all of the keys that were calculated in step 1 that already exist in the database (from previous calculations). If there are keys, the system selects the group ID that is associated with the key that has the highest value (this prioritizes keys with higher priority). If the record matches more than one group and the keys have the same priority, the system will select the group ID randomly. If there is no existing key in the database, the system selects the key with the highest value and uses it as the group ID.

    3. For all keys calculated in step 1, the system will store the selected group ID. This allows other records to connect to recently processed records.
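
    The three steps above can be sketched as follows. This is a simplified model, not the Primo VE implementation: the function assign_group is hypothetical, each key is assumed to already carry its numeric value (hash plus priority coefficient), and an in-memory dictionary stands in for the database. Ties are resolved by taking the first maximum rather than randomly, and existing entries for a key are overwritten in step 3, which the source does not specify.

```python
def assign_group(record_keys, key_index):
    """record_keys maps key -> numeric value; key_index maps key -> group ID."""
    if not record_keys:
        return None
    # Step 2: look for keys that already exist from previous calculations.
    existing = [key for key in record_keys if key in key_index]
    if existing:
        # Use the group of the matching key with the highest value.
        best = max(existing, key=lambda key: record_keys[key])
        group_id = key_index[best]
    else:
        # No match: the highest key value becomes the new group ID.
        group_id = max(record_keys.values())
    # Step 3: store the selected group ID for all of the record's keys.
    for key in record_keys:
        key_index[key] = group_id
    return group_id

index = {}
print(assign_group({"k1": 10, "k2": 30}, index))  # 30
print(assign_group({"k2": 30, "k3": 50}, index))  # 30
print(assign_group({"k9": 5}, index))             # 5
```

    The second record joins the first record's group through the shared key k2, and its other key k3 is then stored with the same group ID so that later records can connect through it.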

    The algorithm is not fully transitive, which means that two records may have a matching common record, but are not considered matches themselves. Because of this, the order in which the records are processed can affect the groupings. For more information, see Dedup and Transitivity.

    Dedup and Transitivity

    Because the Dedup algorithm takes many data elements into account when matching records, it is not fully transitive. Transitivity means that if A equals B and B equals C, then A must also equal C. As a result, the algorithm may not create a single group for records A, B, and C even when the transitive rule would place them together. For example, transitivity is not supported in the following scenario:

    1. Record B is added to the database and no match is found.

    2. Record C is added to the database and none of its keys match record B.

    3. Record A is added to the database and one of its keys matches a key in record B first. Because a match was found, the system stops processing the record and returns the following groups, even if record A also matches record C: AB and C.

    The Dedup Analysis Tool checks whether two records match based on the records themselves, not on what is actually matched and deduped in the database. In the above scenario, it shows that record A equals record B, record A equals record C, and record B does not equal record C.
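
    The scenario above can be reproduced with a small simulation. This is a deliberately simplified model: the function dedup and the key names kb and kc are hypothetical, key priorities are ignored, and a record joins the group of the first of its keys that is already known. It is meant only to show why processing order affects the resulting groups.

```python
def dedup(records):
    """records: list of (name, keys) in processing order; returns name -> group."""
    key_to_group = {}
    groups = {}
    for name, keys in records:
        # Stop at the first key that already belongs to a group.
        matched = next((key_to_group[k] for k in keys if k in key_to_group), None)
        group = matched if matched is not None else name
        groups[name] = group
        for k in keys:
            key_to_group[k] = group
    return groups

# Record A shares key "kb" with B and key "kc" with C; B and C share nothing.
order_bca = [("B", ["kb"]), ("C", ["kc"]), ("A", ["kb", "kc"])]
order_abc = [("A", ["kb", "kc"]), ("B", ["kb"]), ("C", ["kc"])]

print(dedup(order_bca))  # {'B': 'B', 'C': 'C', 'A': 'B'}
print(dedup(order_abc))  # {'A': 'A', 'B': 'A', 'C': 'A'}
```

    When B and C are processed before A, the result is the groups AB and C. When A is processed first, all three records end up in one group.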

    Suppressing FRBR/Dedup Groups

    You can use the following methods to prevent records from being deduped or FRBRized:

    • Define suppression rules, which are based on types of records (such as images), which are applied during re-indexing when new records are loaded or when an existing record's metadata is modified. For more details, see Suppressing Groups of Records from Dedup and FRBR. If you want to suppress existing records, see the next option.

    • Run the Prevent FRBR and/or Dedup in Discovery job on a set of bibliographic records (Admin > Manage Jobs and Sets > Run a Job). For more information, see Running Manual Jobs on Defined Sets.

    If you want to prevent the display of FRBR and Dedup groups for all records without changing any rules or running any jobs, you can disable FRBR/Dedup by configuring the Enable Dedup and Enable FRBR fields on the Brief Results tab on the View Configuration page (Configuration Menu > Discovery > Display Configuration > Configure Views). For more information, see Brief Results Tab.
