
    Understanding the Dedup and FRBR Processes (Primo VE)

    If you are working on Primo and not Primo VE, see FRBRization and Duplicate Detection Process for more details.


    This topic describes the Dedup and FRBR processes, which are used by Primo VE to combine or group redundant records in the brief results.

    The Dedup and FRBR Test Utility page (Configuration Menu > Discovery > Other > Dedup and FRBR Test Utility) enables you to compare and analyze records to see whether they meet the criteria (described in the following sections) to be considered duplicate records. For more information, see Dedup and FRBR Analysis Tool for Primo VE.

    Duplication Detection (Dedup)

    The duplication detection or matching process is based on creating a dedup vector for every Alma record (and all imported records from external sources). The vector includes all the data required by the Duplication Detection algorithm to determine if two records are equivalent. The vectors include one or more keys that identify the record. Keys are based on the metadata fields in the source record (MARC or DC records).

    In the search results, records that are marked as duplicates are displayed as a single record. The metadata is displayed from the first record in the results set, and the delivery-related information is created from all the records in the group.

    FRBR

    Primo VE records are grouped using the principles in the Functional Requirements for Bibliographic Records published by the IFLA Study Group on the Functional Requirements for Bibliographic Records. The grouping process creates a vector for every record. This vector includes one or more keys that identify the group it represents. Keys are based on the metadata fields in the source record (MARC or DC records).

    Records that have a matching key (Primo VE attempts to match all keys in the record) are added to an FRBR group and assigned an FRBR ID. Because records can belong to one FRBR group only, Primo VE terminates the grouping process for a record as soon as it is matched with a group.

    The group IDs are indexed and used during searches to group search results by the same group ID. A record that belongs to an FRBR group displays the Multiple versions exist. See all versions link in the brief results, which enables users to see all records that have been grouped together. The record that displays in the brief results can be configured to display either information for a preferred record or generic information that pertains to all records in the group. After a search is performed, results are grouped by FRBR ID. After the user selects the Multiple versions exist. See all versions link, the results are grouped by Dedup ID.

    How Dedup and FRBR Work Together

    The Dedup and FRBR matching processes are very similar, but the content of the keys is different. The system first looks for duplicate records and then looks for FRBR groups. Because the Dedup ID is used as a key for the FRBR process, a record that is Deduped is also FRBRized.

    Calculating the Keys

    The Dedup and FRBR keys determine whether records match. Specific data elements are taken from the bibliographic record and then combined into keys. The sections below first describe how the key information in the source record fields is mapped to the database in Primo VE and then describe how the elements are combined to create the Dedup and FRBR keys.

    Mapping Dedup Key Fields

    Per the MARC21, UNIMARC, and DC data formats, the following table shows how the fields from each source record are mapped and stored in Primo VE for Dedup keys.

    | Key Field | Content | Type | MARC 21 Field | DC Field | UNIMARC Field |
    |---|---|---|---|---|---|
    | C5 | Control nr | 1+2 | 035 a,z | MMS ID (Alma data); Not applicable for external data | 035 a,z |
    | F1 | LCCN | 1+2 | 010 a | dcterms:identifier dcterms:LCCN | N/A |
    | F3 | ISBN (13 or 10 digits). In cases in which a record has more than one ISBN, an F3 key is generated for each ISBN instead of appending them together. | 1 | 020 a,e; 776 z | dcterms:identifier dcterms:ISBN | 010 a |
    | F3 | ISSN | 2 | 022 a,e; 776 x | dcterms:identifier dcterms:ISSN; dcterms:identifier dc:ISSN | 011 a |
    | F4 | ISBN_invalid | 1 | 020 z | N/A | 010 z |
    | F4 | ISSN_invalid | 2 | 022 y | N/A | 011 y |
    | F5 | Brief Title | 1 | 245 a,b,n,p | dc.title; dcterms.title | 200 a |
    | F5 | ISSN_cancelled | 2 | 022 z | N/A | 011 z |
    | F6 | Start publication year | 1+2 | 008 (position 7-10); 260 c; 264 c | dc.date; dcterms.date; dcterms.created | 210 d; 100 a (position 9-16) |
    | F7 | Full title | 1+2 | 245 a,b,n,p | dc.title; dcterms.title | 200 a,e,d,h,i |
    | F8 | Country of publication | 1 | 008 (position 15-17) | N/A | LDR position 7 = m or c AND MARC is "102"."a" |
    | F8 | Brief title | 2 | 245 a | dc.title; dcterms.title | LDR position 7 = a (component part) or i (integrating resource) or s (serial) AND MARC is "200"."a" |
    | F9 | Pagination | 1 | 300 a | N/A | LDR position 7 = m or c AND MARC is "215"."a" |
    | F9 | Country of publication | 2 | 008 (position 15-17) | N/A | LDR position 7 = a (component part) or i (integrating resource) or s (serial) AND MARC.control is "102" "a" |
    | F10 | Publisher | 1 | 260 b; 264 b | dcterms.publisher; dc.publisher | LDR position 7 = m or c AND MARC is "210"."c" |
    | F10 | Place of publication | 2 | 260 a; 264 a | N/A | LDR position 7 = a (component part) or i (integrating resource) or s (serial) AND MARC is "200"."a" |
    | F11 | Main entry (author, corporate body, meeting) | 1 | 100 a,b,c,d,q; 111 a,c,d,e,n,q | dc.creator; dcterms.creator; dcterms.contributor; dc.contributor | LDR position 7 = m or c AND 700 a,b,c,d,f; 710 1st ind. = 1 a-h |
    | F11 | Main entry (author, corporate body, meeting) | 2 | 110 a,b,c,d,e,n; 111 a,c,d,e,n,q; 130 a,d,l,m,n,o,p,r,s,t | dc.creator; dcterms.creator; dcterms.contributor; dc.contributor | LDR position 7 = a (component part) or i (integrating resource) or s (serial) AND 710 1st ind. = 0 a,b,c,g,h; 710 1st ind. = 1 a-h; 500 a,b,h,i,k,l,m |
    | F13 | Control Number | 1+2 | 001 | N/A | 001 |

    Mapping FRBR Key Fields

    Per the MARC21, UNIMARC, and DC data formats, the following table shows how the fields from each source record are mapped and stored in Primo VE for FRBR keys.

    | Field ID | Field Content | MARC21 Field | DC Field | UNIMARC Field |
    |---|---|---|---|---|
    | T | Always 1 | t=1 | t=1 | t=1 |
    | K1 | Author | 100 a, b, c, q; 110 a, b, c, q; 111 a, b, c, n, q; OR (in case no 100, 110 and 111 exist): 700 a, b, c, q; 710 a, b, c, q; 711 a, b, c, n, q | dc.creator; dcterms.creator; dcterms.contributor; dc.contributor | 700 a-d,f,p; 701 a-d,f,p; 710 a-h,p; 711 a-h,p; 720 a,f; 721 a,f; 702 a-d,f,g; 712 a-h,p; 722 a,f |
    | K2 | Uniform Title | 130 a, d, m, n, p, r | dcterms.alternative | 500 a,h-n,q-s |
    | K3 | Title | 240 a, d, m, n, p, r; 245 a, b, e, f, g, n, p; 242 a, b, f, g, n, p; 246 a, b, f, g, n, p; 247 a, b, f, g, n, p; 740 a, b, f, g, n, p | dc.title; dcterms.title | 200 a,c,e,h,i; 510 a,e; 512 a,e; 513 a,e,h,i; 514 a,e; 515 a,e; 516 a,e; 517 a,e; 520 a,e,h,i; 541 a,e,h,i |
    | K4 | Language | Based on the subfields provided from each 041 field, the languages are sorted, concatenated, and delimited with a hyphen as follows: (1) If subfield a is included, use all a subfields only. (2) Else if subfield d is included, use all d subfields only. (3) Else if subfield e is included, use all e subfields only. Otherwise, use 008 position 35-37. | dc.language; dcterms.language. If multiple languages are specified, use only the last value taken. | 101 a. If multiple languages are specified, use only the last value taken. |
    | K5 | Control Number | 001 | N/A | 001 |
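
    The K4 (Language) rule for MARC21 records can be sketched as follows. This is only an illustration of the precedence described above (subfield a, then d, then e, then the 008 fallback). The function name and the (code, value) input shape are assumptions made for this example and are not part of Primo VE.

```python
def language_key(subfields_041, lang_008=None):
    """Build a K4-style language value.

    subfields_041: list of (subfield_code, language_code) pairs collected
                   from all 041 fields of a record (hypothetical input shape).
    lang_008:      language code taken from 008 positions 35-37, if any.
    """
    # Use the first subfield code (a, then d, then e) that has any values.
    for code in ("a", "d", "e"):
        values = [value for sub, value in subfields_041 if sub == code]
        if values:
            # Languages are sorted, then joined with a hyphen.
            return "-".join(sorted(values))
    # No usable 041 subfield: fall back to the 008 language code.
    return lang_008


print(language_key([("a", "fre"), ("a", "eng"), ("d", "ger")]))  # eng-fre
```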

    General Key Creation Information

    This section describes how the database fields are normalized and combined to create the Dedup and FRBR keys.

    The path for each Dedup and FRBR field in Primo VE uses the following format:

    <Primo VE section>/<Primo VE field name>

    Examples: 

    • Dedup: match/f1

    • FRBR: frbr/k1

    Concatenating Fields

    A key can be created by concatenating several fields, which is indicated by placing a plus sign (+) between the fields.

    Example for Dedup: 

    match/f1 + match/f7

    For each Field ID, the system uses only the first available MARC field. For instance, in FRBR K3, if a MARC 240 field exists, only that field is used to create FRBR K3, and all other fields (245, 242, and so forth) are not considered. If a field has multiple entries, the system creates multiple keys. If the key is generated from more than one field, the system creates all the possible combinations of the fields' entries. For the Dedup example above:

    • f1: Includes the following entries: a, b

    • f7: Includes the following entries: c, d

    • Key definition: match/f1 + match/f7

    • Keys created: ac, ad, bc, and bd
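
    The combination rule in the example above amounts to a Cartesian product of the fields' entries. The following minimal sketch reproduces the four keys. The function name build_keys is an assumption for illustration only.

```python
from itertools import product


def build_keys(*field_entries):
    """Return every key formed by taking one entry from each field, in order."""
    return ["".join(combo) for combo in product(*field_entries)]


f1 = ["a", "b"]  # entries stored in match/f1
f7 = ["c", "d"]  # entries stored in match/f7

# Key definition: match/f1 + match/f7
print(build_keys(f1, f7))  # ['ac', 'ad', 'bc', 'bd']
```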

    Normalization Methods

    As keys are calculated, any of the following normalization methods may be applied to the field's values using the <method>(<field name>) format:

    | Normalization Method | Description |
    |---|---|
    | FUZZY_STRING | Uses the first five words of the field's value. |
    | ROUND_NUMBER | Rounds the last digit of the field's value down to 0. Examples: 11 rounds down to 10; 199 rounds down to 190. |
    | REMOVE_COMMON_WORDS | Removes the following common words from the field's value: advertiser, anales, annual budget, annual financial report, annual report, annual reports, annual report for, annual report for the fiscal year ended, biennial report, boletin, budget, bulletin, calendar, catalogue, circular, comprehensive annual financial report, !compilation of selected acts, compilation of selected acts within the jurisdiction of the committee on commerce, conference proceedings, conference record, directory, fact sheet, final budget, financial report, general catalog, jahresbericht, journal, language sciences, laws, etc, legislative calendar, legislative summary, membership directory, memoires, memoria, minutes, mitteilungen, monograph, monthly bulletin, news release, newsletter, occasional paper, occasional papers, proceedings, proceedings of the annual meeting, progress report, proposed budget, publicaciones, publication, publications, rapport, rapport annuel, report, report and accounts, research report, revista, semi annual report to the congress, semiannual report to the congress, session laws, statistical report, technical bulletin, technical report, transactions, travaux, trudy, update, veroeffentlichungen, veroffentlichungen, works, year book, yearbook, telephone directory |
    | SPLIT | When a record has multiple identifiers (such as ISSN/ISBN), this method creates separate keys for each identifier so that the Dedup process can match other records that have one or more of the same identifiers. Otherwise, duplicate records would have to match all the identifiers. |
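
    The normalization methods can be approximated with the sketch below. It follows the descriptions above but is not the Primo VE implementation: the function names, the abridged common-word list, the lowercasing before word removal, and the assumption that ROUND_NUMBER receives a purely numeric string are all choices made for this example.

```python
import re

# Abridged sample of the common-word list (assumption: the real list is longer).
COMMON_WORDS = ("annual report", "report", "bulletin", "journal", "proceedings")


def fuzzy_string(value):
    """FUZZY_STRING: keep only the first five words."""
    return " ".join(value.split()[:5])


def round_number(value):
    """ROUND_NUMBER: round the last digit down to 0 (numeric input assumed)."""
    return str(int(value) // 10 * 10)


def remove_common_words(value, common=COMMON_WORDS):
    """REMOVE_COMMON_WORDS: drop listed words/phrases (longest phrases first)."""
    result = value.lower()  # lowercasing is an assumption of this sketch
    for phrase in sorted(common, key=len, reverse=True):
        result = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", result)
    return " ".join(result.split())


def split_keys(identifiers, rest_of_key):
    """SPLIT: one key per identifier instead of one key with all identifiers."""
    return [identifier + rest_of_key for identifier in identifiers]


print(round_number("11"), round_number("199"))  # 10 190
```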

    Optional Fields

    When creating a key from several fields, some fields do not have to exist to create the key and are considered optional. Optional fields are enclosed by square brackets ([]). For example: match/f1 + [match/f7]
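
    One way to read the optional-field rule is sketched below: a missing required field means no key is created, whereas a missing optional field is simply left out of the key. This reading, the function name, and the (field, optional) representation are assumptions for illustration.

```python
from itertools import product


def build_keys(definition, record):
    """Build keys for one key definition.

    definition: list of (field_name, is_optional) pairs, in key order.
    record:     dict mapping field names to lists of entries.
    """
    parts = []
    for field, is_optional in definition:
        entries = record.get(field, [])
        if not entries:
            if is_optional:
                continue  # optional field missing: build the key without it
            return []     # required field missing: no key can be created
        parts.append(entries)
    return ["".join(combo) for combo in product(*parts)]


# Key definition: match/f1 + [match/f7]
definition = [("match/f1", False), ("match/f7", True)]

print(build_keys(definition, {"match/f1": ["x"], "match/f7": ["t"]}))  # ['xt']
print(build_keys(definition, {"match/f1": ["x"]}))                     # ['x']
print(build_keys(definition, {"match/f7": ["t"]}))                     # []
```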

    Key Priority

    Every key has a priority, which may be one of the following: high, medium, low, or none (less than low). Refer to the key definitions below to see how it affects the matching algorithm.

    Key Type

    Every key has a type that indicates to the system which key definitions to use for the record.
     
    For Dedup, the type is defined in the match/t field, which can contain one of the following values:

    • 1 – Indicates that the record is not a serial record.

    • 2 – Indicates that the record is a serial record.

    For FRBR, the type is defined in the frbr/t field, which can contain one of the following values:

    • 99 – Indicates that the record has been suppressed using the Prevent FRBR and/or Dedup in Discovery job on the Run a Job page (Admin > Manage Jobs and Sets > Run a Job).

    • 1 – Indicates that the record has not been suppressed and should be FRBRized.

    Dedup Key Definitions

    The following table shows how each field in Primo VE is normalized and concatenated to form each Dedup key based on the record type.

    | Complete key | Type | Description |
    |---|---|---|
    | match/c5 | 1 | External system number |
    | match/f1 + match/f5 + match/f6 | 1 | LCCN + brief title + year |
    | match/f1 + FUZZY_STRING(match/f7) + match/f6 | 1 | LCCN + fuzzy title + year |
    | match/f1 + match/f7 + match/f6 | 1 | LCCN + full title + year |
    | match/f3 + match/f5 + match/f6 | 1 | ISBN + brief title + date |
    | match/f3 + FUZZY_STRING(match/f7) + match/f6 | 1 | ISBN + fuzzy title + date |
    | match/f3 + match/f7 + match/f9 | 1 | ISBN + full title + pagination |
    | SPLIT(match/f3) + match/f5 + match/f6 | 1 | ISBN1 + brief title + date; ISBN2 + brief title + date |
    | SPLIT(match/f3) + FUZZY_STRING(match/f7) + match/f6 | 1 | ISBN1 + fuzzy title + date; ISBN2 + fuzzy title + date |
    | SPLIT(match/f3) + match/f7 + match/f9 | 1 | ISBN1 + full title + pagination; ISBN2 + full title + pagination |
    | match/f4 + match/f7 + match/f6 | 1 | Incorrect ISBN + full title + date |
    | match/f4 + match/f7 + match/f9 | 1 | Incorrect ISBN + full title + pagination |
    | SPLIT(match/f4) + match/f7 + match/f6 | 1 | Incorrect ISBN1 + full title + date; Incorrect ISBN2 + full title + date |
    | SPLIT(match/f4) + match/f7 + match/f9 | 1 | Incorrect ISBN1 + full title + pagination; Incorrect ISBN2 + full title + pagination |
    | match/f7 + match/f11 + match/f6 + match/f9 | 1 | Full title + main entry + date + pagination |
    | match/f7 + match/f11 + match/f6 + ROUND_NUMBER(match/f9) | 1 | Full title + main entry + date + fuzzy pagination |
    | match/f7 + match/f6 + match/f10 + match/f9 + [match/f11] | 1 | Full title + date + publisher + pagination + [main entry] |
    | match/f7 + match/f6 + match/f10 + ROUND_NUMBER(match/f9) + [match/f11] | 1 | Full title + date + publisher + fuzzy pagination + [main entry] |
    | match/f7 + match/f6 + match/f9 + [match/f11] | 1 | Full title + date + pagination + [main entry] |
    | match/f7 + match/f6 + ROUND_NUMBER(match/f9) + [match/f11] | 1 | Full title + date + fuzzy pagination + [main entry] |
    | match/f7 + match/f6 + match/f10 + [match/f11] | 1 | Full title + date + publisher + [main entry] |
    | match/c5 | 2 | MMS Id |
    | match/f1 + match/f8 | 2 | LCCN + brief title |
    | match/f3 + match/f8 | 2 | ISSN + brief title |
    | SPLIT(match/f3) + match/f8 | 1 | ISSN1 + brief title; ISSN2 + brief title |
    | REMOVE_COMMON_WORDS(match/f7) + match/f10 + match/f9 + [match/f11] | 2 | Full title (not in common serials list) + place of publication + country of publication + Main entry conditional |
    | REMOVE_COMMON_WORDS(match/f7) + match/f10 + match/f9 + [match/f11] + [SPLIT(match/f3)] | 2 | Full title (not in common serials list) + place of publication + main heading (conditional) + ISSN1 conditional; Full title (not in common serials list) + place of publication + main heading (conditional) + ISSN2 conditional |
    | REMOVE_COMMON_WORDS(match/f7) + match/f6 + match/f11 | 2 | Full title (not in common serials list) + date + main entry |
    | match/f7 + match/f6 + match/f11 + match/f10 | 2 | Full title (can be in list of common titles) + date + main entry + place of publication |
    | REMOVE_COMMON_WORDS(match/f7) + match/f6 + match/f11 + match/f10 | 2 | Fuzzy title (not in list of common titles) + date + main entry + place of publication |
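
    The key definitions above share one notation: terms are joined by plus signs, optional terms are wrapped in square brackets, and a normalization method wraps the field name in parentheses. The sketch below parses that notation into its parts. It is an illustrative helper only. The function name and output shape are hypothetical, and Primo VE does not expose such a parser.

```python
import re

# One term: optional METHOD( ... ) wrapper around a section/field path.
TERM = re.compile(r"^(?:(?P<method>[A-Z_]+)\()?(?P<field>[a-z]+/[a-z0-9]+)\)?$")


def parse_key_definition(definition):
    """Split a key definition into terms with field, method, and optional flag."""
    terms = []
    for raw in definition.split("+"):
        raw = raw.strip()
        optional = raw.startswith("[") and raw.endswith("]")
        if optional:
            raw = raw[1:-1].strip()  # drop the square brackets
        match = TERM.match(raw)
        if match is None:
            raise ValueError(f"unrecognized term: {raw!r}")
        terms.append(
            {"field": match["field"], "method": match["method"], "optional": optional}
        )
    return terms


for term in parse_key_definition(
    "match/f7 + match/f6 + ROUND_NUMBER(match/f9) + [match/f11]"
):
    print(term)
```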

    FRBR Key Definitions

    The following table shows how each field in Primo VE is normalized and concatenated to form each FRBR key. The match/group key is not generated from any MARC21 fields. Instead, it is built dynamically from the result of the Dedup calculation, which ensures that every record that is deduped is also FRBRized (in particular, any dedup group is entirely contained within a FRBR group).

    | Complete key | Type | Description | Priority |
    |---|---|---|---|
    | match/group | 1 | Dedup key | HIGH |
    | frbr/k2 | 1 | Uniform Title only | MEDIUM |
    | frbr/k1 + frbr/k3 | 1 | Author + title | NONE |
    | match/group | 99 | Dedup key | HIGH |

    Calculating the Group ID

    The system hashes each key into a number and then adds a coefficient, which is based on the key's priority, to ensure that keys with higher priorities have larger numbers. After a group ID is chosen for a specific record, all the keys that were calculated from that record are saved with the group ID to ensure that all the keys point to that group ID. This allows records that have identical keys to be assigned the same group ID.
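
    The idea of a hash plus a priority coefficient can be illustrated as follows. The hash function, the coefficient values, and the 32-bit range are assumptions chosen for this sketch, because the actual values used by Primo VE are not documented here. The only property shown is that a key with a higher priority always receives a larger number than a key with a lower priority.

```python
import hashlib

# Hypothetical coefficients: one "band" of numbers per priority level.
PRIORITY_COEFFICIENT = {"none": 0, "low": 1, "medium": 2, "high": 3}
HASH_RANGE = 2**32  # the hash part is always smaller than one band


def key_value(key, priority):
    """Hash the key to a number and shift it into the band for its priority."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    base = int.from_bytes(digest[:4], "big")  # 0 <= base < 2**32
    return PRIORITY_COEFFICIENT[priority] * HASH_RANGE + base


# A HIGH key always outranks a MEDIUM or NONE key, whatever the hash values are.
print(key_value("dedup-123", "high") > key_value("some uniform title", "medium"))  # True
```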

    Dedup/FRBR Process Flow

    The system performs the following steps for both Dedup and FRBR processes, starting with the Dedup process:

    1. Calculate all the keys from the source record.

    2. Find all the keys that were calculated in step 1 that already exist in the database (from previous calculations). If there are keys, the system selects the group ID that is associated with the key that has the highest value (this prioritizes keys with higher priority). If the record matches more than one group and the keys have the same priority, the system selects the group ID randomly. If there is no existing key in the database, the system selects the key with the highest value and uses it as the group ID.

    3. For all keys calculated in step 1, the system stores the selected group ID. This enables other records to connect to recently processed records.
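
    The three steps above can be sketched with an in-memory dictionary standing in for the database. The function name and data shapes are assumptions for illustration. Ties between equal key values are resolved by whichever key Python's max encounters first, whereas the text describes the real choice as random.

```python
def assign_group(record_keys, key_store):
    """Choose a group ID for one record and store it for all of its keys.

    record_keys: dict mapping each key of the record (step 1) to its numeric value.
    key_store:   dict mapping previously stored keys to group IDs (the "database").
    """
    # Step 2: look for keys that already exist from earlier records.
    existing = [key for key in record_keys if key in key_store]
    if existing:
        # Follow the existing key with the highest value (highest priority).
        best = max(existing, key=lambda key: record_keys[key])
        group_id = key_store[best]
    else:
        # No match: the highest key value becomes the new group ID.
        group_id = max(record_keys.values())
    # Step 3: every key of this record now points to the selected group.
    for key in record_keys:
        key_store[key] = group_id
    return group_id


store = {}
print(assign_group({"k1": 10, "k2": 50}, store))  # 50 (new group)
print(assign_group({"k1": 10, "k3": 70}, store))  # 50 (joined through k1)
```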

    The algorithm is not fully transitive, which means that two records may have a matching common record but are not considered matches themselves. Because of this, the order in which the records are processed can affect the groupings. For more information, see Dedup and Transitivity.

    Dedup and Transitivity

    Because the Dedup algorithm takes many data elements into account when matching records, it is not fully transitive. Under a transitive rule, if A equals B and B equals C, then A must equal C. The algorithm, however, may not create a single group for records A, B, and C even if the transitive rule of equality applies to them. For example, transitivity is not supported in the following scenario:
    1. Record B is added to the database and no match is found.
    2. Record C is added to the database and none of its keys match record B.
    3. Record A is added to the database and one of its keys matches a key in record B first. Because a match was found, the system stops processing the record and returns the following groups even if record A also matches record C: AB and C.
    The Dedup Analysis Tool checks whether two records match based on the records themselves, not on what is actually matched and deduped in the database. In the above scenario, it shows that record A equals record B, record A equals record C, and record B is not equal to record C.
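
    The scenario can be reproduced with a small simulation. The records, key names, and first-match rule below are simplified assumptions for illustration: record A shares one key with B and another with C, and each group is labeled with the name of the record that created it.

```python
def process(records):
    """Group records in arrival order, stopping at the first matching key."""
    key_to_group = {}
    record_to_group = {}
    for name, keys in records:
        # Take the group of the first key already stored, else start a new group.
        group = next((key_to_group[k] for k in keys if k in key_to_group), name)
        record_to_group[name] = group
        for k in keys:
            key_to_group[k] = group
    return record_to_group


def pairwise_match(keys_x, keys_y):
    """Record-to-record comparison, independent of what is stored."""
    return bool(set(keys_x) & set(keys_y))


keys = {"A": ["k1", "k2"], "B": ["k1"], "C": ["k2"]}
order = ["B", "C", "A"]  # processing order from the scenario

print(process([(name, keys[name]) for name in order]))
# {'B': 'B', 'C': 'C', 'A': 'B'}  -> groups AB and C
print(pairwise_match(keys["A"], keys["C"]))  # True, although A and C are not grouped
```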

    Suppressing FRBR/Dedup Groups

    You can use the following methods to prevent records from being deduped or FRBRized:

    • Define suppression rules, which are based on types of records (such as images). These rules are applied during re-indexing when new records are loaded or when an existing record's metadata is modified. For more details, see Suppressing Groups of Records from Dedup and FRBR. If you want to suppress existing records, see the next option.

    • Run the Prevent FRBR and/or Dedup in Discovery job on a set of bibliographic records (Admin > Manage Jobs and Sets > Run a Job). For more information, see Running Manual Jobs on Defined Sets.

    If you want to prevent the display of FRBR and Dedup groups for all records without changing any rules or running any jobs, you can disable FRBR/Dedup by configuring the Enable Dedup and Enable FRBR fields on the Brief Results tab on the View Configuration page (Configuration Menu > Discovery > Display Configuration > Configure Views). For more information, see Brief Results Tab.

    Recalculating Dedup and FRBR Groups

    FRBR and Dedup keys, which are built initially at record creation and/or migration, may need to be recalculated after you modify a bibliographic record. For information on how to recalculate your keys, see Recalculate Dedup and FRBR Groups.

    Dedup and FRBR groups may not be recalculated after the following occur:

    • A bibliographic record is re-indexed using the Recalculate Local Resource Types job.

    • The semi-annual re-indexing process has completed.
