Understanding the Dedup and FRBR Processes (Primo VE)
If you are working on Primo and not Primo VE, see FRBRization and Duplicate Detection Process for more details.
This topic describes the Dedup and FRBR processes, which are used by Primo VE to combine or group redundant records in the brief results.
The Dedup and FRBR Test Utility page (Configuration Menu > Discovery > Other > Dedup and FRBR Test Utility) enables you to compare and analyze records to see whether they meet the criteria (described in the following sections) to be considered duplicate records. For more information, see Dedup and FRBR Analysis Tool for Primo VE.
Duplication Detection (Dedup)
The duplication detection or matching process is based on creating a dedup vector for every Alma record (and all imported records from external sources). The vector includes all the data required by the Duplication Detection algorithm to determine if two records are equivalent. The vectors include one or more keys that identify the record. Keys are based on the metadata fields in the source record (MARC or DC records).
In the search results, records that are marked as duplicates are displayed as a single record. The metadata is displayed from the first record in the results set, and the delivery-related information is created from all the records in the group.
FRBR
Primo VE records are grouped using the principles in the Functional Requirements for Bibliographic Records published by the IFLA Study Group on the Functional Requirements for Bibliographic Records. The grouping process creates a vector for every Alma record (and all imported records from external sources). This vector includes one or more keys that identify the group it represents. Keys are based on the metadata fields in the source record (MARC or DC records).
Records that have a matching key (Primo VE attempts to match all keys in the record) are added to a FRBR group and assigned a FRBR ID. Because records can belong to one FRBR group only, Primo VE terminates the grouping process for a record as soon as it is matched with a group.
The group IDs are indexed and used during searches to group search results by the same group ID. A record that belongs to a FRBR group displays the Multiple versions exist. See all versions link in the brief results, which enables users to see all records that have been grouped together. The record that displays in the brief results can be configured to display either information for a preferred record or generic information that pertains to all records in the group. After performing a search, results are grouped by FRBR ID, and after selecting the Multiple versions exist. See all versions link, the results are grouped by Dedup ID.
How Dedup and FRBR Work Together
The Dedup and FRBR matching processes are very similar, but the content of the keys is different. The system first looks for duplicate records and then looks for FRBR groups. Because the Dedup ID is used as a key for the FRBR process, a record that is Deduped is also FRBRized.
Calculating the Keys
The Dedup and FRBR keys determine whether records match. They are generated by extracting specific data elements from the bibliographic record and combining them into keys. The sections below first describe how the key information in the MARC21 fields is mapped to the database in Primo VE and then describe how the elements are combined to create the Dedup and FRBR keys.
Mapping Dedup Key Fields
Per the MARC21, UNIMARC, and DC data formats, the following table shows how the fields from each source record are mapped and stored in Primo VE for Dedup keys.
For Dublin Core, the identifier keys (such as ISSN, ISBN and LCCN) must include the type of identifier as follows:
<dcterms:identifier xsi:type="dcterms:<Type_Of_Identifier>">
<Identifier>
</dcterms:identifier>
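As a rough illustration of the typed identifier format above, the following Python sketch reads the `xsi:type` attribute from each `dcterms:identifier` element. The wrapping `record` element, the sample ISBN value, and the `typed_identifiers` helper are all hypothetical; this is not how Primo VE itself parses records.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for Dublin Core terms and XML Schema instance attributes.
NS = {
    "dcterms": "http://purl.org/dc/terms/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}

# Hypothetical record fragment; the wrapper element and identifier value are made up.
SAMPLE = """
<record xmlns:dcterms="http://purl.org/dc/terms/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dcterms:identifier xsi:type="dcterms:ISBN">9780000000002</dcterms:identifier>
  <dcterms:identifier>no-type-given</dcterms:identifier>
</record>
"""


def typed_identifiers(xml_text):
    """Return (type, value) pairs for identifiers that declare an xsi:type."""
    root = ET.fromstring(xml_text)
    type_attr = "{%s}type" % NS["xsi"]
    pairs = []
    for elem in root.findall("dcterms:identifier", NS):
        declared = elem.get(type_attr)
        if declared is None:
            # Skipped in this sketch because the identifier type is required.
            continue
        # Strip the "dcterms:" prefix to leave the bare identifier type.
        pairs.append((declared.split(":", 1)[-1], (elem.text or "").strip()))
    return pairs


print(typed_identifiers(SAMPLE))  # [('ISBN', '9780000000002')]
```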
Key | Field Content | Type | MARC 21 field | DC Field | UNIMARC Field | BIBFRAME Field |
---|---|---|---|---|---|---|
C5 | Control nr | 1+2 | 035 a,z | MMS ID (Alma data). Not applicable for external data. | 035 a,z | bf:identifiedBy - bf:Local - rdf:value |
F1 | LCCN | 1+2 | 010 a | dcterms:identifier dcterms:LCCN | N/A | bf:AdminMetadata - bf:identifiedBy - bf:Lccn |
F3 | ISBN (13 or 10 digits). In cases in which a record has more than one ISBN, an F3 key is generated for each ISBN instead of appending them together. | 1 | 020 a,e; 776 z | dcterms:identifier dcterms:ISBN | 010 a | bf:identifiedBy - bf:Isbn |
F3 | ISSN | 2 | 022 a,e; 776 x | dcterms:identifier dcterms:ISSN | 011 a | bf:identifiedBy - bf:Issn |
F4 | ISBN_invalid | 1 | 020 z | N/A | 010 z | bf:identifiedBy - bf:Isbn - rdf:value - bf:status - bf:Status with URI "http://id.loc.gov/vocabulary/mstatus/incorrect" |
F4 | ISSN_invalid | 2 | 022 y | N/A | 011 y | bf:identifiedBy - bf:Issn - rdf:value - bf:status - bf:Status with URI "http://id.loc.gov/vocabulary/mstatus/cancinv" |
F5 | Brief Title | 1 | 245 a,b,n,p | dc.title | 200 a | bf:title - bf:Title |
F5 | ISSN_cancelled | 2 | 022 z | N/A | 011 z | bf:identifiedBy - bf:Issn - rdf:value - bf:status - bf:Status with URI "http://id.loc.gov/vocabulary/mstatus/cancinv" |
F6 | Start publication year | 1+2 | 008 (position 7-10); 260 c; 264 c | dc.date | 210 d | bf:provisionActivity - bf:Publication - bf:date/ bflc:simpleDate with rdf:datatype="http://id.loc.gov/datatypes/edtf" |
F7 | Full title | 1+2 | 245 a,b,n,p | dc.title | 200 a,e,d,h,i | bf:title - bf:Title |
F8 | Country of publication | 1 | 008 (position 15-17) | N/A | LDR position 7 = m or c | bf:provisionActivity - bf:Publication - bf:place |
F8 | Brief title | 2 | 245 a | dc.title | LDR position 7 = a (component part) or i (integrating resource) or s (serial) | bf:title - bf:Title |
F9 | Pagination | 1 | 300 a | N/A | LDR position 7 = m or c | N/A |
F9 | Country of publication | 2 | 008 (position 15-17) | N/A | LDR position 7 = a (component part) or i (integrating resource) or s (serial) | bf:provisionActivity - bf:Publication - bf:place |
F10 | Publisher | 1 | 260 b; 264 b | dcterms.publisher | LDR position 7 = m or c | bf:provisionActivity - bf:Publication - bflc:simpleAgent |
F10 | Place of publication | 2 | 260 a; 264 a | N/A | LDR position 7 = a (component part) or i (integrating resource) or s (serial) | bf:provisionActivity - bf:Publication - bflc:simplePlace |
F11 | Main entry (author, corporate body, meeting) | 1 | 100 a,b,c,d,q; 111 a,c,d,e,n,q | dc.creator | LDR position 7 = m or c | bf:contribution - bf:PrimaryContribution |
F11 | Main entry (author, corporate body, meeting) | 2 | 110 a,b,c,d,e,n; 111 a,c,d,e,n,q; 130 a,d,l,m,n,o,p,r,s,t | dc.creator | LDR position 7 = a (component part) or i (integrating resource) or s (serial) | bf:contribution - bf:PrimaryContribution |
F13 | Control Number | 1+2 | 001 | N/A | 001 | MMSID (bf:adminMetadata - bf:AdminMetadata - bf:identifiedBy - bf:Local - rdf:value - bf:source "ALMA") |
F50* | Work URI | 1+2 | N/A | N/A | N/A | bf:Work - RDF:about |
* – To enable this key field, open a Support ticket to activate Library of Congress Bibframe Work and Bibframe Instance in the Active Registry.
Mapping FRBR Key Fields
Per the MARC21, UNIMARC, and DC data formats, the following table shows how the fields from each source record are mapped and stored in Primo VE for FRBR keys.
Field ID | Field Content | MARC21 Field | DC Field | UNIMARC Field | BIBFRAME Field |
---|---|---|---|---|---|
T | Always 1 | t=1 | t=1 | t=1 | t=1 |
K1 | Author | 100 a, b, c, q OR (in case no 100, 110 and 111 exist): | dc.creator | 700 a-d,f,p | bf:PrimaryContribution |
K2 | Uniform Title | 130 a, d, m, n, p, r | dcterms.alternative | 500 a,h-n,q-s | bf:expressionOf |
K3 | Title | If 245 a, b, e, f, g, n, p exists, and up to one of the following fields based on priority: 240 a, d, m, n, p, r | dc.title | 200 a,c,e,h,i | bf:title - bf:Title |
K4 | Language | Based on the subfields provided from each 041 field, the languages are sorted, concatenated, and delimited with a hyphen as follows: Otherwise, use 008 position 35-37. | dc.language If multiple languages are specified, use only the last value taken. | 101 a If multiple languages are specified, use only the last value taken. | bf:language - bf:Language |
K5 | Control Number | 001 | N/A | 001 | 001 |
K50* | Work URI | N/A | N/A | N/A | work URI |
K51* | ExpressionOf URI | N/A | N/A | N/A | expression of URI |
* – To enable this key field, open a Support ticket to activate Library of Congress Bibframe Work and Bibframe Instance in the Active Registry.
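The K4 (Language) rule for MARC21 in the table above can be sketched as follows. This is a minimal illustration that assumes the language codes have already been extracted from the 041 subfields; the `language_key` function and its parameters are hypothetical names. The fallback slice follows the MARC convention of zero-based character positions in the 008 field.

```python
def language_key(codes_041, field_008=""):
    """Build a language value: sorted 041 codes joined by hyphens, else 008/35-37."""
    codes = [code.strip() for code in codes_041 if code.strip()]
    if codes:
        # Sort the codes, then concatenate them with a hyphen delimiter.
        return "-".join(sorted(codes))
    # No 041 data: fall back to positions 35-37 of the 008 field (zero-based).
    return field_008[35:38]


print(language_key(["fre", "eng"]))  # eng-fre
```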
General Key Creation Information
This section describes how the database fields are normalized and combined to create the Dedup and FRBR keys.
The path for each Dedup and FRBR field in Primo VE uses the following format:
Examples:
- Dedup: match/f1
- FRBR: frbr/k1
Concatenating Fields
A key can be created by concatenating several fields, which is indicated by placing a plus sign (+) between the fields. The system takes only the first available MARC field for each Field ID. If there are multiple entries for the same field, the system creates multiple keys. If the key is generated from more than one field, the system creates all the possible combinations of the fields' entries.
Example for Dedup concatenation (key definition = match/f1 + match/f7):
- f1: Includes the following entries: a, b
- f7: Includes the following entries: c, d
- Keys created: ac, ad, bc, and bd
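The combination behavior in this example is a Cartesian product of the field entries. A minimal Python sketch (the `combine` helper is a hypothetical name, not part of Primo VE):

```python
from itertools import product


def combine(*field_entries):
    """Build every key formed by taking one entry from each field, in order."""
    return ["".join(combo) for combo in product(*field_entries)]


# f1 has entries a and b; f7 has entries c and d.
print(combine(["a", "b"], ["c", "d"]))  # ['ac', 'ad', 'bc', 'bd']
```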
Normalization Methods
As keys are calculated, any of the following normalization methods may be applied to the field's values using the <method>(<field name>) format:
Normalization Method | Description |
---|---|
FUZZY_STRING | Uses the first five words of the field's value. |
ROUND_NUMBER | Rounds the last digit of the field's value down to 0. Examples: |
REMOVE_COMMON_WORDS | Removes the following common words from the field's value: advertiser, anales, annual budget, annual financial report, annual report, annual reports, annual report for, annual report for the fiscal year ended, biennial report, boletin, budget, bulletin, calendar, catalogue, circular, comprehensive annual financial report, !compilation of selected acts, compilation of selected acts within the jurisdiction of the committee on commerce, conference proceedings, conference record, directory, fact sheet, final budget, financial report, general catalog, jahresbericht, journal, language sciences, laws, etc, legislative calendar, legislative summary, membership directory, memoires, memoria, minutes, mitteilungen, monograph, monthly bulletin, news release, newsletter, occasional paper, occasional papers, proceedings, proceedings of the annual meeting, progress report, proposed budget, publicaciones, publication, publications, rapport, rapport annuel, report, report and accounts, research report, revista, semi annual report to the congress, semiannual report to the congress, session laws, statistical report, technical bulletin, technical report, transactions, travaux, trudy, update, veroeffentlichungen, veroffentlichungen, works, year book, yearbook, telephone directory |
SPLIT | When a record has multiple identifiers (such as ISSN/ISBN), this method creates separate keys for each identifier so that the Dedup process can match other records that have one or more of the same identifiers. Otherwise, duplicate records would have to match all the identifiers. |
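The following Python sketch approximates three of these methods. It rests on assumptions: "words" are taken to be whitespace-separated, the "last digit" is taken to be the final digit character in the value, and the sample inputs are invented for illustration. REMOVE_COMMON_WORDS is omitted because its exact matching behavior is not specified here.

```python
import re


def fuzzy_string(value):
    """FUZZY_STRING: keep only the first five whitespace-separated words."""
    return " ".join(value.split()[:5])


def round_number(value):
    """ROUND_NUMBER: replace the final digit in the value with 0."""
    # Match a digit that is followed only by non-digit characters.
    return re.sub(r"\d(?=\D*$)", "0", value, count=1)


def split_keys(identifiers, rest):
    """SPLIT: build one key per identifier instead of one key for all of them."""
    return [identifier + rest for identifier in identifiers]


print(fuzzy_string("the quick brown fox jumps over the lazy dog"))
print(round_number("123 p."))
print(split_keys(["isbn1", "isbn2"], "title2001"))
```

With SPLIT, two records that share only one of several identifiers can still produce an identical key.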
Optional Fields
When creating a key from several fields, some fields do not have to exist to create the key and are considered optional. Optional fields are enclosed by square brackets ([]). For example: match/f1 + [match/f7]
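A minimal sketch of how optional fields could affect key creation, assuming a missing required field prevents the key from being built while a missing optional field is skipped. The `build_key` helper and its input shape are hypothetical.

```python
def build_key(parts):
    """Build a key from (value, optional) pairs; value is None when the field is missing."""
    pieces = []
    for value, optional in parts:
        if value is None:
            if optional:
                continue  # Optional field is absent: leave it out of the key.
            return None  # Required field is absent: no key can be created.
        pieces.append(value)
    return "".join(pieces)


# match/f1 + [match/f7], where f7 is missing from the record.
print(build_key([("lccn123", False), (None, True)]))  # lccn123
```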
Key Priority
Every key has a priority, which may be one of the following: high, medium, low, or none (less than low). Refer to the key definitions below to see how it affects the matching algorithm.
Key Type
Every key has a type that indicates to the system which key definitions to use for the record.
For Dedup, the type is defined in the match/t field, which can contain one of the following values:
- 1 – Indicates that the record is not a serial record.
- 2 – Indicates that the record is a serial record.
For FRBR, the type is defined in the frbr/t field, which can contain one of the following values:
- 99 – Indicates that the record has been suppressed using the Prevent FRBR and/or Dedup in Discovery job on the Run a Job page (Admin > Manage Jobs and Sets > Run a Job).
- 1 – Indicates that the record has not been suppressed and should be FRBRized.
Dedup Key Definitions
The following table shows how each field in Primo VE is normalized and concatenated to form each Dedup key based on the record type.
Complete key | Type | Description |
---|---|---|
match/c5 | 1 | External system number |
match/f1 + match/f5 + match/f6 | 1 | LCCN + brief title + year |
match/f1 + FUZZY_STRING(match/f7) + match/f6 | 1 | LCCN + fuzzy title + year |
match/f1 + match/f7 + match/f6 | 1 | LCCN + full title + year |
match/f3 + match/f5 + match/f6 | 1 | ISBN + brief title + date |
match/f3 + FUZZY_STRING(match/f7) + match/f6 | 1 | ISBN + fuzzy title + date |
match/f3 + match/f7 + match/f9 | 1 | ISBN + full title + pagination |
SPLIT(match/f3) + match/f5 + match/f6 | 1 | ISBN1 + brief title + date; ISBN2 + brief title + date |
SPLIT(match/f3) + FUZZY_STRING(match/f7) + match/f6 | 1 | ISBN1 + fuzzy title + date; ISBN2 + fuzzy title + date |
SPLIT(match/f3) + match/f7 + match/f9 | 1 | ISBN1 + full title + pagination; ISBN2 + full title + pagination |
match/f4 + match/f7 + match/f6 | 1 | Incorrect ISBN + full title + date |
match/f4 + match/f7 + match/f9 | 1 | Incorrect ISBN + full title + pagination |
SPLIT(match/f4) + match/f7 + match/f6 | 1 | Incorrect ISBN1 + full title + date; Incorrect ISBN2 + full title + date |
SPLIT(match/f4) + match/f7 + match/f9 | 1 | Incorrect ISBN1 + full title + pagination; Incorrect ISBN2 + full title + pagination |
match/f7 + match/f11 + match/f6 + match/f9 | 1 | full title + main entry + date + pagination |
match/f7 + match/f11 + match/f6 + ROUND_NUMBER(match/f9) | 1 | full title + main entry + date + fuzzy pagination |
match/f7 + match/f6 + match/f10 + match/f9 + [match/f11] | 1 | Full title + date + publisher + pagination + [main entry] |
match/f7 + match/f6 + match/f10 + ROUND_NUMBER(match/f9) + [match/f11] | 1 | Full title + date + publisher + fuzzy pagination + [main entry] |
match/f7 + match/f6 + match/f9 + [match/f11] | 1 | Full title + date + pagination + [main entry] |
match/f7 + match/f6 + ROUND_NUMBER(match/f9) + [match/f11] | 1 | Full title + date + fuzzy pagination + [main entry] |
match/f7 + match/f6 + match/f10 + [match/f11] | 1 | Full title + date + publisher + [main entry] |
match/c5 | 2 | MMS Id |
match/f1 + match/f8 | 2 | LCCN + brief title |
match/f3 + match/f8 | 2 | ISSN + brief title |
SPLIT(match/f3) + match/f8 | 1 | ISSN1 + brief title; ISSN2 + brief title |
REMOVE_COMMON_WORDS(match/f7) + match/f10 + match/f9 + [match/f11] | 2 | Full title (not in common serials list) + place of publication + country of publication + Main entry conditional |
REMOVE_COMMON_WORDS(match/f7) + match/f10 + match/f9 + [match/f11] + [SPLIT(match/f3)] | 2 | Full title (not in common serials list) + place of publication + main heading (conditional) + ISSN1 conditional; Full title (not in common serials list) + place of publication + main heading (conditional) + ISSN2 conditional |
REMOVE_COMMON_WORDS(match/f7) + match/f6 + match/f11 | 2 | Full title (not in common serials list) + date + main entry |
match/f7 + match/f6 + match/f11 + match/f10 | 2 | Full title (can be in list of common titles) + date + main entry + place of publication |
REMOVE_COMMON_WORDS(match/f7) + match/f6 + match/f11 + match/f10 | 2 | Fuzzy title (not in list of common titles) + date + main entry + place of publication |
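To read the notation used in the Complete key column, the following Python sketch breaks a key definition into its components: the field path, the normalization method (if any), and whether the component is optional. The `Component` type and `parse_key_definition` function are hypothetical helpers written for this illustration; they do not reflect how Primo VE stores or evaluates key definitions.

```python
import re
from typing import NamedTuple, Optional


class Component(NamedTuple):
    field: str             # For example, "match/f7"
    method: Optional[str]  # For example, "ROUND_NUMBER", or None
    optional: bool         # True when the component was enclosed in [ ]


# Either METHOD(field) or a bare field path.
_PATTERN = re.compile(
    r"^(?:(?P<method>[A-Z_]+)\((?P<inner>[^()]+)\)|(?P<plain>[^()\[\]]+))$"
)


def parse_key_definition(definition):
    """Split a definition on '+' and classify each component."""
    components = []
    for raw in definition.split("+"):
        token = raw.strip()
        optional = token.startswith("[") and token.endswith("]")
        if optional:
            token = token[1:-1].strip()
        match = _PATTERN.match(token)
        if match is None:
            raise ValueError(f"unrecognized component: {raw.strip()!r}")
        field = (match.group("inner") or match.group("plain")).strip()
        components.append(Component(field, match.group("method"), optional))
    return components


for part in parse_key_definition(
    "match/f7 + match/f6 + ROUND_NUMBER(match/f9) + [match/f11]"
):
    print(part)
```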
FRBR Key Definitions
The following table shows how each field in Primo VE is normalized and concatenated to form each FRBR key. The match/group key is not generated from any MARC21 fields. Instead, it is built dynamically from the result of the Dedup calculation, which ensures that every record that is deduped is also FRBRized (in particular, any dedup group is entirely contained within a FRBR group).
Complete key | Type | Description | Priority |
---|---|---|---|
match/group | 1 | Dedup key | HIGH |
frbr/k2 | 1 | Uniform Title only | MEDIUM |
frbr/k1 + frbr/k3 | 1 | Author + title | NONE |
match/group | 99 | Dedup key | HIGH |
Calculating the Group ID
The system hashes the keys into a number and then adds in a coefficient, which is based on the key's priority, to ensure that keys with higher priorities have larger numbers. After a group ID is chosen for a specific record, all the keys that were calculated from that record are saved with the group ID to ensure that all the keys point to that group ID. This allows records that have identical keys to be assigned the same group ID.
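The idea of a hash plus a priority coefficient can be sketched as follows. The hash function, the hash range, and the coefficient values are all assumptions chosen for illustration; the actual values used by Primo VE are not documented here. The only property the sketch preserves is that a key with a higher priority always produces a larger number than a key with a lower priority.

```python
import hashlib

# Hypothetical values: hashes are confined to a fixed range, and each
# priority band is offset so that it sits entirely above the band below it.
HASH_RANGE = 10**12
PRIORITY_COEFFICIENT = {"none": 0, "low": 1, "medium": 2, "high": 3}


def key_value(key, priority):
    """Hash the key into a number, then add the priority-based offset."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    hashed = int.from_bytes(digest[:8], "big") % HASH_RANGE
    return hashed + PRIORITY_COEFFICIENT[priority] * HASH_RANGE


# The same key and priority always yield the same number.
print(key_value("some uniform title", "medium") == key_value("some uniform title", "medium"))
```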
Dedup/FRBR Process Flow
The system performs the following steps for both Dedup and FRBR processes, starting with the Dedup process:
1. Calculate all the keys from the source record.
2. Find all the keys that were calculated in step 1 that already exist in the database (from previous calculations). If there are keys, the system selects the group ID that is associated with the key that has the highest value (this prioritizes keys with higher priority). If the record matches more than one group and the keys have the same priority, the system selects the group ID randomly. If there is no existing key in the database, the system selects the key with the highest value and uses it as the group ID.
3. For all keys calculated in step 1, the system stores the selected group ID. This enables other records to connect to recently processed records.
The algorithm is not fully transitive, which means that two records may have a matching common record but are not considered matches themselves. Because of this, the order in which the records are processed can affect the groupings. For more information, see Dedup and Transitivity.
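The three steps of the process flow can be sketched in Python as follows. This is a simplified model under stated assumptions: a plain dictionary stands in for the database, each key's numeric value (hash plus priority coefficient) is supplied by the caller, and ties are resolved by whichever key `max` encounters first rather than randomly. The `assign_group` name is hypothetical.

```python
def assign_group(record_keys, key_to_group):
    """Pick a group ID for one record and store it for all of the record's keys.

    record_keys: dict mapping each key (step 1) to its numeric value.
    key_to_group: dict mapping known keys to group IDs (stands in for the database).
    """
    # Step 2: find the record's keys that already exist in the database.
    existing = [key for key in record_keys if key in key_to_group]
    if existing:
        # Use the group of the matching key with the highest value.
        best = max(existing, key=lambda key: record_keys[key])
        group_id = key_to_group[best]
    else:
        # No match: the highest key value becomes the new group ID.
        group_id = max(record_keys.values())
    # Step 3: store the selected group ID for every key of this record.
    for key in record_keys:
        key_to_group[key] = group_id
    return group_id


database = {}
print(assign_group({"k1": 300, "k2": 100}, database))  # 300
print(assign_group({"k2": 100, "k3": 250}, database))  # 300
```

In the usage above, the second record shares key k2 with the first record, so it joins the existing group instead of starting a new one.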
Dedup and Transitivity
The following example shows how the order in which records are processed can affect the resulting groups:
- Record B is added to the database and no match is found.
- Record C is added to the database and none of its keys match record B.
- Record A is added to the database and one of its keys matches a key in record B first. Because a match was found, the system stops processing the record and returns the following groups even if record A also matches record C: AB and C.
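This order dependence can be reproduced with a small simulation. The sketch below is deliberately simplified and is not the actual algorithm: it ignores key priorities, takes the first matching key, and uses the name of the first record in a group as the group ID. The key names k1 and k2 are invented.

```python
def group_records(records):
    """Process (name, keys) pairs in order and return the resulting groups."""
    key_to_group = {}
    groups = {}
    for name, keys in records:
        # Stop at the first key that already belongs to a group.
        group_id = next((key_to_group[k] for k in keys if k in key_to_group), name)
        # Point all of this record's keys at the chosen group.
        for k in keys:
            key_to_group[k] = group_id
        groups.setdefault(group_id, []).append(name)
    return sorted("".join(sorted(members)) for members in groups.values())


A = ("A", ["k1", "k2"])  # Shares k1 with B and k2 with C.
B = ("B", ["k1"])
C = ("C", ["k2"])

print(group_records([B, C, A]))  # ['AB', 'C']
print(group_records([A, B, C]))  # ['ABC']
```

Processing A first lets both B and C attach to A's group, whereas processing A last leaves C in its own group.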
Suppressing FRBR/Dedup Groups
You can use the following methods to prevent records from being deduped or FRBRized:
- Define suppression rules, which are based on types of records (such as images) and are applied during re-indexing when new records are loaded or when an existing record's metadata is modified. For more details, see Suppressing Groups of Records from Dedup and FRBR. If you want to suppress existing records, see the next option.
- Run the Prevent FRBR and/or Dedup in Discovery job on a set of bibliographic records (Admin > Manage Jobs and Sets > Run a Job). For more information, see Running Manual Jobs on Defined Sets.
If you want to prevent the display of FRBR and Dedup groups for all records without changing any rules or running any jobs, you can disable FRBR/Dedup by configuring the Enable Dedup and Enable FRBR fields on the Brief Results tab on the View Configuration page (Configuration Menu > Discovery > Display Configuration > Configure Views). For more information, see Brief Results Tab.
Recalculating Dedup and FRBR Groups
FRBR and Dedup keys, which are built initially at record creation and/or migration, may need to be recalculated after you modify a bibliographic record. For information on how to recalculate your keys, see Recalculate Dedup and FRBR Groups.
Dedup and FRBR groups may not be recalculated after the following occur:
- Re-indexing a bibliographic record using the Recalculate Local Resource Types job.
- The semi-annual re-indexing process has completed.