Understanding the Dedup and FRBR Processes (Primo VE)

Duplication Detection (Dedup)

The duplication detection or matching process is based on creating a dedup vector for every Alma record (and all imported records from external sources). The vector includes all the data required by the Duplication Detection algorithm to determine if two records are equivalent. The vectors include one or more keys that identify the record. Keys are based on the metadata fields in the source record (MARC or DC records).

In the search results, records that are marked as duplicates are displayed as a single record. The metadata is displayed from the first record in the results set, and the delivery-related information is created from all the records in the group.

FRBR

Primo VE records are grouped using the principles in the Functional Requirements for Bibliographic Records published by the IFLA Study Group on the Functional Requirements for Bibliographic Records. The grouping process creates a vector for every Alma record (and all imported records from external sources). This vector includes one or more keys that identify the group it represents. Keys are based on the metadata fields in the source record (MARC or DC records).

Records that have a matching key (Primo VE attempts to match all keys in the record) are added to a FRBR group and assigned a FRBR ID. Because records can belong to one FRBR group only, Primo VE terminates the grouping process for a record as soon as it is matched with a group.

The group IDs are indexed and used during searches to group search results by the same group ID. A record that belongs to a FRBR group displays the "Multiple versions exist. See all versions" link in the brief results, which enables users to see all records that have been grouped together. The record that displays in the brief results can be configured to display either information for a preferred record or generic information that pertains to all records in the group. After a search is performed, results are grouped by FRBR ID. After the user selects the "Multiple versions exist. See all versions" link, the results are grouped by Dedup ID.

How Dedup and FRBR Work Together

The Dedup and FRBR matching processes are very similar, but the content of the keys is different. The system first looks for duplicate records and then looks for FRBR groups. Because the Dedup ID is used as a key for the FRBR process, a record that is Deduped is also FRBRized.

Calculating the Keys

The Dedup and FRBR keys determine whether records match. They are generated by extracting specific data elements from the bibliographic record and combining them into keys. The sections below first describe how the key information in the MARC21 fields is mapped to the database in Primo VE and then describe how the elements are combined to create the Dedup and FRBR keys.

Mapping Dedup Key Fields

Per the MARC21, UNIMARC, and DC data formats, the following table shows how the fields from each source record are mapped and stored in Primo VE for Dedup keys.

For Dublin Core, the identifier keys (such as ISSN, ISBN and LCCN) must include the type of identifier as follows:

<dcterms:identifier xsi:type="dcterms:<Type_Of_Identifier>">

<Identifier>

</dcterms:identifier>

Key	Field Content	Type	MARC 21 field	DC Field	UNIMARC Field	BIBFRAME Field
C5	Control nr	1+2	035 a,z	MMS ID (Alma data) Not applicable for external data.	035 a,z	bf:identifiedBy - bf:Local - rdf:value
F1	LCCN	1+2	010 a	dcterms:identifier dcterms:LCCN	N/A	bf:AdminMetadata - bf:identifiedBy - bf:Lccn
F3	ISBN (13 or 10 digits) In cases in which a record has more than one ISBN, an F3 key is generated for each ISBN instead of appending them together.	1	020 a,e 776 z	dcterms:identifier dcterms:ISBN	010 a	bf:identifiedBy - bf:Isbn
F3	ISSN	2	022 a,e 776 x	dcterms:identifier dcterms:ISSN dcterms:identifier dc:ISSN	011 a	bf:identifiedBy - bf:Issn
F4	ISBN_invalid	1	020 z	N/A	010 z	bf:identifiedBy - bf:Isbn - rdf:value - bf:status- bf:Status with URI "http://id.loc.gov/vocabulary/mstatus/incorrect"
F4	ISSN_invalid	2	022 y	N/A	011 y	bf:identifiedBy - bf:Issn - rdf:value - bf:status- bf:Status with URI "http://id.loc.gov/vocabulary/mstatus/cancinv"
F5	Brief Title	1	245 a,b,n,p	dc.title dcterms.title	200 a	bf:title - bf:Title
F5	ISSN_cancelled	2	022 z	N/A	011 z	bf:identifiedBy - bf:Issn - rdf:value - bf:status- bf:Status with URI "http://id.loc.gov/vocabulary/mstatus/cancinv"
F6	Start publication year	1+2	008 (position 7-10) 260 c 264 c	dc.date dcterms.date dcterms.created	210 d 100 a (position 9-16)	bf:provisionActivity - bf:Publication - bf:date/ bflc:simpleDate with rdf:datatype="http://id.loc.gov/datatypes/edtf"
F7	Full title	1+2	245 a,b,n,p	dc.title dcterms.title	200 a,e,d,h,i	bf:title - bf:Title
F8	Country of publication	1	008 (position 15-17)	N/A	LDR position 7 = m or c AND MARC is "102"."a"	bf:provisionActivity - bf:Publication - bf:place
F8	Brief title	2	245 a	dc.title dcterms.title	LDR position 7 = a (component part) or i (integrating resource) or s (serial) AND MARC is "200"."a"	bf:title - bf:Title
F9	Pagination	1	300 a	N/A	LDR position 7 = m or c AND MARC is "215"."a"	N/A
F9	Country of publication	2	008 (position 15-17)	N/A	LDR position 7 = a (component part) or i (integrating resource) or s (serial) AND MARC is "102"."a"	bf:provisionActivity - bf:Publication - bf:place
F10	Publisher	1	260 b 264 b	dcterms.publisher dc.publisher	LDR position 7 = m or c AND MARC is "210"."c"	bf:provisionActivity - bf:Publication - bflc:simpleAgent
F10	Place of publication	2	260 a 264 a	N/A	LDR position 7 = a (component part) or i (integrating resource) or s (serial) AND MARC is "200"."a"	bf:provisionActivity - bf:Publication - bflc:simplePlace
F11	Main entry (author, corporate body, meeting)	1	100 a,b,c,d,q 111 a,c,d,e,n,q	dc.creator dcterms.creator dcterms.contributor dc.contributor	LDR position 7 = m or c AND 700 a,b,c,d,f 710 1st ind. = 1 a-h	bf:contribution - bf:PrimaryContribution
F11	Main entry (author, corporate body, meeting)	2	110 a,b,c,d,e,n 111 a,c,d,e,n,q 130 a,d,l,m,n,o,p,r,s,t	dc.creator dcterms.creator dcterms.contributor dc.contributor	LDR position 7 = a (component part) or i (integrating resource) or s (serial) AND 710 1st ind. = 0 a,b,c,g,h 710 1st ind. = 1 a-h 500 a,b,h,i,k,l,m	bf:contribution - bf:PrimaryContribution
F13	Control Number	1+2	001	N/A	001	MMSID (bf:adminMetadata - bf:AdminMetadata - bf:identifiedBy - bf:Local - rdf:value - bf:source "ALMA")
F50*	Work URI	1+2	N/A	N/A	N/A	bf:Work - RDF:about

* – To enable this key field, open a Support ticket to activate Library of Congress Bibframe Work and Bibframe Instance in the Active Registry.
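As an illustration of the position-based mappings in the table above (F6 and F8), the following Python sketch slices a MARC 21 008 field. MARC numbers 008 character positions from 00, so positions 7-10 and 15-17 correspond to the Python slices [7:11] and [15:18]. The function names and the sample 008 value are hypothetical and are not part of Primo VE.

```python
def start_year(field_008):
    """Return 008 positions 7-10 (Date 1), the source for key field F6."""
    return field_008[7:11]


def country_of_publication(field_008):
    """Return 008 positions 15-17 (place of publication), the source for F8/F9."""
    return field_008[15:18]


# Hypothetical 40-character 008 value, assembled piece by piece so that
# the character positions are easy to verify.
sample_008 = "".join([
    "850423",   # 00-05: date entered on file
    "s",        # 06: type of date
    "1985",     # 07-10: Date 1
    "    ",     # 11-14: Date 2 (blank)
    "nyu",      # 15-17: place of publication
    " " * 17,   # 18-34: material-specific positions (blank here)
    "eng",      # 35-37: language
    " ",        # 38: modified record
    "d",        # 39: cataloging source
])

print(start_year(sample_008))              # 1985
print(country_of_publication(sample_008))  # nyu
```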

Mapping FRBR Key Fields

Per the MARC21, UNIMARC, and DC data formats, the following table shows how the fields from each source record are mapped and stored in Primo VE for FRBR keys.

Field ID	Field Content	MARC21 Field	DC Field	UNIMARC Field	BIBFRAME Field
T	Always 1	t=1	t=1	t=1	t=1
K1	Author	100 a, b, c, q 110 a, b, c, q 111 a, b, c, n, q OR (in case no 100, 110 and 111 exist): 700 a, b, c, q 710 a, b, c, q 711 a, b, c, n, q	dc.creator dcterms.creator dcterms.contributor dc.contributor	700 a-d,f,p 701 a-d,f,p 710 a-h,p 711 a-h,p 720 a,f 721 a,f 702 a-d,f,g 712 a-h,p 722 a,f	bf:PrimaryContribution
K2	Uniform Title	130 a, d, m, n, p, r	dcterms.alternative	500 a,h-n,q-s	bf:expressionOf
K3	Title	If 245 a, b, e, f, g, n, p exists, and up to one of the following fields based on priority: 240 a, d, m, n, p, r 242 a, b, f, g, n, p 246 a, b, f, g, n, p 247 a, b, f, g, n, p 740 a, b, f, g, n, p	dc.title dcterms.title	200 a,c,e,h,i 510 a,e 512 a,e 513 a,e,h,i 514 a,e 515 a,e 516 a,e 517 a,e 520 a,e,h,i 541 a,e,h,i	bf:title - bf:Title
K4	Language	Based on the subfields provided from each 041 field, the languages are sorted, concatenated, and delimited with a hyphen as follows: If subfield a is included, use all a subfields only. Else if subfield d is included, use all d subfields only. Else if subfield e is included, use all e subfields only. Otherwise, use 008 position 35-37.	dc.language dcterms.language If multiple languages are specified, use only the last value taken.	101 a If multiple languages are specified, use only the last value taken.	bf:language - bf:Language
K5	Control Number	001	N/A	001	001
K50*	Work URI	N/A	N/A	N/A	work URI
K51*	ExpressionOf URI	N/A	N/A	N/A	expression of URI

* – To enable this key field, open a Support ticket to activate Library of Congress Bibframe Work and Bibframe Instance in the Active Registry.
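The K4 (Language) rule for MARC21 in the table above can be sketched in Python as follows. This is a minimal illustration, not the Primo VE implementation: the function name, the representation of 041 fields as lists of (subfield code, value) pairs, and the choice to pool subfields across all 041 fields before applying the a/d/e precedence are assumptions.

```python
def language_key(fields_041, lang_008=""):
    """Build a K4-style language value.

    fields_041: list of 041 fields, each a list of (subfield_code, value) pairs.
    lang_008:   the value of 008 positions 35-37, used only as a fallback.
    """
    # Assumption: subfields from every 041 field are pooled together.
    subfields = [pair for field in fields_041 for pair in field]
    for code in ("a", "d", "e"):
        values = [value for sub_code, value in subfields if sub_code == code]
        if values:
            # Sort, then concatenate with a hyphen delimiter.
            return "-".join(sorted(values))
    return lang_008


print(language_key([[("a", "fre"), ("a", "eng"), ("d", "ger")]]))  # eng-fre
print(language_key([[("d", "ger")]]))                              # ger
print(language_key([], lang_008="ita"))                            # ita
```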

General Key Creation Information

This section describes how the database fields are normalized and combined to create the Dedup and FRBR keys.

The path for each Dedup and FRBR field in Primo VE uses the following format:

Examples:

Dedup: match/f1
FRBR: frbr/k1

Concatenating Fields

A key can be created by concatenating several fields, which is indicated by placing a plus sign (+) between the fields. The system takes only the first available MARC field for each Field ID. If there are multiple entries for the same field, the system creates multiple keys. If the key is generated from more than one field, the system creates all the possible combinations of the fields' entries.

Example for Dedup concatenation (key definition = match/f1 + match/f7):

f1: Includes the following entries: a, b
f7: Includes the following entries: c, d
Keys created: ac, ad, bc, and bd
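The combination rule in this example amounts to a Cartesian product of the entries of each field. The Python sketch below reproduces the example. The function name build_keys and the dictionary layout are hypothetical, chosen only for illustration.

```python
from itertools import product


def build_keys(field_entries, definition):
    """Return every concatenation of one entry per field, in definition order.

    field_entries: maps a field path (for example "match/f1") to its entries.
    definition:    the ordered list of field paths that make up the key.
    """
    entry_lists = [field_entries[path] for path in definition]
    return ["".join(combination) for combination in product(*entry_lists)]


fields = {"match/f1": ["a", "b"], "match/f7": ["c", "d"]}
keys = build_keys(fields, ["match/f1", "match/f7"])
print(keys)  # ['ac', 'ad', 'bc', 'bd']
```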

Normalization Methods

As keys are calculated, any of the following normalization methods may be applied to the field's values using the <method>(<field name>) format:

Normalization Method	Description
FUZZY_STRING	Uses the first five words of the field's value.
ROUND_NUMBER	Rounds the last digit of the field's value down to 0. Examples: 11 rounds down to 10; 199 rounds down to 190.
REMOVE_COMMON_WORDS	Removes the following common words from the field's value: advertiser, anales, annual budget, annual financial report, annual report, annual reports, annual report for, annual report for the fiscal year ended, biennial report, boletin, budget, bulletin, calendar, catalogue, circular, comprehensive annual financial report, !compilation of selected acts, compilation of selected acts within the jurisdiction of the committee on commerce, conference proceedings, conference record, directory, fact sheet, final budget, financial report, general catalog, jahresbericht, journal, language sciences, laws, etc, legislative calendar, legislative summary, membership directory, memoires, memoria, minutes, mitteilungen, monograph, monthly bulletin, news release, newsletter, occasional paper, occasional papers, proceedings, proceedings of the annual meeting, progress report, proposed budget, publicaciones, publication, publications, rapport, rapport annuel, report, report and accounts, research report, revista, semi annual report to the congress, semiannual report to the congress, session laws, statistical report, technical bulletin, technical report, transactions, travaux, trudy, update, veroeffentlichungen, veroffentlichungen, works, year book, yearbook, telephone directory
SPLIT	When a record has multiple identifiers (such as ISSN/ISBN), this method creates separate keys for each identifier so that the Dedup process can match other records that have one or more of the same identifiers. Otherwise, duplicate records would have to match all the identifiers.
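Three of the methods in the table above can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that words are separated by whitespace and that the pagination value has already been reduced to digits. REMOVE_COMMON_WORDS is omitted because the exact removal behavior is not specified here.

```python
def fuzzy_string(value):
    """FUZZY_STRING: keep only the first five words of the value."""
    return " ".join(value.split()[:5])


def round_number(value):
    """ROUND_NUMBER: round the last digit down to 0."""
    return (int(value) // 10) * 10


def split_keys(identifiers, *other_parts):
    """SPLIT: build one key per identifier instead of one key for all of them."""
    suffix = "".join(other_parts)
    return [identifier + suffix for identifier in identifiers]


print(fuzzy_string("the quick brown fox jumps over the lazy dog"))
# the quick brown fox jumps
print(round_number("11"), round_number("199"))  # 10 190
print(split_keys(["9780306406157", "0306406152"], "title", "1999"))
# ['9780306406157title1999', '0306406152title1999']
```

With one key per identifier, two records that share any single ISBN or ISSN (plus the other key parts) can match, which is the behavior the SPLIT description calls for.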

Optional Fields

When creating a key from several fields, some fields do not have to exist to create the key and are considered optional. Optional fields are enclosed by square brackets ([]). For example: match/f1 + [match/f7]
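One plausible reading of optional fields is sketched below in Python: a missing required field prevents the key from being built, while a missing optional field is simply left out. The function name and data layout are hypothetical, and whether Primo VE also builds a second key without the optional field when that field is present is not stated here.

```python
from itertools import product


def build_keys(field_entries, required, optional=()):
    """Build keys from required fields plus any optional fields that have entries."""
    # A missing required field means that no key can be built.
    if any(not field_entries.get(path) for path in required):
        return []
    entry_lists = [field_entries[path] for path in required]
    # Optional fields contribute only when they actually have entries.
    entry_lists += [field_entries[path] for path in optional if field_entries.get(path)]
    return ["".join(combination) for combination in product(*entry_lists)]


# Key definition: match/f1 + [match/f7]
print(build_keys({"match/f1": ["a"], "match/f7": ["c"]}, ["match/f1"], ["match/f7"]))  # ['ac']
print(build_keys({"match/f1": ["a"]}, ["match/f1"], ["match/f7"]))                     # ['a']
print(build_keys({"match/f7": ["c"]}, ["match/f1"], ["match/f7"]))                     # []
```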

Key Priority

Every key has a priority, which may be one of the following: high, medium, low, or none (less than low). Refer to the key definitions below to see how it affects the matching algorithm.

Key Type

Every key has a type that indicates to the system which key definitions to use for the record.

For Dedup, the type is defined in the match/t field, which can contain one of the following values:

1 – Indicates that the record is not a serial record.
2 – Indicates that the record is a serial record.

For FRBR, the type is defined in the frbr/t field, which can contain one of the following values:

99 – Indicates that the record has been suppressed using the Prevent FRBR and/or Dedup in Discovery job on the Run a Job page (Admin > Manage Jobs and Sets > Run a Job).
1 – Indicates that the record has not been suppressed and should be FRBRized.

Dedup Key Definitions

The following table shows how each field in Primo VE is normalized and concatenated to form each Dedup key based on the record type.

Complete key	Type	Description
match/c5	1	External system number
match/f1 + match/f5 + match/f6	1	LCCN + brief title + year
match/f1 + FUZZY_STRING(match/f7) + match/f6	1	LCCN + fuzzy title + year
match/f1 + match/f7 + match/f6	1	LCCN + full title + year
match/f3 + match/f5 + match/f6	1	ISBN + brief title + date
match/f3 + FUZZY_STRING(match/f7) + match/f6	1	ISBN + fuzzy title + date
match/f3 + match/f7 + match/f9	1	ISBN + full title + pagination
SPLIT(match/f3) + match/f5 + match/f6	1	ISBN1 + brief title + date; ISBN2 + brief title + date
SPLIT(match/f3) + FUZZY_STRING(match/f7) + match/f6	1	ISBN1 + fuzzy title + date; ISBN2 + fuzzy title + date
SPLIT(match/f3) + match/f7 + match/f9	1	ISBN1 + full title + pagination; ISBN2 + full title + pagination
match/f4 + match/f7 + match/f6	1	Incorrect ISBN + full title + date
match/f4 + match/f7 + match/f9	1	Incorrect ISBN + full title + pagination
SPLIT(match/f4) + match/f7 + match/f6	1	Incorrect ISBN1 + full title + date; Incorrect ISBN2 + full title + date
SPLIT(match/f4) + match/f7 + match/f9	1	Incorrect ISBN1 + full title + pagination; Incorrect ISBN2 + full title + pagination
match/f7 + match/f11 + match/f6 + match/f9	1	full title + main entry + date + pagination
match/f7 + match/f11 + match/f6 + ROUND_NUMBER(match/f9)	1	full title + main entry + date + fuzzy pagination
match/f7 + match/f6 + match/f10 + match/f9 + [match/f11]	1	Full title + date + publisher + pagination + [main entry]
match/f7 + match/f6 + match/f10 + ROUND_NUMBER(match/f9) + [match/f11]	1	Full title + date + publisher + fuzzy pagination + [main entry]
match/f7 + match/f6 + match/f9 + [match/f11]	1	Full title + date + pagination + [main entry]
match/f7 + match/f6 + ROUND_NUMBER(match/f9) + [match/f11]	1	Full title + date + fuzzy pagination + [main entry]
match/f7 + match/f6 + match/f10 + [match/f11]	1	Full title + date + publisher + [main entry]
match/c5	2	MMS Id
match/f1 + match/f8	2	LCCN + brief title
match/f3 + match/f8	2	ISSN + brief title
SPLIT(match/f3) + match/f8	2	ISSN1 + brief title; ISSN2 + brief title
REMOVE_COMMON_WORDS(match/f7) + match/f10 + match/f9 + [match/f11]	2	Full title (not in common serials list) + place of publication + country of publication + Main entry conditional
REMOVE_COMMON_WORDS(match/f7) + match/f10 + match/f9 + [match/f11] + [SPLIT(match/f3)]	2	Full title (not in common serials list) + place of publication + country of publication + main heading (conditional) + ISSN1 conditional; Full title (not in common serials list) + place of publication + country of publication + main heading (conditional) + ISSN2 conditional
REMOVE_COMMON_WORDS(match/f7) + match/f6 + match/f11	2	Full title (not in common serials list) + date + main entry
match/f7 + match/f6 + match/f11 + match/f10	2	Full title (can be in list of common titles) + date + main entry + place of publication
REMOVE_COMMON_WORDS(match/f7) + match/f6 + match/f11 + match/f10	2	Fuzzy title (not in list of common titles) + date + main entry + place of publication
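The way the record type selects which key definitions apply can be sketched in Python with a small table-driven example. Only four rows of the table above are reproduced. The function name, the "|" separator, the integer type values, and the restriction to one value per field are simplifying assumptions made for illustration.

```python
# (record type, field paths) for a few rows of the Dedup key table.
KEY_DEFINITIONS = [
    (1, ["match/f1", "match/f5", "match/f6"]),  # LCCN + brief title + year
    (1, ["match/f3", "match/f5", "match/f6"]),  # ISBN + brief title + date
    (2, ["match/f1", "match/f8"]),              # LCCN + brief title
    (2, ["match/f3", "match/f8"]),              # ISSN + brief title
]


def keys_for_record(record):
    """Build the keys whose definition type matches the record's match/t value."""
    keys = []
    for key_type, paths in KEY_DEFINITIONS:
        if key_type != record["match/t"]:
            continue  # definition is for the other record type
        values = [record.get(path) for path in paths]
        if all(values):  # every field in the definition must be present
            keys.append("|".join(values))
    return keys


serial = {"match/t": 2, "match/f1": "sn123", "match/f3": "12345678", "match/f8": "nature"}
print(keys_for_record(serial))  # ['sn123|nature', '12345678|nature']
```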

FRBR Key Definitions

The following table shows how each field in Primo VE is normalized and concatenated to form each FRBR key. The match/group key is not generated from any MARC21 fields. Instead, it is built dynamically from the result of the Dedup calculation, which ensures that every record that is deduped is also FRBRized (in particular, any dedup group is entirely contained within a FRBR group).

Complete key	Type	Description	Priority
match/group	1	Dedup key	HIGH
frbr/k2	1	Uniform Title only	MEDIUM
frbr/k1 + frbr/k3	1	Author + title	NONE
match/group	99	Dedup key	HIGH

Calculating the Group ID

The system hashes the keys into a number and then adds in a coefficient, which is based on the key's priority, to ensure that keys with higher priorities have larger numbers. After a group ID is chosen for a specific record, all the keys that were calculated from that record are saved with the group ID to ensure that all the keys point to that group ID. This ensures that records with identical keys are assigned the same group ID.
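The idea of a hash plus a priority-based coefficient can be illustrated with the Python sketch below. The hash function (SHA-256 truncated to 32 bits), the rank values, and the coefficient formula are all assumptions. The actual hash and coefficients used by Primo VE are not described here.

```python
import hashlib

# Hypothetical ranks: a higher rank places the key in a higher numeric band.
PRIORITY_RANK = {"none": 0, "low": 1, "medium": 2, "high": 3}
HASH_BITS = 32


def key_value(key, priority):
    """Hash the key to a 32-bit number and add a priority-based coefficient."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    hashed = int.from_bytes(digest[:4], "big")  # 0 <= hashed < 2**32
    coefficient = PRIORITY_RANK[priority] * (1 << HASH_BITS)
    return coefficient + hashed


# Any high-priority key outranks any medium-priority key, and so on,
# because each priority occupies its own non-overlapping numeric band.
print(key_value("some key", "high") > key_value("other key", "medium"))  # True
```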

Dedup/FRBR Process Flow

The system performs the following steps for both Dedup and FRBR processes, starting with the Dedup process:

1. Calculate all the keys from the source record.
2. Find all the keys that were calculated in step 1 that already exist in the database (from previous calculations). If there are keys, the system selects the group ID that is associated with the key that has the highest value (this prioritizes keys with higher priority). If the record matches more than one group and the keys have the same priority, the system selects the group ID randomly. If there is no existing key in the database, the system selects the key with the highest value and uses it as the group ID.
3. For all keys calculated in step 1, the system stores the selected group ID. This enables other records to connect to recently processed records.
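The three steps above can be sketched in Python as follows. This is a simplified model under stated assumptions: the database is a plain dictionary mapping each key to a group ID, each key already carries its numeric value (hash plus priority coefficient), and ties are resolved by taking the first maximum rather than randomly. All names are hypothetical.

```python
def assign_group(record_keys, key_index):
    """Assign a group ID to a record.

    record_keys: maps each key calculated from the record (step 1) to its numeric value.
    key_index:   maps previously stored keys to their group IDs (the "database").
    """
    # Step 2: find the record's keys that already exist in the database.
    existing = [key for key in record_keys if key in key_index]
    if existing:
        # Use the group of the existing key that has the highest value.
        best = max(existing, key=lambda key: record_keys[key])
        group_id = key_index[best]
    else:
        # No match: the highest key value becomes the new group ID.
        group_id = max(record_keys.values())
    # Step 3: store the selected group ID for every key of the record.
    for key in record_keys:
        key_index[key] = group_id
    return group_id


index = {}
first = assign_group({"k_high": 300, "k_low": 100}, index)    # new group
second = assign_group({"k_low": 100, "k_other": 250}, index)  # joins via k_low
third = assign_group({"k_new": 50}, index)                    # new group
print(first, second, third)  # 300 300 50
```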

The algorithm is not fully transitive, which means that two records may have a matching common record but are not considered matches themselves. Because of this, the order in which the records are processed can affect the groupings. For more information, see Dedup and Transitivity.

Dedup and Transitivity

Because the Dedup algorithm takes many data elements into account when matching records, it is not fully transitive. Transitivity means that if A equals B and B equals C, then A must equal C. The algorithm may not create a single group for records A, B, and C even if the transitive rule of equality applies to the records. For example, transitivity is not supported in the following scenario:

Record B is added to the database and no match is found.
Record C is added to the database and none of its keys match record B.
Record A is added to the database and one of its keys matches a key in record B first. Because a match was found, the system stops processing the record and returns the following groups even if record A also matches record C: AB and C.

The Dedup Analysis Tool checks whether two records match based on the records' own keys, not on what is actually matched and deduped in the database. In the above scenario, it shows that record A equals record B, record A equals record C, and record B is not equal to record C.
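The scenario above can be reproduced with a small Python simulation. It models only the first-match behavior described in this section: each record has an ordered list of keys, and processing stops at the first key that is already stored. The function name, key names, and data structures are hypothetical.

```python
def add_record(name, keys, key_index, groups):
    """Add a record to the first group that one of its keys matches."""
    for key in keys:
        if key in key_index:
            group = key_index[key]  # first match wins; stop looking
            break
    else:
        group = len(groups)         # no key matched: open a new group
        groups.append([])
    groups[group].append(name)
    for key in keys:
        key_index[key] = group      # all of the record's keys point to its group
    return group


key_index, groups = {}, []
add_record("B", ["k1"], key_index, groups)        # no match: new group
add_record("C", ["k2"], key_index, groups)        # no key shared with B: new group
add_record("A", ["k1", "k2"], key_index, groups)  # matches B first, never joins C
print(groups)  # [['B', 'A'], ['C']]
```

Processing A first would instead put all three records in one group, which shows how the order of processing affects the result.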

Suppressing FRBR/Dedup Groups

You can use the following methods to prevent records from being deduped or FRBRized:

Define suppression rules, which are based on types of records (such as images) and are applied during re-indexing when new records are loaded or when an existing record's metadata is modified. For more details, see Suppressing Groups of Records from Dedup and FRBR. If you want to suppress existing records, see the next option.
Run the Prevent FRBR and/or Dedup in Discovery job on a set of bibliographic records (Admin > Manage Jobs and Sets > Run a Job). For more information, see Running Manual Jobs on Defined Sets.

If you want to prevent the display of FRBR and Dedup groups for all records without changing any rules or running any jobs, you can disable FRBR/Dedup by configuring the Enable Dedup and Enable FRBR fields on the Brief Results tab on the View Configuration page (Configuration Menu > Discovery > Display Configuration > Configure Views). For more information, see Brief Results Tab.

Recalculating Dedup and FRBR Groups

FRBR and Dedup keys, which are built initially at record creation and/or migration, may need to be recalculated after you modify a bibliographic record. For information on how to recalculate your keys, see Recalculate Dedup and FRBR Groups.

Dedup and FRBR groups may not be recalculated after the following occur:

A bibliographic record is re-indexed using the Recalculate Local Resource Types job.
The semi-annual re-indexing process has completed.