Duplicate Detection Vector
If you are working with Primo VE and not Primo, see Understanding the Dedup and FRBR Processes (Primo VE).
The dedup vector includes the following:
- Type (T). The type defines the matching rules that are used. Primo currently supports the following types:
  - Non-serials (T=1): for all records that are neither serials nor articles (refer to The Non-Serials Vector and Algorithm).
  - Serials (T=2): for serial records (refer to The Serials Vector and Algorithm). These rules are based on the matching algorithms developed together with the California Digital Library (CDL).
  - Articles (T=3): for articles (refer to Deduplication Algorithm for Articles).

  To skip duplicate detection for individual records (such as analytic records for Aleph pipes), set this field to 99 in the normalization rules.

  The Serials and Non-serials duplicate detection algorithms have two phases: Candidate Selection and Record Matching. The Articles duplicate detection algorithm has only a match phase.
- Candidate Fields (C1-C10). The Candidate Selection phase locates up to a set number of potential records for matching. This section of the vector is indexed in the persistence layer, and the indexes are used to locate candidates.
- Matching Fields (F1-F20). During the Record Matching phase, fields from the Matching Fields section are compared. Fields that match are assigned weight points, as determined by the rules used. Records that cross the threshold are considered duplicates and are assigned the MatchID of the matching record.
All of the fields in the vector should be normalized. Normalization routines may be different for different sources.
The following sections describe the various vectors and matching algorithms.
The Serials Vector and Algorithm
The following types of vectors exist for serials:
- Candidate
- Matching
For information on how the MARC fields are mapped into Primo, see Generic MARC 21 Normalization Rules.
Serials Candidate Vector
Field ID | Field Content | Note |
---|---|---|
C1 | UnivID, UnivID_invalid | This is a unique universal ID (for example, LCCN). |
C2 | ISSN, ISSN_invalid, ISSN_cancelled | |
C3 | Short Title | |
C4 | Place of Publication | Only the first occurrence is used. |
C5 | Single match ID | Intended for the Alma MMS ID or another ID that is reliable enough to serve as the sole basis for the match. |
In the Candidate phase of the algorithm, the first three candidate fields (C1, C2, and C3) are combined with an OR operator. If many candidates are located, the fourth candidate field (C4) is added with an AND operator.
If there is a match on C5, the records are considered a match and will not continue to the matching stage, which is based on the other metadata elements.
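The candidate logic described above can be sketched roughly as follows. This is only an illustration, not Primo's implementation: the record layout (dictionaries whose `c1` to `c4` values are lists of normalized strings), the function name, and the `max_candidates` cut-off are all assumptions, since the source only says that C4 is added "if many candidates are located".

```python
def select_candidates(record, pool, max_candidates=50):
    """Return (single_id_matches, candidates) for one incoming serial record.

    `max_candidates` is a hypothetical cut-off; the source gives no number.
    """
    # C5: a shared single match ID is accepted as a match without further scoring.
    single = [r for r in pool if record.get("c5") and record["c5"] == r.get("c5")]
    if single:
        return single, []

    def shares(field, other):
        # True when both records carry the field and have a value in common.
        a, b = record.get(field), other.get(field)
        return bool(a) and bool(b) and bool(set(a) & set(b))

    # C1 OR C2 OR C3
    candidates = [r for r in pool if any(shares(f, r) for f in ("c1", "c2", "c3"))]

    # Too many candidates: narrow the set with AND C4.
    if len(candidates) > max_candidates:
        candidates = [r for r in candidates if shares("c4", r)]
    return single, candidates
```

Records returned in the first list would bypass the matching stage, while records in the second list would go on to be scored.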
Serials Matching Vector
Field ID | Field Content | Note |
---|---|---|
F1 | UnivID | |
F2 | UnivID_invalid | Multiple occurrences are delimited by a semicolon. |
F3 | ISSN | Multiple occurrences are delimited by a semicolon. |
F4 | ISSN_invalid | Multiple occurrences are delimited by a semicolon. |
F5 | ISSN_cancelled | Multiple occurrences are delimited by a semicolon. |
F6 | Start publication year | |
F7 | Full title | |
F8 | Brief title | Remove subtitle and any additional information. |
F9 | Country of publication | |
F10 | Place of publication | |
F11 | Main entry (author, corporate body, meeting) | |
The matching takes place in two stages, quick and full.
The quick match compares the following fields:
- Single match ID
- UnivID/UnivID_invalid
- ISSN/ISSN_invalid/ISSN_cancelled
- Full title
The full match compares all fields in the vector.
The following table lists the default weights for quick and full matches for serials. If 800 points are reached in the quick-match stage, the records are considered a match. If not, the record proceeds to the full-match stage, which checks all fields. As in the quick-match stage, if 800 points are reached, the records are considered a match.
In both the quick-match and full-match stages, the weight from the UnivID and ISSN matches is compared, and the higher of the two weights, not the sum, is assigned to the record.
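As a rough sketch of the two-stage decision described above: the 800-point threshold and the "higher of the two weights" rule come from the text, while the function name and the way weights are passed in are assumptions made for illustration. "Reached" is read here as greater than or equal to 800.

```python
MATCH_THRESHOLD = 800  # threshold stated for both the quick and full stages


def serials_match_stage(univid_weight, issn_weight, title_weight, other_full_weights):
    """Return "quick", "full", or None for a pair of serial records.

    The weights are assumed to have been looked up already in the weights table.
    """
    # Only the higher of the UnivID and ISSN weights counts, not their sum.
    id_weight = max(univid_weight, issn_weight)

    quick_total = id_weight + title_weight
    if quick_total >= MATCH_THRESHOLD:
        return "quick"

    # The full stage adds the remaining fields (date, country, place, main entry).
    full_total = quick_total + sum(other_full_weights)
    if full_total >= MATCH_THRESHOLD:
        return "full"
    return None
```

For example, a UnivID match (200), an ISSN match (200), and an exact title match (600) total 800 rather than 1000, which is just enough for a quick match.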
Field ID | Fields for Comparison | Result | Points |
---|---|---|---|
F1/F2 | UnivID/UnivID_invalid | Match on UnivID | 200 |
F1/F2 | UnivID/UnivID_invalid | Match on UnivID_invalid | 50 |
F1/F2 | UnivID/UnivID_invalid | Match between UnivID and UnivID_invalid | 100 |
F1/F2 | UnivID/UnivID_invalid | No match on UnivID | -470 |
F1/F2 | UnivID/UnivID_invalid | No match between UnivID and UnivID_invalid | -50 |
F1/F2 | UnivID/UnivID_invalid | No match on UnivID_invalid | 0 |
F1/F2 | UnivID/UnivID_invalid | Either or both records missing field | 0 |
F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | Match on ISSN | 200 |
F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | Match on ISSN_invalid | 50 |
F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | Match on ISSN_cancelled | 10 |
F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | Match between ISSN and ISSN_invalid | 100 |
F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | Match between ISSN and ISSN_cancelled | 50 |
F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | Match between ISSN_invalid and ISSN_cancelled | 30 |
F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | No match on ISSN | -250 |
F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | No match on ISSN_invalid and ISSN_cancelled | 0 |
F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | Either record or both records missing field | 0 |
F7 | Full Title | Exact match on title and title NOT in table of common titles | 600 |
F7 | Full Title | Exact match on title and title IS in table of common titles | 135 |
F7 | Full Title | Match on truncated title and truncated title in the list of common titles | 135 |
F7 | Full Title | Match on truncated title and truncated title not in the list of common titles | 175 |
F7 | Full Title | No match | -600 |
F7 | Full Title | Calculate weight based on percentage of keywords from title that match x 75 | * |
F7 | Full Title | Calculate weight based on percentage of keywords from title that match x 75 + 50 | * |
F6 | Date | Exact match | 225 |
F6 | Date | +/- 1 year | 50 |
F6 | Date | +/- 2 years | 25 |
F6 | Date | If first three digits match, check the 4th digit and if either record has a 0 | 20 |
F6 | Date | No match | -150 |
F6 | Date | The value is missing from either or both records | 0 |
F9 | Country of Publication | Match | 40 |
F9 | Country of Publication | No match | -20 |
F9 | Country of Publication | Either record or both records missing the value | 0 |
F10 | Place of Publication | Exact match on normalized place of publication | 200 |
F10 | Place of Publication | Either or both records are missing the subfield | 0 |
F10 | Place of Publication | No match on normalized place of publication | -100 |
F11 | Main Entry | If the normalized contents of the fields match, then it is considered a full match even if the data was found in different fields | 200 |
F11 | Main Entry | If one or both main entries are missing | 0 |
F11 | Main Entry | If more than 60% of the keywords from main entry fields match and are in the same order | 75 times the percentage of words that match plus 25 |
F11 | Main Entry | If more than 60% of the keywords from main entry fields match but are not in the same order | 75 times the percentage of words that match |
F11 | Main Entry | If 60% or less of the keywords in main entry fields match | -250 |
The Non-Serials Vector and Algorithm
The following types of vectors exist for non-serials:
- Candidate
- Matching
For information on how the MARC fields are mapped into Primo, see Generic MARC 21 Normalization Rules.
Non-Serials Candidate Vector
Field ID | Field Content | Note |
---|---|---|
C1 | UnivID and UnivID_invalid | A unique universal ID (for example, LCCN). |
C2 | ISBN, ISBN_invalid | Multiple occurrences are delimited by a semicolon. |
C3 | Short title | The first 25 characters of the normalized title. |
C4 | Year | |
C5 | Single match ID | Intended for the Alma MMS ID or another ID that is reliable enough to serve as the sole basis for the match. |
Non-Serials Matching Vector
Field ID | Field Content | Note |
---|---|---|
F1 | UnivID | |
F2 | UnivID_invalid | Multiple occurrences are delimited by a semicolon. |
F3 | ISBN | Multiple occurrences are delimited by a semicolon. |
F4 | ISBN_invalid | Multiple occurrences are delimited by a semicolon. |
F5 | Short title | The first 25 characters of the normalized title. |
F6 | Year | |
F7 | Full title | |
F8 | Country of publication | |
F9 | Pagination | The highest number in the pagination field should be used. |
F10 | Publisher | |
F11 | Main entry (author, corporate body, meeting) | |
- Single match ID
- UnivID/UnivID_invalid
- ISBN/ISBN_invalid
- Short title
- Year
Fields for Comparison | Result | Points |
---|---|---|
UnivID/UnivID_invalid* | Match on valid UnivID | 200 |
UnivID/UnivID_invalid* | Match on invalid UnivID | 50 |
UnivID/UnivID_invalid* | Match between valid and invalid | 100 |
UnivID/UnivID_invalid* | Field present in both records but no match | -320 |
UnivID/UnivID_invalid* | Either record or both records missing | 0 |
ISBN/ISBN_invalid* | Match between valid ISBN | 85 |
ISBN/ISBN_invalid* | Match between invalid ISBN | 10 |
ISBN/ISBN_invalid* | Match between valid and invalid | 30 |
ISBN/ISBN_invalid* | Field present in both records but no match | -225 |
ISBN/ISBN_invalid* | Either record or both records missing | 0 |
Date | Exact match | 200 |
Date | +/- 2 years | -25 |
Date | No match | -250 |
Date | Value missing | 0 |
Short-Title | Exact match on first 25 characters | 450 |
Short-Title | Non-match | 0 |
Full-Title | Exact match | 600 |
Full-Title | Either title contained within the other title | 350 |
Full-Title | Either title shorter than nine characters | 0 |
Full-Title | Matching keywords | 450 x (% of matching words) |
Full-Title | Matching keywords in order | 450 x (% + 50) |
Full-Title | Non-match | -600 |
Country of Publication | Exact match | 40 |
Country of Publication | Either one missing | 0 |
Country of Publication | Non-match | -205 |
Pagination | Exact match, and the value is greater than 10 | 100 |
Pagination | Exact match, and the value is less than or equal to 10 | 50 |
Pagination | Values differ by 1-10 pages, and both values are greater than 10 | 50 |
Pagination | Values differ by 1-10 pages, and either value is less than or equal to 10 | 20 |
Pagination | Non-match (values differ by more than 10 pages) | -225 |
Publisher | Exact match | 100 |
Publisher | Either missing | 0 |
Publisher | Occur within the other | 100 |
Publisher | Non-match | -25 |
Main Entry | Exact match | 125 |
Main Entry | Both main entries missing | 75 |
Main Entry | Half (or more) of the main entry keywords are common and in the same order | % common keywords x 80 + 10 |
Main Entry | Half (or more) of the main entry keywords are common, but are not in the same order | % common keywords x 80 |
Main Entry | Present in one record but missing in the other | -25 (There is a known issue with the weight of F11 fields. Primo currently gives +25 points to one missing main entry instead of -25.) |
Main Entry | Non-match | -200 |
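To make one row group of the weights table concrete, the Pagination rules could be expressed as below. This is a sketch under assumptions: the function names are invented, the regular expression is only one way to apply the note that the highest number in the pagination field is used, and returning 0 when a value is missing is an assumption because the table does not list that case for pagination.

```python
import re


def highest_page_number(pagination):
    """Return the largest number in a pagination string such as 'x, 494 p. :'."""
    numbers = [int(n) for n in re.findall(r"\d+", pagination or "")]
    return max(numbers) if numbers else None


def pagination_weight(pagination_a, pagination_b):
    """Weight for the Pagination comparison in the non-serials table."""
    pages_a = highest_page_number(pagination_a)
    pages_b = highest_page_number(pagination_b)
    if pages_a is None or pages_b is None:
        return 0  # assumption: the table gives no weight for a missing value

    difference = abs(pages_a - pages_b)
    if difference == 0:
        return 100 if pages_a > 10 else 50
    if difference <= 10:
        # "either value is less than or equal to 10" lowers the weight
        return 50 if min(pages_a, pages_b) > 10 else 20
    return -225
```

With this reading, `pagination_weight("x, 494 p. :", "494 p.")` yields 100 and `pagination_weight("494 p.", "300 p.")` yields -225.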
Deduplication Algorithm for Articles
- ISSN, DOI, or normalized journal title
- Start page, author, or author last name
- Publication year, issue, or part
- Normalized article title
For information on how the MARC fields are mapped into Primo, see Generic MARC 21 Normalization Rules.
Field ID | Field Content | Note |
---|---|---|
T | 3 OR 99 | Use type 99 for records that do not include all required data elements. To do this, first create rules that assign type 99 to records that are missing one of the element groups listed for C1 from the addata section, with a separate rule for each group of elements. All other records should get type 3. |
C1 | The match key created from the following elements as a single string: (ISSN, DOI, or Journal title) + (StartPage, author, or author last name) + (PublicationYear, issue, or part) + (ArticleTitle) | |
F1 | The match key created from the following elements as a single string: (ISSN, DOI, or Journal title) + (StartPage, author, or author last name) + (PublicationYear, issue, or part) + (ArticleTitle) | |
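The type assignment and match key described in the table might be sketched like this. The addata field names (`issn`, `doi`, `jtitle`, `spage`, `au`, `aulast`, `date`, `issue`, `part`, `atitle`) and the "first non-empty value in the listed order" rule are assumptions made for illustration; in Primo this logic is configured through normalization rules rather than code.

```python
# Hypothetical addata field names for each group of elements in the match key.
GROUPS = [
    ("issn", "doi", "jtitle"),
    ("spage", "au", "aulast"),
    ("date", "issue", "part"),
    ("atitle",),
]


def first_present(addata, names):
    """Return the first non-empty value among the named fields, or None."""
    for name in names:
        value = (addata.get(name) or "").strip()
        if value:
            return value
    return None


def dedup_type_and_key(addata):
    """Return (type, match_key): type 3 with a key, or type 99 without one."""
    parts = [first_present(addata, group) for group in GROUPS]
    if any(part is None for part in parts):
        return 99, None  # a required group is missing: skip duplicate detection
    return 3, "".join(parts)
```

A record that lacks every field in any one group is given type 99 and is therefore left out of duplicate detection.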
- Replace the following characters with a space: !@#$%^&*()_+-={}}[]:";<>?,./~`
- Remove all blank characters.
- Save the last 25 characters of the title.
- Change the characters to lowercase characters.
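The four steps above can be sketched as a small function, assuming they are applied to the article title in the order listed. The function and constant names are illustrative only.

```python
# Characters that the first step replaces with a space (copied from the list above).
PUNCTUATION = "!@#$%^&*()_+-={}}[]:\";<>?,./~`"


def normalize_article_title(title):
    """Apply the four normalization steps, in the order listed, to a title."""
    # 1. Replace each listed character with a space.
    table = str.maketrans({ch: " " for ch in PUNCTUATION})
    spaced = title.translate(table)
    # 2. Remove all blank (whitespace) characters.
    compact = "".join(spaced.split())
    # 3. Keep the last 25 characters, then 4. convert to lowercase.
    return compact[-25:].lower()
```

For a short title such as "Deep Learning: A Review!" the whole compacted string is kept, giving `deeplearningareview`.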
The Deduped-Merged Record
- Control: most fields are merged
- Display: after the source and availlibrary fields are merged, the other fields are taken from the preferred record
- Links: all fields are merged and duplicate fields are removed
- Search: all fields are merged and duplicate fields are removed
- Sort: only fields from the preferred record are taken
- Facets: all fields are merged and duplicate fields are removed
- Dedup: not relevant
- FRBR: all fields are merged and duplicate fields are removed
- Delivery: all fields are merged
- Ranking: the highest value is taken from all records
- Enrichment: not relevant
- Additional data: all fields are merged and duplicate fields are removed
- Local fields: all local fields are included
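Three of the merge behaviours listed above (merge with duplicates removed, take from the preferred record only, and take the highest value) can be illustrated as follows. This is a simplified sketch: the record layout (a dictionary of sections, each mapping a field name to a list of values), the function names, and the convention that the first record in the list is the preferred one are all assumptions.

```python
def merge_unique(records, section):
    """Merge a section across records, dropping duplicate values (e.g. Search)."""
    merged = {}
    for record in records:
        for field, values in record.get(section, {}).items():
            bucket = merged.setdefault(field, [])
            for value in values:
                if value not in bucket:
                    bucket.append(value)
    return merged


def take_preferred(records, section):
    """Take a section from the preferred record only (e.g. Sort).

    Assumes the preferred record is first in the list.
    """
    return dict(records[0].get(section, {}))


def highest_ranking(records):
    """Keep the highest value of each ranking field across all records."""
    merged = {}
    for record in records:
        for field, value in record.get("ranking", {}).items():
            merged[field] = max(merged.get(field, value), value)
    return merged
```

Sections such as Display combine these behaviours, merging some fields and taking the rest from the preferred record.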
<record>
<control>
<sourceformat>MARC21</sourceformat>
<sourcesystem>$$VILS$$OBBI004876460</sourcesystem>
<sourcesystem>$$VILS$$OBBI004550753</sourcesystem>
<recordid>dedupmrg2284018</recordid>
<originalsourceid>$$VPRM01$$OBBI004876460</originalsourceid>
<originalsourceid>$$VPRM01$$OBBI004550753</originalsourceid>
<sourceid>$$VBBI$$OBBI004876460</sourceid>
<sourceid>$$VBBI$$OBBI004550753</sourceid>
<sourcerecordid>$$V004876460$$OBBI004876460</sourcerecordid>
<sourcerecordid>$$V004550753$$OBBI004550753</sourcerecordid>
</control>
<display>
<type>book</type>
<title>Language development and learning to read the scientific study of how language development affects reading skill</title>
<creator>Diane McGuinness</creator>
<contributor>NetLibrary, Inc.</contributor>
<publisher>Cambridge, Mass. : MIT Press</publisher>
<creationdate>c2005</creationdate>
<format>x, 494 p. : ill. ; 24 cm..</format>
<identifier>$$CISBN$$V142372612X (electronic bk.)</identifier>
<subject>Reading -- Research; Language acquisition -- Research; Electronic books</subject>
<language>eng</language>
<source>$$VBBI$$OBBI004876460</source>
<source>$$VBBI$$OBBI004550753</source>
<availlibrary>$$INORTH$$LNINTE$$Savailable$$33$$40$$5N$$60$$OBBI004876460</availlibrary>
<availlibrary>$$ISOUTH$$LKINTE$$1Internet$$Scheck_holdings$$OBBI004876460</availlibrary>
<availlibrary>$$ISOUTH$$LLINTE$$1Book$$Scheck_holdings$$OBBI004876460</availlibrary>
<availlibrary>$$INORTH$$LNWILS$$1General collection$$2(LB1050.6 .M34 2005 )$$Savailable$$31$$40$$5N$$60$$OBBI004550753</availlibrary>
<availinstitution>$$INORTH$$Savailable</availinstitution>
<availinstitution>$$ISOUTH$$Scheck_holdings</availinstitution>
<availpnx>available</availpnx>
</display>
<links>
<linktotoc>$$Tamazon_toc$$DTable of Contents$$OBBI004876460</linktotoc>
<linktoabstract>$$Tsyndetics_abstract$$DAbstract$$OBBI004876460</linktoabstract>
<linktouc>$$Tamazon_uc$$DThis item in Amazon.com$$OBBI004876460</linktouc>
<linktouc>$$Tworldcat_isbn$$DThis item in WorldCat®$$OBBI004876460</linktouc>
<linktoexcerpt>$$Tsyndetics_excerpt$$DExcerpt from item$$OBBI004876460</linktoexcerpt>
<openurl>$$Topenurl_journal$$OBBI004876460</openurl>
<openurlfulltext>$$Topenurlfull_journal$$OBBI004876460</openurlfulltext>
<linktoholdings>$$V$$TILS_holdings$$OBBI004876460</linktoholdings>
<linktoholdings>$$V$$TILS_holdings$$OBBI004550753</linktoholdings>
<backlink>$$V$$TILS_backlink$$DThis item in the Library Catalog$$OBBI004876460</backlink>
<backlink>$$V$$TILS_backlink$$DThis item in the Library Catalog$$OBBI004550753</backlink>
<linktorsrc>$$V$$Uhttp://www.netLibrary.com/summary.asp?id=138523$$Dfor Primo University Crookston access$$OBBI004876460</linktorsrc>
<linktorsrc>$$V$$Uhttps://www.lib.umn.edu/slog.phtml?url=http://www.netLibrary.com/summary.asp?id=138523$$DNorth Campus access$$OBBI004876460</linktorsrc>
<linktorsrc>$$V$$Uhttp://www.netLibrary.com/summary.asp?id=138523$$DNorth Campus access$$OBBI004876460</linktorsrc>
</links>
<search>
<creatorcontrib>NetLibrary, Inc.</creatorcontrib>
<creatorcontrib>Net Library, Inc</creatorcontrib>
<title>Language development and learning to read the scientific study of how language development affects reading skill /</title>
<subject>Electronic books.</subject>
<general>[electronic resource] :</general>
<isbn>142372612X</isbn>
<recordid>BBI004876460</recordid>
<searchscope>SOUTH</searchscope>
<scope>SOUTH</scope>
<creatorcontrib>Diane McGuinness</creatorcontrib>
<creatorcontrib>McGuinness, D</creatorcontrib>
<creatorcontrib>Diane McGuinness.</creatorcontrib>
<title>Language development and learning to read : the scientific study of how language development affects reading skill /</title>
<subject>Reading Research.</subject>
<subject>Language acquisition Research.</subject>
<general>MIT Press,</general>
<isbn>0262134527</isbn>
<creationdate>2005</creationdate>
<sourceid>BBI</sourceid>
<recordid>BBI004550753</recordid>
<rsrctype>book</rsrctype>
<searchscope>NORTH</searchscope>
<searchscope>BBI</searchscope>
<scope>NORTH</scope>
<scope>BBI</scope>
</search>
<sort>
<creationdate>2005</creationdate>
</sort>
<facets>
<collection>NINTE</collection>
<collection>KINTE</collection>
<collection>LINTE</collection>
<toplevel>online_resources</toplevel>
<creatorcontrib>NetLibrary, Inc</creatorcontrib>
<genre>Electronic books</genre>
<language>eng</language>
<creationdate>2005</creationdate>
<topic>Reading-Research</topic>
<topic>Language acquisition-Research</topic>
<collection>NWILS</collection>
<toplevel>available</toplevel>
<creatorcontrib>McGuinness, D</creatorcontrib>
<prefilter>books</prefilter>
<rsrctype>books</rsrctype>
<classificationlcc>L - Education.-Theory and practice of education-Teaching (Principles and practice)-Reading (General)</classificationlcc>
</facets>
<dedup>
<t>1</t>
<c2>142372612X</c2>
<c3>languagedevelopmentaadingskill</c3>
<c4>2005</c4>
<f3>142372612X</f3>
<f5>languagedevelopmentaadingskill</f5>
<f6>2005</f6>
<f7>language development and learning to read the scientific study of how language development affects reading skill</f7>
<f8>mau</f8>
<f9>x, 494 p. :</f9>
<f10>mit press</f10>
<f11>mcguinness diane</f11>
</dedup>
<frbr>
<t>1</t>
<k1>$$Kmcguinness diane$$AA</k1>
<k3>$$Klanguage development and learning to read the scientific study of how language development affects reading skill$$AT</k3>
</frbr>
<delivery>
<institution>$$VNORTH$$OBBI004876460</institution>
<institution>$$VSOUTH$$OBBI004876460</institution>
<delcategory>$$VOnline Resource$$OBBI004876460</delcategory>
<institution>$$VNORTH$$OBBI004550753</institution>
<delcategory>$$VPhysical Item$$OBBI004550753</delcategory>
</delivery>
<enrichment>
<classificationlcc>LB1050.6</classificationlcc>
</enrichment>
<ranking>
<booster1>1</booster1>
<booster2>1</booster2>
</ranking>
<addata>
<addau>NetLibrary, Inc</addau>
<eissn>0262134527 0765805723</eissn>
<isbn>142372612X</isbn>
<oclcid>61704190</oclcid>
<btitle>Language development and learning to read the scientific study of how language development affects reading skill</btitle>
<aulast>McGuinness</aulast>
<aufirst>Diane</aufirst>
<au>McGuinness, Diane</au>
<date>2005</date>
<risdate>c2005.</risdate>
<isbn>0262134527</isbn>
<format>book</format>
<ristype>BOOK</ristype>
<notes>Includes bibliographical references (p. [447]-477) and indexes.</notes>
<cop>Cambridge, Mass.</cop>
<pub>MIT Press</pub>
<lccn>2004062118</lccn>
<btitle>Language development and learning to read : the scientific study of how language development affects reading skill</btitle>
<genre>book</genre>
</addata>
</record>