Duplicate Detection Vector


If you are working with Primo VE and not Primo, see Understanding the Dedup and FRBR Processes (Primo VE).

The dedup vector includes the following:

Type (T). The type defines the matching rules that will be used. Currently Primo allows the following types:
- Non-serials (T=1)—for all other records (refer to The Non-Serials Vector and Algorithm).
- Serials (T=2)—for serial records (refer to The Serials Vector and Algorithm).
  
  These rules are based on the matching algorithms developed together with the California Digital Library (CDL).
- Articles (T=3)—for articles (refer to Deduplication Algorithm for Articles).
  
  If you would like to skip duplicate detection for individual records (such as analytic records for Aleph pipes), you can set this field to 99 in the normalization rules.

The Serials and Non-serials duplicate detection algorithms have two phases: Candidate Selection and Record Matching. The Articles duplicate detection algorithm has only a match phase.

Candidate Fields (C1-C10)—The Candidate Selection phase locates up to a set number of potential records for matching. This section in the vector is indexed in the persistence layer. The indexes are used to locate candidates.
Matching Fields (F1-F20)—During the record matching phase, fields from the Matching Fields section are compared. Fields that match are assigned weight points, as determined by the rules used. Records that cross the threshold are considered duplicates and are assigned the MatchID of the matching record.

All of the fields in the vector should be normalized. Normalization routines may be different for different sources.

The following sections describe the various vectors and matching algorithms.

Dedup vectors and keys are limited to 4000 bytes. If this limit is reached, you may receive an SQL exception error on the P_DEDUP_VECTOR table.

For more details, see Harvesting a record fails with an UncategorizedSQLException error on the P_DEDUP_VECTOR table.

The Serials Vector and Algorithm

The following types of vectors exist for serials:

Candidate
Matching

For information on how the MARC fields are mapped into Primo, see Generic MARC 21 Normalization Rules.

Serials Candidate Vector

The following table describes the fields in the Candidate vector.

Serials Candidate Vector Fields
Field ID	Field Content	Note
C1	UnivID, UnivID_invalid	This is a unique universal ID (for example, LCCN).
C2	ISSN, ISSN_invalid, ISSN_cancelled
C3	Short Title
C4	Place of Publication	Only the first occurrence is used.
C5	Single match ID	Intended for the Alma MMS ID or another ID that is reliable enough to serve as the sole basis for the match.

In the Candidate phase of the algorithm, the candidate fields C1, C2, and C3 are combined with an OR operator. The fourth candidate field (C4) is added, with an AND operator, if many candidates are located.

If there is a match on C5, the records are considered a match and will not continue to the matching stage, which is based on the other metadata elements.
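The candidate logic described above can be sketched in Python as follows. This is only an illustration, not Primo's implementation: it scans a list in memory, whereas Primo locates candidates through indexes in the persistence layer. The `Record` shape, the function names, and the `many` cutoff of 150 are assumptions made for the example (the serials description says only "many").

```python
from typing import Dict, List, Set

# Hypothetical record shape: candidate field ID -> set of normalized values.
Record = Dict[str, Set[str]]


def shares_value(a: Record, b: Record, field: str) -> bool:
    """True if both records hold at least one common value in the field."""
    return bool(a.get(field, set()) & b.get(field, set()))


def select_candidates(record: Record, pool: List[Record], many: int = 150) -> List[Record]:
    """Keep records that agree on C1 OR C2 OR C3; add C4 with AND if too many."""
    candidates = [
        other for other in pool
        if any(shares_value(record, other, f) for f in ("C1", "C2", "C3"))
    ]
    # Assumed cutoff: narrow the list with C4 only when it is large.
    if len(candidates) > many:
        candidates = [c for c in candidates if shares_value(record, c, "C4")]
    return candidates


def is_single_id_match(a: Record, b: Record) -> bool:
    """A shared C5 value is treated as a match without further scoring."""
    return shares_value(a, b, "C5")
```

For example, a record with ISSN `1234-5678` and short title `nature` selects any pool record that shares either value. C4 (place of publication) is consulted only when the resulting list exceeds the cutoff.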

Serials Matching Vector

The following table describes the fields in the Matching vector.

Serials Matching Vector Fields
Field ID	Field Content	Note
F1	UnivID
F2	UnivID_invalid	Multiple occurrences are delimited by a semicolon.
F3	ISSN	Multiple occurrences are delimited by a semicolon.
F4	ISSN_invalid	Multiple occurrences are delimited by a semicolon.
F5	ISSN_cancelled	Multiple occurrences are delimited by a semicolon.
F6	Start publication year
F7	Full title
F8	Brief title	Remove subtitle and any additional information.
F9	Country of publication
F10	Place of publication
F11	Main entry (author, corporate body, meeting)

The matching takes place in two stages, quick and full.

The quick match compares the following fields:

Single match ID
UnivID/UnivID_invalid
ISSN/ISSN_invalid/ISSN_cancelled
Full title

The full match compares all fields in the vector.

The following table lists the default weights for quick and full matches for serials. If 800 points are reached in the quick-match stage, the records are considered a match. If not, the record proceeds to the full-match stage, which checks all fields. As in the quick-match stage, if 800 points are reached, the records are considered a match.

In both the quick-match and full-match stages, the weight from the UnivID and ISSN matches is compared, and the higher of the two weights, not the sum, is assigned to the record.

For every group, only the highest weight is assigned.

Default Weights for Quick and Full Matches For Serials
Field ID	Fields for Comparison	Result	Points
F1/F2	UnivID/UnivID_invalid	Match on UnivID	200
		Match on UnivID_invalid	50
		Match between UnivID and UnivID_invalid	100
		No match on UnivID	-470
		No match between UnivID and UnivID_invalid	-50
		No match on UnivID_invalid	0
		Either or both records missing field	0
F3/F4/F5	ISSN/ISSN_invalid/ISSN_cancelled	Match on ISSN	200
		Match on ISSN_invalid	50
		Match on ISSN_cancelled	10
		Match between ISSN and ISSN_invalid	100
		Match between ISSN and ISSN_cancelled	50
		Match between ISSN_invalid and ISSN_cancelled	30
		No match on ISSN	-250
		No match ISSN_invalid and ISSN_cancelled	0
		Either record or both records missing field	0
F7	Full Title	Exact match on title and title NOT in table of common titles	600
		Exact match on title and title IS in table of common titles	135
		Match on truncated title and truncated title in the list of common titles	135
		Match on truncated title and truncated title not in the list of common titles	175
		No match	-600
		Calculate weight based on percentage of keywords from title that match x 75	*
		Calculate weight based on percentage of keywords from title that match x 75 + 50	*
F6	Date	Exact match	225
		+/- 1 year	50
		+/- 2 years	25
		First three digits match and the fourth digit is 0 in either record	20
		No match	-150
		The value is missing from either or both records	0
F9	Country of Publication	Match	40
		No match	-20
		Either record or both records missing the value	0
F10	Place of Publication	Exact match on normalized place of publication	200
		Either or both records are missing the subfield	0
		No match on normalized place of publication	-100
F11	Main Entry	If the normalized contents of the fields match, then it is considered a full match even if the data was found in different fields	200
		If one or both main entries are missing	0
		If more than 60% of the keywords from main entry fields match and are in the same order	75 times the percentage of words that match plus 25
		If more than 60% of the keywords from main entry fields match but are not in the same order	75 times the percentage of words that match
		If 60% or less of the keywords in main entry fields match	-250
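As an illustration of how one group in the table resolves to a single weight, the F6 (Date) rows can be sketched in Python. This is a sketch under stated assumptions, not Primo's code: years are assumed to be four-digit strings, "+/- 2 years" is read as exactly two years apart, and when more than one row applies the highest weight is taken, following the rule that only the highest weight is assigned per group. The function name is hypothetical.

```python
from typing import Optional


def serial_date_weight(year_a: Optional[str], year_b: Optional[str]) -> int:
    """Weight for the start publication year (F6) of two serial records."""
    if not year_a or not year_b:
        return 0  # value missing from either or both records
    if year_a == year_b:
        return 225  # exact match

    applicable = []
    if year_a.isdigit() and year_b.isdigit():
        diff = abs(int(year_a) - int(year_b))
        if diff == 1:
            applicable.append(50)
        elif diff == 2:
            applicable.append(25)
    # Same first three digits, and a 0 as the fourth digit in either record.
    if (len(year_a) == 4 and len(year_b) == 4
            and year_a[:3] == year_b[:3]
            and "0" in (year_a[3], year_b[3])):
        applicable.append(20)

    # Assumption: take the highest applicable weight, otherwise "No match".
    return max(applicable) if applicable else -150
```

Under these assumptions, `1990` against `1991` scores 50 (one year apart outranks the same-decade rule), `1990` against `1995` scores 20, and `1991` against `1995` scores -150.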

The Non-Serials Vector and Algorithm

The following types of vectors exist for non-serials:

Candidate
Matching

For information on how the MARC fields are mapped into Primo, see Generic MARC 21 Normalization Rules.

Non-Serials Candidate Vector

The following table describes the fields in the Candidate vector.

Non-Serials Candidate Vector Fields
Field ID	Field Content	Note
C1	UnivID and UnivID_invalid	A unique universal ID (for example, LCCN)
C2	ISBN, ISBN_invalid	Multiple occurrences delimited by a semicolon.
C3	Short title	The first 25 characters of the normalized title.
C4	Year
C5	Single match ID	Intended for the Alma MMS ID or another ID that is reliable enough to serve as the sole basis for the match.

In the Candidate phase of the algorithm, the candidate fields C1, C2, and C3 are combined with an OR operator. The fourth candidate field (C4) is added, with an AND operator, only if too many candidates (more than 150) are located.

If there is a match on C5, the records are considered a match and will not continue to the matching stage, which is based on the other metadata elements.

Non-Serials Matching Vector

The following table describes the fields in the Matching vector.

Non-Serials Matching Vector Fields
Field ID	Field Content	Note
F1	UnivID
F2	UnivID_invalid	Multiple occurrences are delimited by a semicolon.
F3	ISBN	Multiple occurrences are delimited by a semicolon.
F4	ISBN_invalid	Multiple occurrences are delimited by a semicolon.
F5	Short title	The first 25 characters of the normalized title.
F6	Year
F7	Full title
F8	Country of publication
F9	Pagination	The highest number in the pagination field should be used.
F10	Publisher
F11	Main entry (author, corporate body, meeting)

The matching takes place in two stages: quick and full.

The quick match stage compares the following fields:

Single match ID
UnivID/UnivID_invalid
ISBN/ISBN_invalid
Short title
Year

If 850 points are reached, the records are considered a match. If not, the record proceeds to the full-match stage, which uses all fields, except that the full title is used instead of the short title. If 875 points are reached, the records are considered a match.

In both the quick-match and full-match stages, the weights from the UnivID and ISBN matches are compared, and the higher of the two weights, not the sum, is assigned to the record.

For every group, only the highest weight is assigned.
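The two-stage decision can be sketched in Python as shown below. The thresholds come from the text above. The function names and the reading of "reached" as greater than or equal are assumptions, and the point values are expected to be looked up from the default weights table for non-serials.

```python
from typing import Sequence

QUICK_THRESHOLD = 850  # quick-match threshold stated in the text
FULL_THRESHOLD = 875   # full-match threshold stated in the text


def stage_total(univid_points: int, isbn_points: int,
                other_points: Sequence[int]) -> int:
    """Sum one stage, counting only the higher of the UnivID and ISBN weights."""
    return max(univid_points, isbn_points) + sum(other_points)


def is_duplicate(univid_points: int, isbn_points: int,
                 quick_points: Sequence[int],
                 full_points: Sequence[int]) -> bool:
    """quick_points: short title and year; full_points: full title, year,
    country, pagination, publisher, and main entry."""
    if stage_total(univid_points, isbn_points, quick_points) >= QUICK_THRESHOLD:
        return True
    return stage_total(univid_points, isbn_points, full_points) >= FULL_THRESHOLD
```

For example, a valid ISBN match (85), a short-title match (450), and an exact date match (200) total 735, so the pair proceeds to the full stage. If both records also match on a valid UnivID (200), the higher ID weight replaces the 85 and the quick stage reaches 850.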

Default Weights for Quick and Full Matches For Non-Serials
Fields for Comparison	Result	Points
UnivID/UnivID_invalid*	Match on valid UnivID	200
	Match on invalid UnivID	50
	Match between valid and invalid	100
	Field present in both records but no match	-320
	Either record or both records missing	0
ISBN/ISBN_invalid*	Match between valid ISBN	85
	Match between invalid ISBN	10
	Match between valid and invalid	30
	Field present in both records but no match	-225
	Either record or both records missing	0
Date	Exact match	200
	+/- 2 years	-25
	No match	-250
	Value missing	0
Short-Title	Exact match on first 25 characters	450
Short-Title	Non match	0
Full-Title	Exact match	600
	Either title contained within the other title	350
	Either title shorter than nine characters	0
	Matching keywords	450 x (% of matching words)
	Matching keywords in order	450 x (% + 50)
	Non-match	-600
Country of Publication	Exact match	40
	Either one missing	0
	Non-match	-205
Pagination	Exact match, and the value is greater than 10	100
	Exact match, and the value is less than or equal to 10	50
	Values differ by 1-10 pages, and both values are greater than 10	50
	Values differ by 1-10 pages, and either value is less than or equal to 10	20
	Non-match (values differ by more than 10 pages)	-225
Publisher	Exact match	100
	Either missing	0
	Occur within the other	100
	Non-match	-25
Main Entry	Exact match	125
	Both main entries missing	75
	Half (or more) of the main entry keywords are common and in the same order	% common keywords x 80 + 10
	Half (or more) of the main entry keywords are common, but are not in the same order	% common keywords x 80
	Present in one record but missing in the other	-25 (Note: There is a known issue with the weight of F11 fields. Primo currently gives +25 points to one missing main entry instead of -25.)
	Non-match	-200
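The Pagination rows of the table, together with the note that the highest number in the pagination field is used, can be sketched in Python as follows. The function names are hypothetical. The table does not state a weight for a missing pagination value, so the sketch leaves that case to the caller.

```python
import re
from typing import Optional


def highest_page_number(pagination: str) -> Optional[int]:
    """Return the highest Arabic number in a pagination string, if any."""
    numbers = [int(n) for n in re.findall(r"\d+", pagination)]
    return max(numbers) if numbers else None


def pagination_weight(pages_a: int, pages_b: int) -> int:
    """Weight for two pagination values, following the table rows."""
    diff = abs(pages_a - pages_b)
    if diff == 0:
        return 100 if pages_a > 10 else 50
    if diff <= 10:
        # Both values above 10 score higher than when either is 10 or less.
        return 50 if min(pages_a, pages_b) > 10 else 20
    return -225  # values differ by more than 10 pages
```

For instance, `highest_page_number("x, 494 p. :")` yields 494, and two records with 494 and 500 pages score 50.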

Deduplication Algorithm for Articles

The deduplication algorithm for articles matches on a single key, which is used in both the candidate and matching phases: C1 holds the candidate key and F1 holds the match key. In addition to remote searches, the deduplication algorithm can be used for records that are harvested into the local Primo repository. In both cases, a single key is created from the following elements:

ISSN, DOI, or normalized journal title
Start page, author, or author last name
Publication year, issue, or part
Normalized article title

In order to create a dedup key, the record must include all of the dedup key elements. Records match when the dedup keys are identical.

For information on how the MARC fields are mapped into Primo, see Generic MARC 21 Normalization Rules.

If you want to load articles into the local repository, create a dedup vector as follows:

Dedup Vector
Field ID	Field Content	Note
T	3 OR 99	Use type 99 for records that do not include all required data elements. To do this, first create rules that assign type 99 to records that are missing any of the following groups of fields in the addata section, with a separate rule for every group of elements: (1) if the record does not have an ISSN, DOI, or Journal title, use type 99; (2) if the record does not have StartPage, author, or author last name, use type 99; (3) if the record does not have PublicationYear, Issue, or Part, use type 99; (4) if the record does not have an ArticleTitle, use type 99. All other records should get type 3.
C1	The match key created from the following elements as a single string: (ISSN, DOI, or Journal title) + (StartPage, author, or author last name) + (PublicationYear, issue, or part) + (ArticleTitle)
F1	The match key created from the following elements as a single string: (ISSN, DOI, or Journal title) + (StartPage, author, or author last name) + (PublicationYear, issue, or part) + (ArticleTitle)
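The type assignment and key construction in the table can be sketched in Python as follows. Several details are assumptions made for illustration: the addata key names (`issn`, `doi`, `jtitle`, and so on), taking the first element present in each group in the listed order, and concatenating the parts without a separator. The values are assumed to be normalized already.

```python
from typing import Dict, Optional, Sequence

# Hypothetical addata key names, one tuple per required group of elements.
KEY_GROUPS = (
    ("issn", "doi", "jtitle"),
    ("spage", "au", "aulast"),
    ("date", "issue", "part"),
    ("atitle",),
)


def first_present(addata: Dict[str, str], names: Sequence[str]) -> Optional[str]:
    """Return the first non-empty value among the named fields, if any."""
    for name in names:
        value = addata.get(name, "").strip()
        if value:
            return value
    return None


def build_article_dedup(addata: Dict[str, str]) -> Dict[str, str]:
    """Return a dedup vector: type 99 if any group is missing, else type 3."""
    parts = [first_present(addata, group) for group in KEY_GROUPS]
    if any(part is None for part in parts):
        return {"t": "99"}  # skip duplicate detection for this record
    key = "".join(parts)   # assumption: parts are joined with no separator
    return {"t": "3", "c1": key, "f1": key}
```

A record that has an ISSN, a start page, a publication year, and an article title receives type 3 and the same key in C1 and F1. A record missing any one group receives type 99 and no key.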

The following rules are used to create a normalized article title:

Replace the following characters with a space: !@#$%^&*()_+-={}}[]:";<>?,./~`
Remove all blank characters.
Save the last 25 characters of the title.
Change the characters to lowercase characters.
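The four rules above can be sketched in Python, applied in the order listed. The function name is hypothetical, and the character set is copied from the first rule with the repeated closing brace listed once.

```python
# Characters that the first rule replaces with a space.
PUNCTUATION = '!@#$%^&*()_+-={}[]:";<>?,./~`'
_TO_SPACE = str.maketrans({ch: " " for ch in PUNCTUATION})


def normalize_article_title(title: str) -> str:
    """Apply the four article-title normalization rules in order."""
    spaced = title.translate(_TO_SPACE)   # 1. punctuation -> space
    compact = "".join(spaced.split())     # 2. remove all blank characters
    return compact[-25:].lower()          # 3. last 25 chars, 4. lowercase
```

For example, `normalize_article_title("Deep Learning: A Review!")` returns `deeplearningareview`. A longer title keeps only its final 25 characters after the blanks are removed.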

The Deduped-Merged Record

The system creates the merged record based on the preferred record, where the fields in the following sections are merged from all records in the dedup group:

Control – most fields are merged
Display – After the source and availlibrary fields are merged, the other fields are taken from the preferred record.
Links – all fields are merged and duplicate fields are removed
Search – all fields are merged and duplicate fields are removed
Sort – only fields from the preferred record are taken
Facets – all fields are merged and duplicate fields are removed
Dedup – not relevant
FRBR – all fields are merged and duplicate fields are removed
Delivery – all fields are merged
Ranking – the highest value is taken from all records
Enrichment – not relevant
Additional data – all fields are merged and duplicate fields are removed
Local fields – all local fields are included

To enable the system to identify the original source record, the dedup process adds a subfield O ($$O) and a subfield V ($$V). The content of $$O is the original PNX record ID, and the content of $$V is the value of the original field. The system uses $$O when it needs to link between fields that are derived from the same source PNX record - all fields with the same $$O derive from the same source record.

The $$V and $$O are added to fields from the control, display, links, and delivery sections. For example, a deduped record will have multiple <sourceid/> fields in the control section:

<sourceid>$$VBBI$$OBBI004876460</sourceid>
<sourceid>$$VBBI$$OBBI004550753</sourceid>

In this example, the value of the original control/sourceid fields is BBI, and the record IDs of the source PNX records are BBI004876460 and BBI004550753.
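A consumer of merged records might split such field values into their subfields and group them by source record. The following Python sketch shows one way to do so. It assumes that every subfield starts with `$$` followed by a single-character code, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

# "$$" followed by a one-character subfield code (assumed format).
_SUBFIELD = re.compile(r"\$\$(.)")


def parse_subfields(value: str) -> Dict[str, str]:
    """Split a PNX field value into {subfield code: content}."""
    parts = _SUBFIELD.split(value)
    # parts[0] is any text before the first delimiter; then code/content pairs.
    return dict(zip(parts[1::2], parts[2::2]))


def group_by_source(fields: List[str]) -> Dict[str, List[str]]:
    """Group the $$V values of several fields by their $$O record ID."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for raw in fields:
        subfields = parse_subfields(raw)
        groups[subfields.get("O", "")].append(subfields.get("V", ""))
    return dict(groups)
```

Applied to the two sourceid values `$$VBBI$$OBBI004876460` and `$$VBBI$$OBBI004550753`, `group_by_source` returns one entry per source record ID, each holding the original value `BBI`.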

The following figure shows an example of a deduped-merged PNX record:

<record>

  <control>

    <sourceformat>MARC21</sourceformat>

    <sourcesystem>$$VILS$$OBBI004876460</sourcesystem>

    <sourcesystem>$$VILS$$OBBI004550753</sourcesystem>

    <recordid>dedupmrg2284018</recordid>

    <originalsourceid>$$VPRM01$$OBBI004876460</originalsourceid>

    <originalsourceid>$$VPRM01$$OBBI004550753</originalsourceid>

    <sourceid>$$VBBI$$OBBI004876460</sourceid>

    <sourceid>$$VBBI$$OBBI004550753</sourceid>

    <sourcerecordid>$$V004876460$$OBBI004876460</sourcerecordid>

    <sourcerecordid>$$V004550753$$OBBI004550753</sourcerecordid>

  </control>
  <display>

    <type>book</type>

    <title>Language development and learning to read the scientific study of how language development affects reading skill</title>

    <creator>Diane McGuinness</creator>

    <contributor>NetLibrary, Inc.</contributor>

    <publisher>Cambridge, Mass. : MIT Press</publisher>

    <creationdate>c2005</creationdate>

    <format>x, 494 p. : ill. ; 24 cm..</format>

    <identifier>$$CISBN$$V142372612X (electronic bk.)</identifier>

    <subject>Reading -- Research; Language acquisition -- Research; Electronic books</subject>

    <language>eng</language>

    <source>$$VBBI$$OBBI004876460</source>

    <source>$$VBBI$$OBBI004550753</source>

    <availlibrary>$$INORTH$$LNINTE$$Savailable$$33$$40$$5N$$60$$OBBI004876460</availlibrary>

    <availlibrary>$$ISOUTH$$LKINTE$$1Internet$$Scheck_holdings$$OBBI004876460</availlibrary>

    <availlibrary>$$ISOUTH$$LLINTE$$1Book$$Scheck_holdings$$OBBI004876460</availlibrary>

    <availlibrary>$$INORTH$$LNWILS$$1General collection$$2(LB1050.6 .M34 2005 )$$Savailable$$31$$40$$5N$$60$$OBBI004550753</availlibrary>

    <availinstitution>$$INORTH$$Savailable</availinstitution>

    <availinstitution>$$ISOUTH$$Scheck_holdings</availinstitution>

    <availpnx>available</availpnx>

  </display>
  <links>

    <linktotoc>$$Tamazon_toc$$DTable of Contents$$OBBI004876460</linktotoc>

    <linktoabstract>$$Tsyndetics_abstract$$DAbstract$$OBBI004876460</linktoabstract>

    <linktouc>$$Tamazon_uc$$DThis item in Amazon.com$$OBBI004876460</linktouc>

    <linktouc>$$Tworldcat_isbn$$DThis item in WorldCat®$$OBBI004876460</linktouc>

    <linktoexcerpt>$$Tsyndetics_excerpt$$DExcerpt from item$$OBBI004876460</linktoexcerpt>

    <openurl>$$Topenurl_journal$$OBBI004876460</openurl>

    <openurlfulltext>$$Topenurlfull_journal$$OBBI004876460</openurlfulltext>

    <linktoholdings>$$V$$TILS_holdings$$OBBI004876460</linktoholdings>

    <linktoholdings>$$V$$TILS_holdings$$OBBI004550753</linktoholdings>

    <backlink>$$V$$TILS_backlink$$DThis item in the Library Catalog$$OBBI004876460</backlink>

    <backlink>$$V$$TILS_backlink$$DThis item in the Library Catalog$$OBBI004550753</backlink>

    <linktorsrc>$$V$$Uhttp://www.netLibrary.com/summary.asp?id=138523$$Dfor Primo University Crookston access$$OBBI004876460</linktorsrc>

    <linktorsrc>$$V$$Uhttps://www.lib.umn.edu/slog.phtml?url=http://www.netLibrary.com/summary.asp?id=138523$$DNorth Campus access$$OBBI004876460</linktorsrc>

    <linktorsrc>$$V$$Uhttp://www.netLibrary.com/summary.asp?id=138523$$DNorth Campus access$$OBBI004876460</linktorsrc>

  </links>
  <search>

    <creatorcontrib>NetLibrary, Inc.</creatorcontrib>

    <creatorcontrib>Net Library, Inc</creatorcontrib>

    <title>Language development and learning to read the scientific study of how language development affects reading skill /</title>

    <subject>Electronic books.</subject>

    <general>[electronic resource] :</general>

    <isbn>142372612X</isbn>

    <recordid>BBI004876460</recordid>

    <searchscope>SOUTH</searchscope>

    <scope>SOUTH</scope>

    <creatorcontrib>Diane McGuinness</creatorcontrib>

    <creatorcontrib>McGuinness, D</creatorcontrib>

    <creatorcontrib>Diane McGuinness.</creatorcontrib>

    <title>Language development and learning to read : the scientific study of how language development affects reading skill /</title>

    <subject>Reading Research.</subject>

    <subject>Language acquisition Research.</subject>

    <general>MIT Press,</general>

    <isbn>0262134527</isbn>

    <creationdate>2005</creationdate>

    <sourceid>BBI</sourceid>

    <recordid>BBI004550753</recordid>

    <rsrctype>book</rsrctype>

    <searchscope>NORTH</searchscope>

    <searchscope>BBI</searchscope>

    <scope>NORTH</scope>

    <scope>BBI</scope>

  </search>

  <sort>

    <creationdate>2005</creationdate>

  </sort>
  <facets>

    <collection>NINTE</collection>

    <collection>KINTE</collection>

    <collection>LINTE</collection>

    <toplevel>online_resources</toplevel>

    <creatorcontrib>NetLibrary, Inc</creatorcontrib>

    <genre>Electronic books</genre>

    <language>eng</language>

    <creationdate>2005</creationdate>

    <topic>Reading-Research</topic>

    <topic>Language acquisition-Research</topic>

    <collection>NWILS</collection>

    <toplevel>available</toplevel>

    <creatorcontrib>McGuinness, D</creatorcontrib>

    <prefilter>books</prefilter>

    <rsrctype>books</rsrctype>

    <classificationlcc>L - Education.-Theory and practice of education-Teaching (Principles and practice)-Reading (General)</classificationlcc>

  </facets>

  <dedup>

    <t>1</t>

    <c2>142372612X</c2>

    <c3>languagedevelopmentaadingskill</c3>

    <c4>2005</c4>

    <f3>142372612X</f3>

    <f5>languagedevelopmentaadingskill</f5>

    <f6>2005</f6>

    <f7>language development and learning to read the scientific study of how language development affects reading skill</f7>

    <f8>mau</f8>

    <f9>x, 494 p. :</f9>

    <f10>mit press</f10>

    <f11>mcguinness diane</f11>

  </dedup>

  <frbr>

    <t>1</t>

    <k1>$$Kmcguinness diane$$AA</k1>

    <k3>$$Klanguage development and learning to read the scientific study of how language development affects reading skill$$AT</k3>

  </frbr>
  <delivery>

    <institution>$$VNORTH$$OBBI004876460</institution>

    <institution>$$VSOUTH$$OBBI004876460</institution>

    <delcategory>$$VOnline Resource$$OBBI004876460</delcategory>

    <institution>$$VNORTH$$OBBI004550753</institution>

    <delcategory>$$VPhysical Item$$OBBI004550753</delcategory>

  </delivery>

  <enrichment>

    <classificationlcc>LB1050.6</classificationlcc>

  </enrichment>

  <ranking>

    <booster1>1</booster1>

    <booster2>1</booster2>

  </ranking>

  <addata>

    <addau>NetLibrary, Inc</addau>

    <eissn>0262134527 0765805723</eissn>

    <isbn>142372612X</isbn>

    <oclcid>61704190</oclcid>

    <btitle>Language development and learning to read the scientific study of how language development affects reading skill</btitle>

    <aulast>McGuinness</aulast>

    <aufirst>Diane</aufirst>

    <au>McGuinness, Diane</au>

    <date>2005</date>

    <risdate>c2005.</risdate>

    <isbn>0262134527</isbn>

    <format>book</format>

    <ristype>BOOK</ristype>

    <notes>Includes bibliographical references (p. [447]-477) and indexes.</notes>

    <cop>Cambridge, Mass.</cop>

    <pub>MIT Press</pub>

    <lccn>2004062118</lccn>

    <btitle>Language development and learning to read : the scientific study of how language development affects reading skill</btitle>

    <genre>book</genre>

  </addata>

</record>