
    Duplicate Detection Vector

    If you are working with Primo VE and not Primo, see Understanding the Dedup and FRBR Processes (Primo VE).

    The dedup vector includes the following:

    • Type (T). The type defines the matching rules that will be used. Currently Primo allows the following types:

      • Non-serials (T=1)—for all other records (refer to The Non-Serials Vector and Algorithm).

      • Serials (T=2)—for serial records (refer to The Serials Vector and Algorithm).

        These rules are based on the matching algorithms developed together with the California Digital Library (CDL).
      • Articles (T=3)—for articles (refer to Deduplication Algorithm for Articles).

        If you would like to skip duplicate detection for individual records (such as analytic records for Aleph pipes), you can set this field to 99 in the normalization rules.

    The Serials and Non-serials duplicate detection algorithms have two phases: Candidate Selection and Record Matching. The Articles duplicate detection algorithm has only a match phase.

    • Candidate Fields (C1-C10)—The Candidate Selection phase locates up to a set number of potential records for matching. This section in the vector is indexed in the persistence layer. The indexes are used to locate candidates.

    • Matching Fields (F1-F20)—During the record matching phase, fields from the Matching Fields section are compared. Fields that match are assigned weight points, as determined by the rules used. Records that cross the threshold are considered duplicates and are assigned the MatchID of the matching record.

    All of the fields in the vector should be normalized. Normalization routines may be different for different sources.

    The following sections describe the various vectors and matching algorithms.

    Dedup vectors and keys are limited to 4000 bytes. If this limit is reached, you may receive an SQL exception error on the P_DEDUP_VECTOR table.
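
    Because the limit is expressed in bytes rather than characters, values containing accented or non-Latin characters reach it sooner than their character count suggests. The following is a minimal sketch of a pre-load check. It assumes the value is stored as UTF-8 and that a value of exactly 4000 bytes is still accepted; both are assumptions, and `fits_dedup_limit` is a hypothetical helper, not a Primo function.

```python
MAX_DEDUP_BYTES = 4000  # limit stated in the text above


def fits_dedup_limit(value: str, limit: int = MAX_DEDUP_BYTES) -> bool:
    """Return True if the UTF-8 encoding of value is within the byte limit.

    Assumption: the stored value is UTF-8 and a value of exactly
    `limit` bytes is still accepted.
    """
    return len(value.encode("utf-8")) <= limit
```

    For example, a string of 2001 "é" characters is only 2001 characters long but occupies 4002 bytes in UTF-8, so it would fail this check.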

    The Serials Vector and Algorithm

    The following types of vectors exist for serials:

    • Candidate

    • Matching

    For information on how the MARC fields are mapped into Primo, see Generic MARC 21 Normalization Rules.

    Serials Candidate Vector

    The following table describes the fields in the Candidate vector.
    Serials Candidate Vector Fields
    Field ID | Field Content | Note
    C1 | UnivID, UnivID_invalid | This is a unique universal ID (for example, LCCN).
    C2 | ISSN, ISSN_invalid, ISSN_cancelled |
    C3 | Short Title |
    C4 | Place of Publication | Only the first occurrence is used.
    C5 | Single match ID | Intended for the Alma MMS ID or another ID that is reliable enough to serve as the sole basis for the match.

    In the Candidate phase of the algorithm, candidate fields C1, C2, and C3 are combined with an OR operator. If many candidates are located, the fourth candidate field (C4) is added with an AND operator.

    If there is a match on C5, the records are considered a match and will not continue to the matching stage, which is based on the other metadata elements.
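
    The candidate logic described above can be sketched as follows. This is an illustration only, not Primo's implementation: the `CandidateVector` class, the in-memory `pool` (Primo uses indexes in the persistence layer), and the `many` cut-off are all hypothetical, and the text does not state the cut-off used for serials.

```python
from dataclasses import dataclass, field


@dataclass
class CandidateVector:
    record_id: str
    c1: set = field(default_factory=set)  # UnivID / UnivID_invalid values
    c2: set = field(default_factory=set)  # ISSN / ISSN_invalid / ISSN_cancelled values
    c3: str = ""                          # short title
    c4: str = ""                          # place of publication (first occurrence)
    c5: str = ""                          # single match ID


def select_candidates(rec, pool, many=150):
    """Return (candidates, direct_match).

    `many` is a hypothetical cut-off for "many candidates".
    """
    others = [o for o in pool if o.record_id != rec.record_id]

    # A match on C5 is decisive: the records skip the matching stage.
    if rec.c5:
        direct = [o for o in others if o.c5 == rec.c5]
        if direct:
            return direct, True

    # C1 OR C2 OR C3
    hits = [
        o for o in others
        if (rec.c1 & o.c1) or (rec.c2 & o.c2) or (rec.c3 and rec.c3 == o.c3)
    ]

    # If many candidates are located, C4 is added with AND.
    if len(hits) > many:
        hits = [o for o in hits if rec.c4 and rec.c4 == o.c4]
    return hits, False
```

    Records returned with `direct_match` set to `False` would then go on to the record matching phase.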

    Serials Matching Vector

    The following table describes the fields in the Matching vector.
    Serials Matching Vector Fields
    Field ID | Field Content | Note
    F1 | UnivID |
    F2 | UnivID_invalid | Multiple occurrences are delimited by a semicolon.
    F3 | ISSN | Multiple occurrences are delimited by a semicolon.
    F4 | ISSN_invalid | Multiple occurrences are delimited by a semicolon.
    F5 | ISSN_cancelled | Multiple occurrences are delimited by a semicolon.
    F6 | Start publication year |
    F7 | Full title |
    F8 | Brief title | Remove subtitle and any additional information.
    F9 | Country of publication |
    F10 | Place of publication |
    F11 | Main entry (author, corporate body, meeting) |

    The matching takes place in two stages, quick and full.

    The quick match compares the following fields:

    • Single match ID

    • UnivID/UnivID_invalid

    • ISSN/ISSN_invalid/ISSN_cancelled

    • Full title

    The full match compares all fields in the vector.

    The following table lists the default weights for quick and full matches for serials. If 800 points are reached in the quick-match stage, the records are considered a match. If not, the record proceeds to the full-match stage, which checks all fields. As in the quick-match stage, if 800 points are reached, the records are considered a match.

    In both the quick-match and full-match stages, the weight from the UnivID and ISSN matches is compared, and the higher of the two weights, not the sum, is assigned to the record.

    For every group, only the highest weight is assigned.
    Default Weights for Quick and Full Matches For Serials
    Field ID | Fields for Comparison | Result | Points
    F1/F2 | UnivID/UnivID_invalid | Match on UnivID | 200
    F1/F2 | UnivID/UnivID_invalid | Match on UnivID_invalid | 50
    F1/F2 | UnivID/UnivID_invalid | Match between UnivID and UnivID_invalid | 100
    F1/F2 | UnivID/UnivID_invalid | No match on UnivID | -470
    F1/F2 | UnivID/UnivID_invalid | No match between UnivID and UnivID_invalid | -50
    F1/F2 | UnivID/UnivID_invalid | No match on UnivID_invalid | 0
    F1/F2 | UnivID/UnivID_invalid | Either or both records missing field | 0
    F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | Match on ISSN | 200
    F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | Match on ISSN_invalid | 50
    F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | Match on ISSN_cancelled | 10
    F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | Match between ISSN and ISSN_invalid | 100
    F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | Match between ISSN and ISSN_cancelled | 50
    F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | Match between ISSN_invalid and ISSN_cancelled | 30
    F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | No match on ISSN | -250
    F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | No match on ISSN_invalid and ISSN_cancelled | 0
    F3/F4/F5 | ISSN/ISSN_invalid/ISSN_cancelled | Either record or both records missing field | 0
    F7 | Full Title | Exact match on title and title NOT in table of common titles | 600
    F7 | Full Title | Exact match on title and title IS in table of common titles | 135
    F7 | Full Title | Match on truncated title and truncated title in the list of common titles | 135
    F7 | Full Title | Match on truncated title and truncated title not in the list of common titles | 175
    F7 | Full Title | No match | -600
    F7 | Full Title | Calculate weight based on percentage of keywords from title that match x 75 | *
    F7 | Full Title | Calculate weight based on percentage of keywords from title that match x 75 + 50 | *
    F6 | Date | Exact match | 225
    F6 | Date | +/- 1 year | 50
    F6 | Date | +/- 2 years | 25
    F6 | Date | If first three digits match, check the 4th digit and if either record has a 0 | 20
    F6 | Date | No match | -150
    F6 | Date | The value is missing from either or both records | 0
    F9 | Country of Publication | Match | 40
    F9 | Country of Publication | No match | -20
    F9 | Country of Publication | Either record or both records missing the value | 0
    F10 | Place of Publication | Exact match on normalized place of publication | 200
    F10 | Place of Publication | Either or both records are missing the subfield | 0
    F10 | Place of Publication | No match on normalized place of publication | -100
    F11 | Main Entry | If the normalized contents of the fields match, then it is considered a full match even if the data was found in different fields | 200
    F11 | Main Entry | If one or both main entries are missing | 0
    F11 | Main Entry | If more than 60% of the keywords from main entry fields match and are in the same order | 75 times the percentage of words that match plus 25
    F11 | Main Entry | If more than 60% of the keywords from main entry fields match but are not in the same order | 75 times the percentage of words that match
    F11 | Main Entry | If 60% or less of the keywords in main entry fields match | -250
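
    To illustrate how the default weights combine, the following sketch scores the date field (F6) and applies the 800-point threshold. It is a hedged illustration rather than Primo's implementation: the function names and the group names used as dictionary keys (`univid`, `issn`, and so on) are hypothetical, overlapping date rules are resolved by taking the highest weight (following the "only the highest weight is assigned" rule), and a group absent from the dictionary is treated as 0 points.

```python
from typing import Optional

SERIALS_THRESHOLD = 800  # points needed in either stage, per the text above


def serial_date_weight(a: Optional[str], b: Optional[str]) -> int:
    """Weight for two start publication years, using the default F6 rows."""
    if not a or not b:
        return 0        # value missing from either or both records
    if a == b:
        return 225      # exact match
    weights = []
    if a.isdigit() and b.isdigit():
        diff = abs(int(a) - int(b))
        if diff == 1:
            weights.append(50)
        elif diff == 2:
            weights.append(25)
    # First three digits match and either record has 0 as the 4th digit.
    if len(a) == 4 and len(b) == 4 and a[:3] == b[:3] and "0" in (a[3], b[3]):
        weights.append(20)
    return max(weights) if weights else -150


def total_weight(weights: dict) -> int:
    """Sum the group weights, counting only the higher of UnivID and ISSN."""
    id_weight = max(weights.get("univid", 0), weights.get("issn", 0))
    others = sum(w for name, w in weights.items() if name not in ("univid", "issn"))
    return id_weight + others


def is_serial_match(quick: dict, full: dict) -> bool:
    """Apply the quick-match stage first, then the full-match stage."""
    if total_weight(quick) >= SERIALS_THRESHOLD:
        return True
    return total_weight(full) >= SERIALS_THRESHOLD
```

    For example, a record pair that matches on both UnivID (200) and ISSN (200) and has an exact, uncommon title match (600) scores 200 + 600 = 800 in the quick-match stage, not 1000, because only the higher of the two identifier weights is counted.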

    The Non-Serials Vector and Algorithm

    The following types of vectors exist for non-serials:
    • Candidate
    • Matching

    For information on how the MARC fields are mapped into Primo, see Generic MARC 21 Normalization Rules.

    Non-Serials Candidate Vector

    The following table describes the fields in the Candidate vector.
    Non-Serials Candidate Vector Fields
    Field ID | Field Content | Note
    C1 | UnivID and UnivID_invalid | A unique universal ID (for example, LCCN)
    C2 | ISBN, ISBN_invalid | Multiple occurrences delimited by a semicolon.
    C3 | Short title | The first 25 characters of the normalized title.
    C4 | Year |
    C5 | Single match ID | Intended for the Alma MMS ID or another ID that is reliable enough to serve as the sole basis for the match.
    In the Candidate algorithm, candidate fields C1, C2, and C3 are combined with an OR operator. The fourth candidate field (C4) is added, with an AND operator, only if too many candidates are located (more than 150).
    If there is a match on C5, the records are considered a match and will not continue to the matching stage, which is based on the other metadata elements.

    Non-Serials Matching Vector

    The following table describes the fields in the Matching vector.
    Non-Serials Matching Vector Fields
    Field ID | Field Content | Note
    F1 | UnivID |
    F2 | UnivID_invalid | Multiple occurrences are delimited by a semicolon.
    F3 | ISBN | Multiple occurrences are delimited by a semicolon.
    F4 | ISBN_invalid | Multiple occurrences are delimited by a semicolon.
    F5 | Short title | The first 25 characters of the normalized title.
    F6 | Year |
    F7 | Full title |
    F8 | Country of publication |
    F9 | Pagination | The highest number in the pagination field should be used.
    F10 | Publisher |
    F11 | Main entry (author, corporate body, meeting) |
    The matching takes place in two stages: quick and full.
    The quick match stage compares the following fields:
    • Single match ID
    • UnivID/UnivID_invalid
    • ISBN/ISBN_invalid
    • Short title
    • Year
    If 850 points are reached, the records are considered a match. If not, the record proceeds to the full-match stage, which uses all fields, except that the full title is used instead of the short title. If 875 points are reached, the records are considered a match.
    In both the quick-match and full-match stages, the weights from the UnivID and ISBN matches are compared, and the higher of the two weights, not the sum, is assigned to the record.
    For every group, only the highest weight is assigned.
    Default Weights for Quick and Full Matches For Non-Serials
    Fields for Comparison | Result | Points
    UnivID/UnivID_invalid* | Match on valid UnivID | 200
    UnivID/UnivID_invalid* | Match on invalid UnivID | 50
    UnivID/UnivID_invalid* | Match between valid and invalid | 100
    UnivID/UnivID_invalid* | Field present in both records but no match | -320
    UnivID/UnivID_invalid* | Either record or both records missing | 0
    ISBN/ISBN_invalid* | Match between valid ISBN | 85
    ISBN/ISBN_invalid* | Match between invalid ISBN | 10
    ISBN/ISBN_invalid* | Match between valid and invalid | 30
    ISBN/ISBN_invalid* | Field present in both records but no match | -225
    ISBN/ISBN_invalid* | Either record or both records missing | 0
    Date | Exact match | 200
    Date | +/- 2 years | -25
    Date | No match | -250
    Date | Value missing | 0
    Short-Title | Exact match on first 25 characters | 450
    Short-Title | Non-match | 0
    Full-Title | Exact match | 600
    Full-Title | Either title contained within the other title | 350
    Full-Title | Either title shorter than nine characters | 0
    Full-Title | Matching keywords | 450 x (% of matching words)
    Full-Title | Matching keywords in order | 450 x (% + 50)
    Full-Title | Non-match | -600
    Country of Publication | Exact match | 40
    Country of Publication | Either one missing | 0
    Country of Publication | Non-match | -205
    Pagination | Exact match, and the value is greater than 10 | 100
    Pagination | Exact match, and the value is less than or equal to 10 | 50
    Pagination | Values differ by 1-10 pages, and both values are greater than 10 | 50
    Pagination | Values differ by 1-10 pages, and either value is less than or equal to 10 | 20
    Pagination | Non-match (values differ by more than 10 pages) | -225
    Publisher | Exact match | 100
    Publisher | Either missing | 0
    Publisher | Occur within the other | 100
    Publisher | Non-match | -25
    Main Entry | Exact match | 125
    Main Entry | Both main entries missing | 75
    Main Entry | Half (or more) of the main entry keywords are common and in the same order | % common keywords x 80 + 10
    Main Entry | Half (or more) of the main entry keywords are common, but are not in the same order | % common keywords x 80
    Main Entry | Present in one record but missing in the other | -25
    Main Entry | Non-match | -200

    There is a known issue with the weight of F11 fields. Primo currently gives +25 points to one missing main entry instead of -25.
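
    As an illustration of the pagination rows and the two thresholds, the following sketch extracts the highest number from a pagination string, scores two page counts, and reports which stage (if any) produces a match. It is a hedged example, not Primo's code: the function names are hypothetical, and because the table has no row for a missing pagination value, the sketch assumes 0 points in that case.

```python
import re
from typing import Optional

QUICK_THRESHOLD = 850  # quick-match stage, per the text above
FULL_THRESHOLD = 875   # full-match stage, per the text above


def highest_page_number(pagination: str) -> Optional[int]:
    """Return the highest number found in a pagination string, if any."""
    numbers = [int(n) for n in re.findall(r"\d+", pagination)]
    return max(numbers) if numbers else None


def pagination_weight(a: Optional[int], b: Optional[int]) -> int:
    """Weight for two page counts, using the default Pagination rows."""
    if a is None or b is None:
        return 0  # assumption: the table has no row for a missing value
    diff = abs(a - b)
    if diff == 0:
        return 100 if a > 10 else 50
    if diff <= 10:
        return 50 if min(a, b) > 10 else 20
    return -225


def match_stage(quick_points: int, full_points: int) -> Optional[str]:
    """Return 'quick', 'full', or None, given the total points of each stage."""
    if quick_points >= QUICK_THRESHOLD:
        return "quick"
    if full_points >= FULL_THRESHOLD:
        return "full"
    return None
```

    For example, a pair that matches on ISBN (85), year (200), and short title (450) reaches 735 points, which is below the quick-match threshold. If the full-match stage then finds an exact full title (600), country (40), pagination above 10 (100), publisher (100), and main entry (125), the total is 85 + 200 + 600 + 40 + 100 + 100 + 125 = 1250, and the records match in the full-match stage.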

    Deduplication Algorithm for Articles

    The deduplication algorithm for articles matches on a single key, which is used for both the candidate and the match: use C1 for the candidate key and F1 for the match. In addition to remote searches, the deduplication algorithm can be used for records that are harvested into the local Primo repository. In both cases, a single key is created from the following elements:
    • ISSN, DOI, or normalized journal title
    • Start page, author, or author last name
    • Publication year, issue, or part
    • Normalized article title
    In order to create a dedup key, the record must include all of the dedup key elements. Records match when the dedup keys are identical.

    For information on how the MARC fields are mapped into Primo, see Generic MARC 21 Normalization Rules.

    If you want to load articles into the local repository, create a dedup vector as follows:
    Dedup Vector
    Field ID Field Content Note
    T
    3 OR 99
    Use type 99 for records that do not include all required data elements. This can be done by first creating rules that assign the type 99 to the records that do not have the following fields in the addata section. Create a separate rule for every group of elements:
    • If the record does not have an ISSN, DOI, or a Journal title, use type 99.
    • If the record does not have StartPage, author, or author last name, use type 99.
    • If the record does not have PublicationYear, Issue, or Part, use type 99.
    • If the record does not have an ArticleTitle, use type 99.
    All other records should get type 3.
    C1
    The match key created from the following elements as a single string:
    (ISSN, DOI, or Journal title) + (StartPage, author, or author last name) + (PublicationYear, issue, or part) + (ArticleTitle)
     
    F1
    The match key created from the following elements as a single string:
    (ISSN, DOI, or Journal title) + (StartPage, author, or author last name) + (PublicationYear, issue, or part) + (ArticleTitle)
     
    The following rules are used to create a normalized article title:
    • Replace the following characters with a space: !@#$%^&*()_+-={}}[]:";<>?,./~`
    • Remove all blank characters.
    • Save the last 25 characters of the title.
    • Change the characters to lowercase characters.
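
    The title rules and the key construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the normalization rules that Primo ships: the dictionary keys (`issn`, `doi`, `jtitle`, `spage`, `au`, `aulast`, `date`, `issue`, `part`, `atitle`) are assumed names for the addata elements, the first available element in each group is used, the elements are concatenated without a separator, and the non-title elements are not normalized further.

```python
from typing import Optional, Tuple

# Characters replaced by a space, copied from the first rule above.
PUNCTUATION = set('!@#$%^&*()_+-={}}[]:";<>?,./~`')


def normalize_article_title(title: str) -> str:
    """Apply the four title rules in the order listed above."""
    spaced = "".join(" " if ch in PUNCTUATION else ch for ch in title)
    compact = "".join(spaced.split())   # remove all blank characters
    return compact[-25:].lower()        # keep the last 25 characters, lowercased


def first_present(record: dict, names: tuple) -> Optional[str]:
    """Return the first non-empty value among the named elements."""
    for name in names:
        value = record.get(name)
        if value:
            return value
    return None


def dedup_type_and_key(record: dict) -> Tuple[int, Optional[str]]:
    """Return (3, key) when all four element groups are present, else (99, None)."""
    parts = [
        first_present(record, ("issn", "doi", "jtitle")),
        first_present(record, ("spage", "au", "aulast")),
        first_present(record, ("date", "issue", "part")),
        normalize_article_title(record.get("atitle", "")) or None,
    ]
    if any(part is None for part in parts):
        return 99, None
    return 3, "".join(parts)
```

    With this sketch, a record with ISSN `1234-5678`, start page `101`, year `2005`, and the title `Deep Learning: A Review!` gets type 3 and the key `1234-56781012005deeplearningareview`, and the same string would be written to both C1 and F1.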

    The Deduped-Merged Record

    The system creates the merged record based on the preferred record, where the fields in the following sections are merged from all records in the dedup group:
    • Control – most fields are merged
    • Display – After the source and availlibrary fields are merged, the other fields are taken from the preferred record.
    • Links – all fields are merged and duplicate fields are removed
    • Search – all fields are merged and duplicate fields are removed
    • Sort – only fields from the preferred record are taken
    • Facets – all fields are merged and duplicate fields are removed
    • Dedup – not relevant
    • FRBR – all fields are merged and duplicate fields are removed
    • Delivery – all fields are merged
    • Ranking – the highest value is taken from all records
    • Enrichment – not relevant
    • Additional data – all fields are merged and duplicate fields are removed
    • Local fields – all local fields are included
    To enable the system to identify the original source record, the dedup process adds a subfield O ($$O) and a subfield V ($$V). The content of $$O is the original PNX record ID, and the content of $$V is the value of the original field. The system uses $$O when it needs to link between fields that are derived from the same source PNX record: all fields with the same $$O derive from the same source record.
    The $$V and $$O are added to fields from the control, display, links, and delivery sections. For example, a deduped record will have multiple <sourceid/> fields in the control section:
    <sourceid>$$VBBI$$OBBI004876460</sourceid>
    <sourceid>$$VBBI$$OBBI004550753</sourceid>
    In this example, the value of the original control/sourceid fields is BBI, and the record IDs of the source PNX record are BBI004876460 and BBI004550753.
    The following figure shows an example of a deduped-merged PNX record:
    <record>
      <control>
        <sourceformat>MARC21</sourceformat>
        <sourcesystem>$$VILS$$OBBI004876460</sourcesystem>
        <sourcesystem>$$VILS$$OBBI004550753</sourcesystem>
        <recordid>dedupmrg2284018</recordid>
        <originalsourceid>$$VPRM01$$OBBI004876460</originalsourceid>
        <originalsourceid>$$VPRM01$$OBBI004550753</originalsourceid>
        <sourceid>$$VBBI$$OBBI004876460</sourceid>
        <sourceid>$$VBBI$$OBBI004550753</sourceid>
        <sourcerecordid>$$V004876460$$OBBI004876460</sourcerecordid>
        <sourcerecordid>$$V004550753$$OBBI004550753</sourcerecordid>
      </control>
      <display>
        <type>book</type>
        <title>Language development and learning to read the scientific study of how language development affects reading skill</title>
        <creator>Diane McGuinness</creator>
        <contributor>NetLibrary, Inc.</contributor>
        <publisher>Cambridge, Mass. : MIT Press</publisher>
        <creationdate>c2005</creationdate>
        <format>x, 494 p. : ill. ; 24 cm..</format>
        <identifier>$$CISBN$$V142372612X (electronic bk.)</identifier>
        <subject>Reading -- Research; Language acquisition -- Research; Electronic books</subject>
        <language>eng</language>
        <source>$$VBBI$$OBBI004876460</source>
        <source>$$VBBI$$OBBI004550753</source>
        <availlibrary>$$INORTH$$LNINTE$$Savailable$$33$$40$$5N$$60$$OBBI004876460</availlibrary>
        <availlibrary>$$ISOUTH$$LKINTE$$1Internet$$Scheck_holdings$$OBBI004876460</availlibrary>
        <availlibrary>$$ISOUTH$$LLINTE$$1Book$$Scheck_holdings$$OBBI004876460</availlibrary>
        <availlibrary>$$INORTH$$LNWILS$$1General collection$$2(LB1050.6 .M34 2005 )$$Savailable$$31$$40$$5N$$60$$OBBI004550753</availlibrary>
        <availinstitution>$$INORTH$$Savailable</availinstitution>
        <availinstitution>$$ISOUTH$$Scheck_holdings</availinstitution>
        <availpnx>available</availpnx>
      </display>
      <links>
        <linktotoc>$$Tamazon_toc$$DTable of Contents$$OBBI004876460</linktotoc>
        <linktoabstract>$$Tsyndetics_abstract$$DAbstract$$OBBI004876460</linktoabstract>
        <linktouc>$$Tamazon_uc$$DThis item in Amazon.com$$OBBI004876460</linktouc>
        <linktouc>$$Tworldcat_isbn$$DThis item in WorldCat®$$OBBI004876460</linktouc>
        <linktoexcerpt>$$Tsyndetics_excerpt$$DExcerpt from item$$OBBI004876460</linktoexcerpt>
        <openurl>$$Topenurl_journal$$OBBI004876460</openurl>
        <openurlfulltext>$$Topenurlfull_journal$$OBBI004876460</openurlfulltext>
        <linktoholdings>$$V$$TILS_holdings$$OBBI004876460</linktoholdings>
        <linktoholdings>$$V$$TILS_holdings$$OBBI004550753</linktoholdings>
        <backlink>$$V$$TILS_backlink$$DThis item in the Library Catalog$$OBBI004876460</backlink>
        <backlink>$$V$$TILS_backlink$$DThis item in the Library Catalog$$OBBI004550753</backlink>
        <linktorsrc>$$V$$Uhttp://www.netLibrary.com/summary.asp?id=138523$$Dfor Primo University Crookston access$$OBBI004876460</linktorsrc>
        <linktorsrc>$$V$$Uhttps://www.lib.umn.edu/slog.phtml?url=http://www.netLibrary.com/summary.asp?id=138523$$DNorth Campus access$$OBBI004876460</linktorsrc>
        <linktorsrc>$$V$$Uhttp://www.netLibrary.com/summary.asp?id=138523$$DNorth Campus access$$OBBI004876460</linktorsrc>
      </links>
      <search>
        <creatorcontrib>NetLibrary, Inc.</creatorcontrib>
        <creatorcontrib>Net Library, Inc</creatorcontrib>
        <title>Language development and learning to read the scientific study of how language development affects reading skill /</title>
        <subject>Electronic books.</subject>
        <general>[electronic resource] :</general>
        <isbn>142372612X</isbn>
        <recordid>BBI004876460</recordid>
        <searchscope>SOUTH</searchscope>
        <scope>SOUTH</scope>
        <creatorcontrib>Diane McGuinness</creatorcontrib>
        <creatorcontrib>McGuinness, D</creatorcontrib>
        <creatorcontrib>Diane McGuinness.</creatorcontrib>
        <title>Language development and learning to read : the scientific study of how language development affects reading skill /</title>
        <subject>Reading Research.</subject>
        <subject>Language acquisition Research.</subject>
        <general>MIT Press,</general>
        <isbn>0262134527</isbn>
        <creationdate>2005</creationdate>
        <sourceid>BBI</sourceid>
        <recordid>BBI004550753</recordid>
        <rsrctype>book</rsrctype>
        <searchscope>NORTH</searchscope>
        <searchscope>BBI</searchscope>
        <scope>NORTH</scope>
        <scope>BBI</scope>
      </search>
      <sort>
        <creationdate>2005</creationdate>
      </sort>
      <facets>
        <collection>NINTE</collection>
        <collection>KINTE</collection>
        <collection>LINTE</collection>
        <toplevel>online_resources</toplevel>
        <creatorcontrib>NetLibrary, Inc</creatorcontrib>
        <genre>Electronic books</genre>
        <language>eng</language>
        <creationdate>2005</creationdate>
        <topic>Reading-Research</topic>
        <topic>Language acquisition-Research</topic>
        <collection>NWILS</collection>
        <toplevel>available</toplevel>
        <creatorcontrib>McGuinness, D</creatorcontrib>
        <prefilter>books</prefilter>
        <rsrctype>books</rsrctype>
        <classificationlcc>L - Education.-Theory and practice of education-Teaching (Principles and practice)-Reading (General)</classificationlcc>
      </facets>
      <dedup>
        <t>1</t>
        <c2>142372612X</c2>
        <c3>languagedevelopmentaadingskill</c3>
        <c4>2005</c4>
        <f3>142372612X</f3>
        <f5>languagedevelopmentaadingskill</f5>
        <f6>2005</f6>
        <f7>language development and learning to read the scientific study of how language development affects reading skill</f7>
        <f8>mau</f8>
        <f9>x, 494 p. :</f9>
        <f10>mit press</f10>
        <f11>mcguinness diane</f11>
      </dedup>
      <frbr>
        <t>1</t>
        <k1>$$Kmcguinness diane$$AA</k1>
        <k3>$$Klanguage development and learning to read the scientific study of how language development affects reading skill$$AT</k3>
      </frbr>
      <delivery>
        <institution>$$VNORTH$$OBBI004876460</institution>
        <institution>$$VSOUTH$$OBBI004876460</institution>
        <delcategory>$$VOnline Resource$$OBBI004876460</delcategory>
        <institution>$$VNORTH$$OBBI004550753</institution>
        <delcategory>$$VPhysical Item$$OBBI004550753</delcategory>
      </delivery>
      <enrichment>
        <classificationlcc>LB1050.6</classificationlcc>
      </enrichment>
      <ranking>
        <booster1>1</booster1>
        <booster2>1</booster2>
      </ranking>
      <addata>
        <addau>NetLibrary, Inc</addau>
        <eissn>0262134527 0765805723</eissn>
        <isbn>142372612X</isbn>
        <oclcid>61704190</oclcid>
        <btitle>Language development and learning to read the scientific study of how language development affects reading skill</btitle>
        <aulast>McGuinness</aulast>
        <aufirst>Diane</aufirst>
        <au>McGuinness, Diane</au>
        <date>2005</date>
        <risdate>c2005.</risdate>
        <isbn>0262134527</isbn>
        <format>book</format>
        <ristype>BOOK</ristype>
        <notes>Includes bibliographical references (p. [447]-477) and indexes.</notes>
        <cop>Cambridge, Mass.</cop>
        <pub>MIT Press</pub>
        <lccn>2004062118</lccn>
        <btitle>Language development and learning to read : the scientific study of how language development affects reading skill</btitle>
        <genre>book</genre>
      </addata>
    </record>

     

     

     

     

     

     
