
    Matching Programs

    Matching programs are specified within the <step type="handler"/> element of the XML files. These programs tell the Dedup algorithm which field comparison to perform to determine which records are duplicates. Some of the programs are employed for general use on all data elements, while others apply to specific data elements.
    The programs compare an original record to a candidate record. The descriptions below refer to the match points defined in the out-of-the-box files.

    General Programs

    This section describes the programs that perform comparisons on specified fields in the Dedup section of the PNX record.

    DedupStringComparator

    This program compares the string values of a specified field (such as f9) in the original and candidate records.
    <handler id="CDLSubShortTitle">
        <fieldID>f9</fieldID>
        <name>com.exlibris.primo.publish.platform.dedup.comparator.DedupStringComparator</name>
        <arguments>
            <argument name="match">-450</argument>
        </arguments>
    </handler>
    A DedupStringComparator Handler Example
    To compare two strings, this handler performs the following steps:
    1. If the values for both records are null, return a value of both_missing. Otherwise, continue with next step.
    2. If one of the values is null, return a value of one_missing. Otherwise, continue with next step.
    3. If both values match exactly, return a value of match (-450). Otherwise, continue with next step.
    4. If one of the values is a substring of the other, return a value of within. Otherwise, return a value of mismatch (if not specified, return 0).
    If a return value (such as mismatch) is not defined in an <argument> element (as in the handler example above), a return value of 0 is used.
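    The four steps above can be sketched in a few lines of Python. This is only an illustration of the documented decision order, not the actual Java handler: the function name compare_strings and the points dictionary (standing in for the handler's <argument> elements) are assumptions made for this example.

```python
def compare_strings(original, candidate, points):
    """Illustrative sketch of the documented four-step string comparison.

    `points` stands in for the handler's <argument> elements
    (hypothetical representation, not the real API).
    """
    if original is None and candidate is None:
        result = "both_missing"
    elif original is None or candidate is None:
        result = "one_missing"
    elif original == candidate:
        result = "match"
    elif original in candidate or candidate in original:
        result = "within"
    else:
        result = "mismatch"
    # A result that has no <argument> element scores 0.
    return points.get(result, 0)


# Only "match" is defined in the example handler, so every other result scores 0.
example_points = {"match": -450}
print(compare_strings("short title", "short title", example_points))
print(compare_strings("short title", "another title", example_points))
```

    Keeping the point values in a lookup table mirrors how the handler reads them from its arguments, so a missing entry naturally falls back to 0.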

    CDLMainEntrySerialComparator

    This program compares the string value of the specified field (such as f11).
    <handler id="CDLMainEntry_se">
        <fieldID>f11</fieldID>
        <name>com.exlibris.primo.publish.platform.dedup.cdlimpl.CDLMainEntrySerialComparator</name>
        <arguments>
            <argument name="match">+200</argument>
            <argument name="keywords_weight_factor" param="59">75</argument>
            <argument name="keywords_order_base_weight" param="59">25</argument>
            <argument name="mismatch">-250</argument>
        </arguments>
    </handler>
    A CDLMainEntrySerialComparator Handler Example
    To compare two string values from the dedup records:
    1. If one of the values is null, then return 0.
    2. If both values are identical, return the value of match.
    3. Count the number of words that are equal in both entries and perform the following checks:
      • If more than 60% of the words are equal, take the ratio between the equal words and the number of words in the longest title and multiply it by the value of keywords_weight_factor (+75).
      • If the words that are common between the two titles are a substring of the short title, or vice versa, then add the value of keywords_order_base_weight (+25) to the value found in the previous check.
    4. If no match is found, return the value of mismatch (-250).
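    A rough Python sketch of this word-overlap scoring follows, using the point values from the example handler. Several details are assumptions because the description leaves them open: the 60% threshold is measured against the longer entry, words are split on whitespace, and "substring" is taken to mean that the common words, joined in order, appear contiguously in the shorter entry. The function name main_entry_serial_score is hypothetical.

```python
def main_entry_serial_score(original, candidate, match=200, weight_factor=75,
                            order_base_weight=25, mismatch=-250, threshold=0.60):
    """Illustrative sketch only; defaults come from the example handler."""
    if original is None or candidate is None:
        return 0
    if original == candidate:
        return match

    words_a, words_b = original.split(), candidate.split()
    longer, shorter = (words_a, words_b) if len(words_a) >= len(words_b) else (words_b, words_a)
    if not longer:
        return mismatch

    # Words of the longer entry that also occur in the shorter one (assumed definition).
    common = [word for word in longer if word in shorter]
    if len(common) / len(longer) <= threshold:
        return mismatch

    # Ratio of common words to the longer entry, scaled by keywords_weight_factor.
    score = weight_factor * len(common) / len(longer)

    # Order bonus: common words form a contiguous run of the shorter entry, or vice versa.
    common_phrase, shorter_phrase = " ".join(common), " ".join(shorter)
    if common_phrase in shorter_phrase or shorter_phrase in common_phrase:
        score += order_base_weight
    return score


print(main_entry_serial_score("journal of applied physics",
                              "journal of applied physics letters"))
```

    In this example four of the five words in the longer entry are shared, so the overlap contributes 75 × 4/5 = 60 points, and the order bonus adds 25.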

    DedupNumericComparator

    This program compares the date value of the specified field (such as f6).
    <handler id="CDLDate">
        <fieldID>f6</fieldID>
        <name>com.exlibris.primo.publish.platform.dedup.comparator.DedupNumericComparator</name>
        <arguments>
              <argument name="match">+200</argument>
              <argument name="within" param="2">-25</argument>
              <argument name="mismatch">-250</argument>
        </arguments>
     </handler>
    A DedupNumericComparator Handler Example
    To compare the values from the dedup records:
    1. If one of the values is null, then return 0.
    2. If both values match, return the value of match (+200).
    3. If the difference between the two date fields is within the value of the param attribute (2), then return the value of within (-25).
    4. Otherwise, return the value of mismatch (-250).
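    The numeric comparison can be sketched as follows, using the values from the example handler. The function name compare_numeric is hypothetical, and treating "within" as an inclusive range (a difference of exactly 2 still counts) is an assumption.

```python
def compare_numeric(original, candidate, match=200, within=-25,
                    within_range=2, mismatch=-250):
    """Illustrative sketch; within_range plays the role of the param attribute."""
    if original is None or candidate is None:
        return 0
    first, second = int(original), int(candidate)
    if first == second:
        return match
    if abs(first - second) <= within_range:  # inclusive range is an assumption
        return within
    return mismatch


print(compare_numeric("1999", "1999"))
print(compare_numeric("1999", "2001"))
print(compare_numeric("1999", "2005"))
```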

    CDLMainEntryComparator

    This program compares the string value of the specified field (such as f11).
    <handler id="CDLMainEntry">
        <fieldID>f11</fieldID>
        <name>com.exlibris.primo.publish.platform.dedup.cdlimpl.CDLMainEntryComparator</name>
        <arguments>
             <argument name="match">+125</argument>
             <argument name="both_missing">+75</argument>
             <argument name="one_missing">+25</argument>
             <argument name="keywords_weight_factor" param="49">80</argument>
             <argument name="keywords_order_base_weight" param="49">10</argument>
             <argument name="mismatch">-200</argument>
        </arguments>
    </handler>
    A CDLMainEntryComparator Handler Example
    To compare two string values from the dedup records:
    1. If both of the values are missing, return the value of both_missing (+75). If a value is not specified, return 0.
    2. If one of the values is missing, return the value of one_missing (+25). If a value is not specified, return 0.
    3. If both values match, return the value of match (+125).
    4. Count the number of words that are equal in both entries and perform the following checks:
      • If more than 60% of the words are equal, take the ratio between the equal words and the number of words in the longest title and multiply it by the value of keywords_weight_factor (+80).
      • If the words that are common between the two titles are a substring of the short title, or vice versa, then add the value found in the previous check to the value of keywords_order_base_weight (+10).
    5. If no matches are found, return the value of mismatch (-200).

    Specific Programs

    This section describes the programs that perform comparisons on specific fields in the Dedup section of the PNX record.

    CDLIDSerialComparator

    This is a complex program that compares the record ID (usually the LCCN for MARC data sources) and the ISSN of a candidate record with the corresponding fields of the original record and assigns a point value based on the checks performed.
    <handler id="CDLID">
       <fieldID>f1,f2,f3,f4,f5</fieldID>
       <name>com.exlibris.primo.publish.platform.dedup.cdlimpl.CDLIDSerialComparator</name>
       <arguments>    
            <argument name="recID_match">+200</argument>
            <argument name="recID_recIDInvalid_match">+100</argument>
            <argument name="recIDInvalid_match">+50</argument>
            <argument name="recID_mismatch">-470</argument>
            <argument name="recID_recIDInvalid_mismatch">-50</argument>
            <argument name="ISSN_match">+200</argument>
            <argument name="ISSNInvalid_match">+50</argument>
            <argument name="ISSNCanceled_match">+10</argument>
            <argument name="ISSN_ISSNInvalid_match">+100</argument>
            <argument name="ISSN_ISSNCanceled_match">+50</argument>
            <argument name="ISSNInvalid_ISSNCanceled_match">+30</argument>
            <argument name="ISSN_ISSN_mismatch">-250</argument>
      </arguments>
    </handler>
    A CDLIDSerialComparator Handler Example
    To compare the IDs of the original record and the candidate record, this program performs the following checks on the f1, f2, f3, f4, and f5 fields of the Dedup section:
    1. The program performs the RECID comparisons listed in the following table.
      #  Original            Candidate           Return Value (# of Points)
      1  RECID (f1)          RECID (f1)          recID_match (+200)
      2  RECID (f1)          RECID_INVALID (f2)  recID_recIDInvalid_match (+100)
      3  RECID_INVALID (f2)  RECID (f1)          recID_recIDInvalid_match (+100)
      4  RECID_INVALID (f2)  RECID_INVALID (f2)  recIDInvalid_match (+50)
    2. If the program finds a match, it saves the corresponding value from the Return Value column and continues with Step 5 to check the ISSNs. Otherwise, the program continues with the next step.
    3. If the original RECID (f1) and the candidate RECID (f1) exist, the program saves the value recID_mismatch (-470) and continues with Step 5 to check the ISSNs. Otherwise, the program continues to the next step.
    4. If either of the following statements is true, the program saves the value recID_recIDInvalid_mismatch (-50) and continues with the next step to check the ISSNs.
      • The original RECID (f1) and the candidate RECID_INVALID (f2) exist.
      • The original RECID_INVALID (f2) and the candidate RECID (f1) exist.
      Otherwise, the program continues to the next step.
    5. The program performs the ISSN comparisons listed in the following table:
      Test  Original            Candidate           Return Value (# of Points)
      1     ISSN (f3)           ISSN (f3)           ISSN_match (+200)
      2     ISSN_INVALID (f4)   ISSN_INVALID (f4)   ISSNInvalid_match (+50)
      3     ISSN_CANCELED (f5)  ISSN_CANCELED (f5)  ISSNCanceled_match (+10)
      4     ISSN (f3)           ISSN_INVALID (f4)   ISSN_ISSNInvalid_match (+100)
      5     ISSN (f3)           ISSN_CANCELED (f5)  ISSN_ISSNCanceled_match (+50)
      6     ISSN_INVALID (f4)   ISSN_CANCELED (f5)  ISSNInvalid_ISSNCanceled_match (+30)
    6. If the program finds a match, it saves the corresponding value from the Return Value column and continues with Step 8. Otherwise, the program continues with the next step.
    7. If the original ISSN (f3) and the candidate ISSN (f3) exist, the program saves the value ISSN_ISSN_mismatch (-250).
    8. The program compares the return values from the RECID and ISSN checks and returns the one with the largest absolute value, that is, the highest value disregarding the sign of the number (for example, a return value of -650 is higher than +50).
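    The eight steps can be sketched in Python as two table-driven checks followed by a selection on absolute value. This is an illustration under several assumptions: records are plain dictionaries keyed by field name, each field holds a single value, the first matching table row wins, and a check that produces no result contributes 0. All function and variable names are hypothetical. The ISSN rows are applied only in the direction listed in the table.

```python
POINTS = {
    "recID_match": 200, "recID_recIDInvalid_match": 100, "recIDInvalid_match": 50,
    "recID_mismatch": -470, "recID_recIDInvalid_mismatch": -50,
    "ISSN_match": 200, "ISSNInvalid_match": 50, "ISSNCanceled_match": 10,
    "ISSN_ISSNInvalid_match": 100, "ISSN_ISSNCanceled_match": 50,
    "ISSNInvalid_ISSNCanceled_match": 30, "ISSN_ISSN_mismatch": -250,
}

# (original field, candidate field, return value name), in table order.
RECID_TESTS = [("f1", "f1", "recID_match"), ("f1", "f2", "recID_recIDInvalid_match"),
               ("f2", "f1", "recID_recIDInvalid_match"), ("f2", "f2", "recIDInvalid_match")]
ISSN_TESTS = [("f3", "f3", "ISSN_match"), ("f4", "f4", "ISSNInvalid_match"),
              ("f5", "f5", "ISSNCanceled_match"), ("f3", "f4", "ISSN_ISSNInvalid_match"),
              ("f3", "f5", "ISSN_ISSNCanceled_match"),
              ("f4", "f5", "ISSNInvalid_ISSNCanceled_match")]


def first_match(original, candidate, tests):
    """Return the points of the first table row whose two fields hold equal values."""
    for original_field, candidate_field, name in tests:
        value = original.get(original_field)
        if value is not None and value == candidate.get(candidate_field):
            return POINTS[name]
    return None


def exists(record, field):
    return record.get(field) is not None


def recid_score(original, candidate):
    score = first_match(original, candidate, RECID_TESTS)        # steps 1-2
    if score is not None:
        return score
    if exists(original, "f1") and exists(candidate, "f1"):       # step 3
        return POINTS["recID_mismatch"]
    if (exists(original, "f1") and exists(candidate, "f2")) or \
       (exists(original, "f2") and exists(candidate, "f1")):     # step 4
        return POINTS["recID_recIDInvalid_mismatch"]
    return 0  # assumption: nothing saved counts as 0


def issn_score(original, candidate):
    score = first_match(original, candidate, ISSN_TESTS)         # steps 5-6
    if score is not None:
        return score
    if exists(original, "f3") and exists(candidate, "f3"):       # step 7
        return POINTS["ISSN_ISSN_mismatch"]
    return 0  # assumption: nothing saved counts as 0


def cdl_id_serial_score(original, candidate):
    # Step 8: keep whichever result is larger in absolute value.
    return max(recid_score(original, candidate), issn_score(original, candidate), key=abs)


# Different record IDs but the same ISSN: the RECID mismatch outweighs the ISSN match.
print(cdl_id_serial_score({"f1": "A", "f3": "1234-5678"}, {"f1": "B", "f3": "1234-5678"}))
```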

    CDLIDComparator

    This is a complex program that compares the record ID (usually the LCCN for MARC data sources) and the ISBN of a candidate record with the corresponding fields of the original record and assigns a point value based on the checks performed.
    <handler id="CDLID">
      <fieldID>f1,f2,f3,f4</fieldID>
      <name>com.exlibris.primo.publish.platform.dedup.cdlimpl.CDLIDComparator</name>
        <arguments>
            <argument name="recID_match">+200</argument>
            <argument name="recID_recIDInvalid_match">+100</argument>
            <argument name="recIDInvalid_match">+50</argument>
            <argument name="recID_mismatch">-470</argument>
            <argument name="recID_recIDInvalid_mismatch">-50</argument>
            <argument name="ISBN_match">+85</argument>
            <argument name="ISBN_ISSN_match">+30</argument>
            <argument name="ISSN_ISSN_match">+10</argument>
            <argument name="ISSN_ISBN_mismatch">-225</argument>
      </arguments>
    </handler>
    A CDLIDComparator Handler Example
    To compare the IDs of the original record and the candidate record, the program performs the following checks on the f1, f2, f3, and f4 fields of the Dedup section:
    1. The program performs the RECID comparisons listed in the following table:
      #  Original            Candidate           Return Value (# of Points)
      1  RECID (f1)          RECID (f1)          recID_match (+200)
      2  RECID (f1)          RECID_INVALID (f2)  recID_recIDInvalid_match (+100)
      3  RECID_INVALID (f2)  RECID (f1)          recID_recIDInvalid_match (+100)
      4  RECID_INVALID (f2)  RECID_INVALID (f2)  recIDInvalid_match (+50)
    2. If the program finds a match, it saves the corresponding value from the Return Value column and continues with Step 5 to check the ISBNs. Otherwise, the program continues with the next step.
    3. If the original RECID (f1) and the candidate RECID (f1) exist, the program saves the value recID_mismatch (-470) and continues with Step 5 to check the ISBNs. Otherwise, the program continues to the next step.
    4. If either of the following statements is true, the program saves the value recID_recIDInvalid_mismatch (-50) and continues with the next step to check the ISBNs.
      • The original RECID (f1) and the candidate RECID_INVALID (f2) exist.
      • The original RECID_INVALID (f2) and the candidate RECID (f1) exist.
      Otherwise, the program continues to the next step.
    5. The program performs the ISBN comparisons listed in the following table:
      #  Original           Candidate          Return Value (# of Points)
      1  ISBN (f3)          ISBN (f3)          ISBN_match (+85)
      2  ISBN (f3)          ISSN_INVALID (f4)  ISBN_ISSN_match (+30)
      3  ISSN_INVALID (f4)  ISBN (f3)          ISBN_ISSN_match (+30)
      4  ISSN_INVALID (f4)  ISSN_INVALID (f4)  ISSN_ISSN_match (+10)
    6. If the program finds a match, it saves the corresponding value from the Return Value column and continues with Step 8. Otherwise, the program continues with the next step.
    7. If any of the following statements is true, the program saves the value ISSN_ISBN_mismatch (-225) and continues with the next step.
      • The original ISSN_INVALID (f4) and the candidate ISBN (f3) exist.
      • The original ISSN_INVALID (f4) and the candidate ISSN_INVALID (f4) exist.
      • The original ISBN (f3) and the candidate ISSN_INVALID (f4) exist.
      • The original ISBN (f3) and the candidate ISBN (f3) exist.
      Otherwise, the program continues with the next step.
    8. The program compares the return values from the RECID and ISBN checks and returns the one with the largest absolute value, that is, the highest value disregarding the sign of the number (for example, a return value of -470 is higher than +85).
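    The ISBN part of this program (steps 5 through 7) and the final selection in step 8 can be sketched as below. As before, this is an illustration only: records are assumed to be dictionaries with one value per field, the first matching table row wins, and a check that saves nothing contributes 0. The names isbn_score and pick_strongest are hypothetical. The four conditions in step 7 cover every pairing of f3 and f4, so the sketch reduces them to "each record has at least one of the two fields".

```python
POINTS = {"ISBN_match": 85, "ISBN_ISSN_match": 30,
          "ISSN_ISSN_match": 10, "ISSN_ISBN_mismatch": -225}

# (original field, candidate field, return value name), in table order.
ISBN_TESTS = [("f3", "f3", "ISBN_match"), ("f3", "f4", "ISBN_ISSN_match"),
              ("f4", "f3", "ISBN_ISSN_match"), ("f4", "f4", "ISSN_ISSN_match")]


def isbn_score(original, candidate):
    # Steps 5-6: the first row whose two fields hold equal values decides the score.
    for original_field, candidate_field, name in ISBN_TESTS:
        value = original.get(original_field)
        if value is not None and value == candidate.get(candidate_field):
            return POINTS[name]
    # Step 7: both records carry f3 or f4, but no pairing agreed.
    original_has_id = any(original.get(f) is not None for f in ("f3", "f4"))
    candidate_has_id = any(candidate.get(f) is not None for f in ("f3", "f4"))
    if original_has_id and candidate_has_id:
        return POINTS["ISSN_ISBN_mismatch"]
    return 0  # assumption: nothing saved counts as 0


def pick_strongest(recid_points, isbn_points):
    # Step 8: the value that is larger in absolute value wins.
    return max(recid_points, isbn_points, key=abs)


print(isbn_score({"f3": "9780000000001"}, {"f3": "9780000000002"}))
print(pick_strongest(-470, 85))
```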

    CDLTitleSerialComparator

    This program compares the full title for serials (f7 - f8), as follows:
    1. If the f7 fields from the original and candidate records are equal, perform the following checks. Otherwise, continue with the next step.
      • If a word is from the common word list (see CDLSeCommonTitleList.xml File), return a value of full_common_match (+135).
      • If a word is not part of the common word list, return a value of full_match (+600).
    2. If the f8 fields from the original and candidate records are equal, perform the following checks. Otherwise, continue with the next step.
      • If a word is in the common word list (see CDLSeCommonTitleList.xml File), return a value of full_truncated_common_match (+135).
      • If a word is not in the common word list, return a value of full_truncated_match (+175).
    3. If any words are common in both titles, perform the following checks. Otherwise, return a value of mismatch (-600).
      • If more than half of the words are common, divide the number of common words by the number of words in the longest title and then multiply it by the value of keywords_weight_factor (+75).
      • If any of the common words are a substring of the short title, or vice versa, return the sum of the previous value and the value of keywords_order_base_weight (+50).
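    A possible reading of these three steps is sketched below. Several points are assumptions, since the description does not settle them: the common-list check is applied to the whole title, COMMON_TITLES is a made-up stand-in for the contents of CDLSeCommonTitleList.xml, "more than half" is measured against the longer title, the order bonus requires the common words to appear contiguously in the shorter title, and titles that share some words but not more than half score 0. All names are hypothetical.

```python
COMMON_TITLES = {"bulletin", "annual report"}  # hypothetical stand-in for the common list

POINTS = {"full_common_match": 135, "full_match": 600,
          "full_truncated_common_match": 135, "full_truncated_match": 175,
          "keywords_weight_factor": 75, "keywords_order_base_weight": 50,
          "mismatch": -600}


def keyword_score(title_a, title_b):
    """Step 3: score the word overlap between two titles."""
    words_a, words_b = title_a.split(), title_b.split()
    longer, shorter = (words_a, words_b) if len(words_a) >= len(words_b) else (words_b, words_a)
    common = [word for word in longer if word in shorter]
    if not common:
        return POINTS["mismatch"]
    if len(common) * 2 <= len(longer):
        return 0  # assumption: this case is not specified in the description
    score = POINTS["keywords_weight_factor"] * len(common) / len(longer)
    common_phrase, shorter_phrase = " ".join(common), " ".join(shorter)
    if common_phrase in shorter_phrase or shorter_phrase in common_phrase:
        score += POINTS["keywords_order_base_weight"]
    return score


def title_serial_score(original, candidate):
    if original["f7"] == candidate["f7"]:                        # step 1
        common = original["f7"] in COMMON_TITLES
        return POINTS["full_common_match" if common else "full_match"]
    if original["f8"] == candidate["f8"]:                        # step 2
        common = original["f8"] in COMMON_TITLES
        return POINTS["full_truncated_common_match" if common else "full_truncated_match"]
    return keyword_score(original["f7"], candidate["f7"])        # step 3


print(title_serial_score({"f7": "annals of botany london", "f8": "annals of botany lon"},
                         {"f7": "annals of botany", "f8": "annals of botany"}))
```

    Here three of the four words in the longer title are shared, so the overlap gives 75 × 3/4 = 56.25 points, and the order bonus adds 50.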

    CDLTitleComparator

    This program compares the long title of the records for non-serials (f7), as follows:
    1. If the titles are equal, perform the following checks. Otherwise, go to the next step.
      • If the length of the title is less than nine characters, return a value of 0.
      • Otherwise, return a value of match (+600).
    2. If one title is a substring of the other title, return a value of within (+350). Otherwise, continue with the next step.
    3. If any words are common in both titles, perform the following checks. Otherwise, return a value of mismatch (+350).
      • If more than half of the words are common, divide the number of common words by the number of words in the longest title and then multiply it by the value of keywords_weight_factor (+450).
      • If any of the common words are a substring of the short title, or vice versa, return the sum of the previous value and the value of keywords_order_base_weight (+50).
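    The first two steps of this program, including the short-title exception, can be sketched as follows. The function name title_exact_or_within is hypothetical, and returning None to signal "continue with the word-overlap check in step 3" is a convention chosen for this example only.

```python
def title_exact_or_within(original, candidate, match=600, within=350, min_length=9):
    """Illustrative sketch of steps 1 and 2; None means 'continue with step 3'."""
    if original == candidate:
        # Identical titles shorter than nine characters carry no weight.
        return 0 if len(original) < min_length else match
    if original in candidate or candidate in original:
        return within
    return None


print(title_exact_or_within("hamlet", "hamlet"))
print(title_exact_or_within("war and peace", "war and peace"))
print(title_exact_or_within("war and peace", "war and peace a novel"))
```

    The short-title rule keeps very short identical titles from contributing the large match value on their own.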

    CDLDateSerialComparator

    This program compares the year (f6) of the original and candidate records, as follows:
    1. If the year does not exist for either of the records, return a value of 0. Otherwise, continue with the next step.
    2. If the year is the same for both records, return a value of match (+225). Otherwise, continue with the next step.
    3. If the difference between the year values from both records is at most 1, return a value of within1 (+50). Otherwise, continue with the next step.
    4. If the difference between the year values from both records is at most 2, return a value of within2 (+25). Otherwise, continue with the next step.
    5. If the year values from both records are from the same decade and either of the year values ends with a 0, return a value of last_digit_zero (+20). Otherwise, return a value of mismatch (-150).
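    The five steps map directly onto a chain of checks, sketched here in Python. The function name date_serial_score is hypothetical, and the sketch assumes that step 1 applies when the year is missing from at least one of the two records.

```python
def date_serial_score(original, candidate, match=225, within1=50, within2=25,
                      last_digit_zero=20, mismatch=-150):
    """Illustrative sketch of the five-step year comparison."""
    if original is None or candidate is None:  # assumption: either year missing
        return 0
    first, second = int(original), int(candidate)
    difference = abs(first - second)
    if difference == 0:
        return match
    if difference <= 1:
        return within1
    if difference <= 2:
        return within2
    same_decade = first // 10 == second // 10
    if same_decade and (first % 10 == 0 or second % 10 == 0):
        return last_digit_zero
    return mismatch


print(date_serial_score(1990, 1997))
print(date_serial_score(1991, 1997))
```

    The last_digit_zero case presumably covers records whose year was rounded to the start of a decade, which is why 1990 and 1997 still earn a small positive score while 1991 and 1997 do not.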

    CDLPageHandlerComparator

    This program compares the pagination of the original and candidate records by extracting the highest number within the f9 string.
    1. If one of the values is null, return a value of 0. Otherwise, continue to the next step.
    2. If both values match, perform the following checks. Otherwise, continue to the next step.
      • If both values are greater than 10, return a value of matchgt (+100).
      • If both values are less than 10, return a value of matchlt (+100).
    3. If the difference between the two numbers is less than 10, perform the following checks. Otherwise, return a value of mismatch (-225).
      • If both values are greater than 10, return a value of withingt (+50).
      • If both values are less than 10, return a value of withinlt (+20).
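    A Python sketch of the pagination comparison follows. It assumes the highest number is found by scanning the string for runs of digits, that a string with no digits is treated like a null value, and that cases the description does not cover (a value of exactly 10, or one value above and one below 10) score 0. The names highest_number and page_score are hypothetical.

```python
import re


def highest_number(pagination):
    """Return the largest run of digits in the string as an int, or None if there is none."""
    if pagination is None:
        return None
    numbers = [int(digits) for digits in re.findall(r"\d+", pagination)]
    return max(numbers) if numbers else None


def page_score(original, candidate, matchgt=100, matchlt=100,
               withingt=50, withinlt=20, mismatch=-225):
    first, second = highest_number(original), highest_number(candidate)
    if first is None or second is None:
        return 0
    if first == second:
        greater_points, less_points = matchgt, matchlt
    elif abs(first - second) < 10:
        greater_points, less_points = withingt, withinlt
    else:
        return mismatch
    if first > 10 and second > 10:
        return greater_points
    if first < 10 and second < 10:
        return less_points
    return 0  # assumption: cases not covered by the description


print(page_score("xii, 250 p.", "250 pages"))
print(page_score("250 p.", "255 p."))
print(page_score("250 p.", "400 p."))
```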