
    Structure of the XML File

    If you are working with Primo VE and not Primo, see Understanding the Dedup and FRBR Processes (Primo VE).

    Primo uses the following elements to determine match criteria and thresholds:
    • Handlers—these are the fields or group of fields that the algorithm uses for matching. The handler includes the program ("class"), which compares the fields and calculates match points by adding and subtracting points based on the comparisons.
    • Thresholds—this section defines the matching stages. The standard algorithm includes two thresholds: quick match and full match.
    • Steps—this section specifies the stages of the match.
    • Common title list—this section defines a list of additional files used by the algorithm to determine matches.

    Handlers

    The <step type="handler"> element (see the following figure) defines each matching program to run. The handlers have the following elements:
    • <handler id> – the handler ID. This ID is used in the steps section.
    • <fieldID> – the field or group of fields from the PNX dedup section. Separate multiple fields with commas.
    • <name> – the name of the program used to match the fields. The programs are explained below.
    • <arguments> – the parameters of the specified program. These parameters also add or subtract match points.

    <handlers>
      <handler id="CDLID">
        <fieldID>f1,f2,f3,f4</fieldID>
        <name>com.exlibris.primo.publish.platform.dedup.cdlimpl.CDLIDComparator</name>
        <arguments>
            <argument name="recID_match">+200</argument>
            <argument name="recID_recIDInvalid_match">+100</argument>
            <argument name="recIDInvalid_match">+50</argument>
            <argument name="recID_mismatch">-470</argument>
            <argument name="recID_recIDInvalid_mismatch">-50</argument>
            <argument name="ISBN_match">+85</argument>
            <argument name="ISBN_ISSN_match">+30</argument>
            <argument name="ISSN_ISSN_match">+10</argument>
            <argument name="ISSN_ISBN_mismatch">-225</argument>
        </arguments>
      </handler>

             .
             .
             .
    </handlers>

    Example handlers Element
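As an illustration only, the following Python sketch reads a handler definition like the one above with the standard library's xml.etree.ElementTree and collects its field list and point values. The helper name parse_handler and the layout of the returned dictionary are assumptions made for this example; they are not part of Primo.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example handler above (three of its arguments).
HANDLER_XML = """
<handlers>
  <handler id="CDLID">
    <fieldID>f1,f2,f3,f4</fieldID>
    <name>com.exlibris.primo.publish.platform.dedup.cdlimpl.CDLIDComparator</name>
    <arguments>
      <argument name="recID_match">+200</argument>
      <argument name="recID_mismatch">-470</argument>
      <argument name="ISBN_match">+85</argument>
    </arguments>
  </handler>
</handlers>
"""

def parse_handler(handler):
    """Hypothetical helper: turn one <handler> element into a plain dict."""
    return {
        "id": handler.get("id"),
        # Multiple fields are comma-separated inside <fieldID>.
        "fields": [f.strip() for f in handler.findtext("fieldID").split(",")],
        # Strip whitespace in case the class name wraps onto its own line.
        "class": handler.findtext("name").strip(),
        # int() accepts a leading "+" or "-" sign.
        "points": {a.get("name"): int(a.text) for a in handler.iter("argument")},
    }

root = ET.fromstring(HANDLER_XML)
handlers = {h.get("id"): parse_handler(h) for h in root.findall("handler")}
print(handlers["CDLID"]["fields"])
print(handlers["CDLID"]["points"]["recID_mismatch"])
```

Because element names in XML are case-sensitive, the lookup uses fieldID exactly as it is spelled in the example file.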

    Thresholds

    The <step type="threshold"> element (see the following figure) defines the threshold stages, which the Dedup algorithm uses to determine whether two records are duplicates. The <threshold> element defines each stage. Points, which are defined by the handlers, are transferred to subsequent stages, where they are added or subtracted to calculate a total point value.
    The point values are compared to the values that are stored in the following elements:
    • <upper_threshold> – records with point totals that meet or exceed this value are considered duplicates, and processing stops.
    • <lower_threshold> – records with point totals at or below this value are not considered duplicates, and processing stops.
    If the total falls between the upper and lower thresholds, processing continues to the next stage.
    Every threshold requires an upper threshold, which is the point total that records must reach to be considered duplicates. Any threshold other than the last can also have a lower threshold. At each stage, one of three outcomes applies:
    • If the total meets or exceeds the upper threshold, the records are considered duplicates and matching stops.
    • If the total is equal to or less than the lower threshold, the system does not go on to the next stage and the records are not considered duplicates.
    • If the total is between the lower and upper thresholds, the system continues to the next stage.
    The upper and lower values can differ between the first, second, and any additional thresholds.
    <thresholds>
        <threshold id="tr1">
            <upper_threshold>+850</upper_threshold>
            <lower_threshold>0</lower_threshold>
        </threshold>
        <threshold id="tr2">
            <upper_threshold>+875</upper_threshold>
        </threshold>
    </thresholds>
    Example thresholds Element
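The staged decision can be sketched in Python as follows. This is a minimal model, not Primo's implementation: the function name decide, the per-stage point totals passed to it, and the "not duplicate" result when the last upper threshold is not reached are all assumptions for illustration. The threshold values come from the example above, and the lower-threshold rule follows the description that a total equal to or less than the lower threshold ends matching.

```python
# Stage definitions taken from the example above; tr2 has no lower threshold.
STAGES = [
    {"id": "tr1", "upper": 850, "lower": 0},
    {"id": "tr2", "upper": 875, "lower": None},
]

def decide(stage_points, stages=STAGES):
    """Return (verdict, stage id) for a pair of records.

    stage_points[i] is the net number of points the handlers contribute
    in stage i; the running total carries over from one stage to the next.
    """
    total = 0
    for stage, points in zip(stages, stage_points):
        total += points
        if total >= stage["upper"]:
            return "duplicate", stage["id"]
        if stage["lower"] is not None and total <= stage["lower"]:
            return "not duplicate", stage["id"]
    # Assumption: falling short of the last upper threshold means no match.
    return "not duplicate", stages[-1]["id"]

# Hypothetical point totals for four record pairs.
print(decide([900, 0]))    # upper threshold reached in the first stage
print(decide([-50, 0]))    # at or below the lower threshold in the first stage
print(decide([400, 500]))  # continues, then reaches 900 in the second stage
print(decide([400, 100]))  # continues, but only reaches 500 in the second stage
```

In this model the second stage is evaluated only for pairs whose first-stage total lies strictly between 0 and 850.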

    Steps

    The <steps> element lists the handlers and thresholds that the program will use to compare two records. This list defines the order in which the matching programs are run and the threshold stages.
    For example, the CDLSeMatchingProfile.xml file (see the following figure) has two thresholds that represent the two stages: quick match and full match. In the quick match stage, the program matches the Universal ID/ISSNs and the full title. If a full match is necessary, additional fields are checked.
    <steps>
        <step type="handler">CDLID</step>
        <step type="handler">CDLShortTitle</step>
        <step type="handler">CDLDate</step>
        <step type="threshold">tr1</step>
        <step type="handler">CDLSubShortTitle</step>
        <step type="handler">CDLLongTitle</step>
        <step type="handler">CDLCountryOfPub</step>
        <step type="handler">CDLPagination</step>
        <step type="handler">CDLPublisher</step>
        <step type="handler">CDLMainEntry</step>
        <step type="handler">PhysicalFormat</step>
        <step type="handler">Edition</step>
        <step type="threshold">tr2</step>    
    </steps>
    Example steps Element
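To make the ordering concrete, the sketch below groups the handler steps under the threshold step that closes each stage. It uses Python's standard xml.etree.ElementTree; the helper name group_into_stages is hypothetical, and the XML is an abbreviated copy of the example above.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example <steps> element above.
STEPS_XML = """
<steps>
  <step type="handler">CDLID</step>
  <step type="handler">CDLShortTitle</step>
  <step type="handler">CDLDate</step>
  <step type="threshold">tr1</step>
  <step type="handler">CDLSubShortTitle</step>
  <step type="handler">CDLLongTitle</step>
  <step type="threshold">tr2</step>
</steps>
"""

def group_into_stages(steps):
    """Hypothetical helper: list (threshold id, [handler ids]) in run order."""
    stages, pending = [], []
    for step in steps.findall("step"):
        if step.get("type") == "handler":
            # Handlers accumulate until the next threshold step.
            pending.append(step.text.strip())
        elif step.get("type") == "threshold":
            # A threshold step closes the current stage.
            stages.append((step.text.strip(), pending))
            pending = []
    return stages

stages = group_into_stages(ET.fromstring(STEPS_XML))
for threshold_id, handler_ids in stages:
    print(threshold_id, handler_ids)
```

Read this way, the handlers listed before tr1 make up the quick match stage, and the handlers between tr1 and tr2 run only if a full match is needed.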

    Common Title List

    The <common_title_list> element defines a list of additional files used by the algorithm to determine matches. The serial XML has such a section for the common title list file (see CDLSeCommonTitleList.txt).
    <common_title_list>
            <file_name>CDLSeCommonTitleList.txt</file_name>
    </common_title_list>
    Example common_title_list Element
    For information on configuring local titles, see ClientCommonTitles.txt.