  • Ex Libris Knowledge Center

    Smart Expansion via CSV or Excel

    Overview

    Smart Expansion via CSV/Excel (SE via CSV/Excel) accepts a list of publications that are known to belong to the researcher, together with basic metadata and the researcher ID, and tries to fetch them from CDI. The records are normalized into Esploro assets, and author matching is performed.
    For an overview of the flow and how it works, see the How it Works section.

    Configuring Smart Expansion via CSV/Excel

    Configuring Smart Expansion for CSV/Excel requires preparing the spreadsheet and updating the profile, as described below.
    To Configure Smart Expansion via CSV/Excel
    1. Prepare the spreadsheet as described in CSV/Excel File Format.
    2. Go to Repository > Manage Profiles.
    3. Select the Comma Separated Values (CSV) Excel File profile. The Smart Expansion Profiles page displays.
    4. Leave the Active checkbox selected.
    5. In the Asset Approval section, select whether the asset is automatically approved or requires manual intervention.
      • Never automatically approve the asset: The asset is never automatically approved. Operators must approve the asset before it displays in the portal and in researcher profiles (same as for a manual deposit).
      • Always automatically approve the asset: The asset is always approved and displays in the portal upon load. It displays in the active researcher's profile, and in affiliated co-author profiles, once the author-researcher match is approved.
      • Conditionally approve the asset: The asset is approved only if the specified conditions are met. The following conditions are possible:
        • If asset type is any of: Multiple asset types can be selected.

          AND/OR
        • If asset has a DOI or PMID 

        If the researcher cannot be found among the asset authors, the researcher is added with the status "Added by System".

        Author matching tasks are created for all asset authors. Assets that are pending approval can be approved in the task list at Repository > Author Matching Approval Task List and on the Asset Approval page at Repository > Smart Harvesting > Asset Approval (see here for more information). In addition, these assets can be accessed via the Smart Expansion via CSV – Asset approval task in the Tasks Widget (see Managing Widgets).
    6. Select Update "Last Smart Harvest Date" to have Smart Harvesting run in ongoing mode. It is recommended to select this option unless the input file includes only a partial list of the researcher's publications. Note that this option is selected out of the box.
      Update last Smart Harvest Date in the General Details section.
    7. The AUTHOR MATCHING APPROVAL CONFIGURATION options allow you to set the level of certainty at which assets will be automatically added to the researcher profile.
      Author matching approval.
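    The conditional-approval logic in step 5 can be pictured with a short Python sketch. This is not Esploro code; the function name, parameters, and asset fields are hypothetical, and the profile's AND/OR toggle is modeled as a plain argument.

```python
def should_auto_approve(asset, approved_types, require_identifier, combine="AND"):
    """Hypothetical evaluation of the "Conditionally approve" rules.

    approved_types     - asset types selected in the profile
    require_identifier - whether the "has a DOI or PMID" condition is enabled
    combine            - "AND" or "OR", matching the profile's AND/OR toggle
    """
    type_ok = asset["type"] in approved_types if approved_types else True
    id_ok = bool(asset.get("doi") or asset.get("pmid")) if require_identifier else True
    return (type_ok and id_ok) if combine == "AND" else (type_ok or id_ok)
```

    For example, with both conditions enabled and joined by AND, an article with a DOI is approved, while a book with no identifier is not.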
       

    Running Smart Expansion via CSV/Excel

    To run Smart Expansion via CSV or Excel
    1. Upload the file in the File field in the Run section.
    2. (New for September) Select the Notify Researchers checkbox to notify researchers about new assets added to their profiles. Note that this option is only enabled when the NEW ASSETS ADDED TO RESEARCHER PROFILE NOTIFICATION job is active. See Letter for New Assets Added.
    3. Select Run Now. A message displays to let you know that the job started.
    4. You can view the jobs at Admin > Monitor jobs.
    5. Select the Scheduled tab to view scheduled jobs, SP Scheduled to view jobs scheduled for a release, Running to view jobs in progress, and History to view previously run jobs.
    6. You can also view the list of Smart Harvesting jobs by going to Repository > Manage Profiles and selecting the Monitor Captures button.
    For better results, it is recommended to run Smart Expansion before Smart Harvesting.

    CSV/Excel File Format

    The following table describes the required format for the spreadsheet.
    There is a limit of 50,000 rows for the spreadsheet.

    Column Name       | Description   | Mandatory                       | Notes
    ------------------|---------------|---------------------------------|------------------------
    USERID or ALMAID  | Researcher ID | Y                               | ID in the user platform
    TITLE             | Title         | Y unless there is a DOI or PMID |
    DOI               | DOI           | Y unless there is a title       |
    PMID              | PMID          | Y unless there is a title       |
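    A minimal Python sketch of producing a valid input file with the standard csv module. The column names come from the table above; the researcher ID, titles, and DOI value are made-up examples.

```python
import csv

# Made-up example rows; each row needs a researcher ID plus
# either a TITLE or a DOI/PMID (per the table above).
rows = [
    {"USERID": "jdoe", "TITLE": "A Study of X", "DOI": "", "PMID": ""},
    {"USERID": "jdoe", "TITLE": "", "DOI": "10.1000/example.123", "PMID": ""},
]

with open("smart_expansion_input.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["USERID", "TITLE", "DOI", "PMID"])
    writer.writeheader()
    for row in rows:
        # Enforce the mandatory-field rule before writing.
        if not (row["TITLE"] or row["DOI"] or row["PMID"]):
            raise ValueError("Row needs a TITLE, DOI, or PMID")
        writer.writerow(row)
```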

    How it Works

    Workflow

    The following describes the main workflow for Smart Expansion via CSV/Excel. 

    1. Match Assets: Match the assets in the input file to each other to remove duplicates. Duplicates can occur because multiple affiliated researchers exist for the same asset, each with their own row.
    2. Fetch Records: Fetch records from CDI and deduplicate the results. This is needed because CDI may have multiple records for the same asset. It can also happen because the deduplication in stage 1 failed due to lack of data.
    3. Author Matching: The "active" researcher is automatically approved. If for some reason the job cannot find the active researcher in the list of asset authors, the researcher is added by the system. The author match in this case is "Added by System" and considered approved. Matching approval for Co-authors is per the configuration in the profile.
    4. Update Last Smart Harvesting Date: If enabled in the profile configuration, the "Last Smart Harvesting Date" is updated to the beginning of the latest publication year among the assets loaded for the researcher. For example, if the latest year is 2020, the date is set to 20200101. This ensures that researchers are included in ongoing Smart Harvesting.
    5. Run Jobs: The merge non-affiliated researchers and delete redundant non-affiliated researchers (i.e., non-affiliated researchers with no assets) jobs are run. See also Working with Smart Harvesting.
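    Step 4 above boils down to a simple date computation; a minimal sketch (the function name is mine, not Esploro's):

```python
def last_smart_harvest_date(pub_years):
    # Per step 4: take the latest publication year among the researcher's
    # loaded assets and use January 1 of that year, formatted as YYYYMMDD.
    return f"{max(pub_years)}0101"
```

    For example, for assets published in 2018, 2019, and 2020, the date is set to 20200101.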

    How the Data is Used by the Job

    The metadata included in the spreadsheet is used for several purposes:
    1. Deduping rows – Since a row is added per researcher and researchers collaborate, the job tries to match duplicate rows and process them together. This is done first based on the data in the file, and then again based on the data from the records fetched from CDI (in case the file data is insufficient).
    2. Fetching records from CDI.
    3. Future use – validation of the input. This can be important if the quality of the source data is low.
    The sections below provide more details on how the metadata is used.

    Deduping Records in the CSV 

    The records (each row) are deduped based on the following keys:

    • DOI + PUBYEAR or year from PUBDATE
    • PMID + PUBYEAR or year from PUBDATE
    • Normalized title + PUBYEAR or year from PUBDATE
    • Normalized title (as long as it is > 4 words) + normalized publication title
    • Normalized title (as long as it is > 4 words) + ISSN
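    A rough Python sketch of how these keys could be built per row. The normalization shown is a guess (the exact normalization Esploro applies is not documented here), and the PUBLICATION_TITLE and ISSN field names are illustrative.

```python
import re

def normalize(title):
    # Hypothetical normalization: lowercase and strip non-alphanumerics.
    return re.sub(r"[^a-z0-9 ]", "", (title or "").lower()).strip()

def dedup_keys(row):
    """Build the candidate dedup keys listed above for one spreadsheet row."""
    keys = []
    # Year comes from PUBYEAR, or from the year part of PUBDATE.
    year = row.get("PUBYEAR") or (row.get("PUBDATE") or "")[:4]
    title = normalize(row.get("TITLE"))
    if row.get("DOI") and year:
        keys.append(("doi", row["DOI"], year))
    if row.get("PMID") and year:
        keys.append(("pmid", row["PMID"], year))
    if title and year:
        keys.append(("title-year", title, year))
    if len(title.split()) > 4:  # title must be longer than 4 words
        if row.get("PUBLICATION_TITLE"):
            keys.append(("title-journal", title, normalize(row["PUBLICATION_TITLE"])))
        if row.get("ISSN"):
            keys.append(("title-issn", title, row["ISSN"]))
    return keys
```

    Two rows are treated as the same record when any of their keys coincide.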

    Fetching Records from CDI

    If the record has a DOI or PMID the identifier is used. If there is no DOI or PMID, the query to CDI is based on the title and the researcher name.
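    This fallback can be sketched as a small Python function. The query structure and field names are illustrative only; CDI's actual query API is not shown here.

```python
def build_cdi_query(row):
    # Prefer an identifier-based lookup, as described above; otherwise
    # fall back to a title + researcher-name search.
    # (Field names are hypothetical, not CDI's real API.)
    if row.get("DOI"):
        return {"type": "doi", "value": row["DOI"]}
    if row.get("PMID"):
        return {"type": "pmid", "value": row["PMID"]}
    return {"type": "title-author",
            "title": row["TITLE"],
            "author": row["researcher_name"]}
```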

    Job Report 

    The report includes the following events:

    • Smart Expansion ran for researcher (N) – the number and list of valid researchers included in the CSV, i.e., those for whom a matching ID was found.
    • Records dedup – Stage 1 (N) – number of records deduped within the CSV. The event includes the row numbers.
    • Records dedup – Stage 2 (N) – number of records deduped after records are fetched from CDI. The event includes the row number with the asset ID created from one of the rows.
    • Records not found (N) – number and list of records that were not found in CDI. The event includes metadata from the row to identify the record (row, title, DOI). Duplicate rows are included in brackets.
    • For future use, when validation is added: the number and list of records that were found in CDI but could not be validated. The event includes metadata from the row (row, title, DOI). Duplicate rows are included in brackets.
    • Records not imported – matched with asset/provisional asset: number, list, and reason of records that matched an existing asset/task.
    • Skipped records – too many authors (N)
    • Invalid rows (N) – number and list of rows with invalid data: a researcher could not be found for the ID, or the row could not be validated.
    • Total assets imported (N) – number of assets imported. The event includes the asset ID and title.
    • Automatically approved author matches (N) – number of automatically approved author-researcher matches, excluding active researchers.
    • Assets with missing researcher/added by system
    • New non-affiliated researchers created (N) – number and list
    • General error (N)
      • Python is down
      • Unexpected errors