Author Matching Algorithm
Esploro has developed a sophisticated algorithm using machine-learning methodologies to match authors. The algorithm was developed and is being continuously improved by a dedicated team of data scientists.
All types of Smart Harvesting and Smart Expansion make use of the Author Matching (AM) algorithm. The process first tries to match on identifiers or email addresses. Because these are often not available in the data, the AM algorithm is then used.
The following sections describe the data used by the algorithm, some of its key features, and the ranking it produces.
Data
The algorithm uses the following data from the Researcher record:
- Name and name variants
- Affiliations
- Research topics
- Area of expertise
- Biographical info, e.g. education, honors
- Metadata from the most recent assets that have already been associated with the researcher:
  - Title
  - Authors (to find co-authors)
  - Subjects
  - Abstract
  - Year
  - Journal title
The algorithm uses the following data from the candidate assets:
- Author names
- Title
- Subjects
- Abstract
- Year
- Journal title
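The two data sets listed above can be pictured as simple records. The following is a minimal sketch only: the class and field names are hypothetical and do not reflect Esploro's internal data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class AssetMetadata:
    """Metadata read from an asset (hypothetical field names)."""
    title: str
    authors: List[str]
    subjects: List[str] = field(default_factory=list)
    abstract: str = ""
    year: Optional[int] = None
    journal_title: str = ""


@dataclass
class ResearcherProfile:
    """Data read from the Researcher record (hypothetical field names)."""
    name: str
    name_variants: List[str] = field(default_factory=list)
    affiliations: List[str] = field(default_factory=list)
    research_topics: List[str] = field(default_factory=list)
    areas_of_expertise: List[str] = field(default_factory=list)
    biography: str = ""
    # Most recent assets already associated with the researcher.
    recent_assets: List[AssetMetadata] = field(default_factory=list)

    def known_coauthors(self) -> Set[str]:
        """Authors of the researcher's assets, excluding the researcher."""
        own = {self.name.lower(), *(v.lower() for v in self.name_variants)}
        return {
            author
            for asset in self.recent_assets
            for author in asset.authors
            if author.lower() not in own
        }
```

The `known_coauthors` helper illustrates why the author list of existing assets is collected: it is the source of the researcher's co-authors.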
Features
The algorithm uses multiple "features". A feature is an individual measurable property or characteristic of the data. This section outlines the main features in use.
Name Features
The algorithm uses several features that match names by considering name similarity, name variants, and name frequency.
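As an illustration of how such name features might be computed, the sketch below compares an asset author name against a researcher's name and variants using the standard library's `difflib`, and derives a simple weight from how common a name is. The function names, the normalization rules, and the weighting formula are assumptions made for this example, not the features Esploro actually uses.

```python
from difflib import SequenceMatcher
from typing import Iterable


def normalize(name: str) -> str:
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def best_name_similarity(author_name: str, researcher_names: Iterable[str]) -> float:
    """Highest similarity (0.0 to 1.0) between the author name and any
    of the researcher's names or name variants."""
    target = normalize(author_name)
    return max(
        (SequenceMatcher(None, target, normalize(n)).ratio() for n in researcher_names),
        default=0.0,
    )


def frequency_weight(name_count: int, total_names: int) -> float:
    """Hypothetical weight: rare names are stronger evidence than common ones."""
    if total_names <= 0:
        return 0.0
    return 1.0 - (name_count / total_names)
```

For example, `best_name_similarity("J. Smith", ["Jane Smith", "J Smith"])` finds an exact match against the variant "J Smith" once punctuation is normalized.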
Embedded Text Features
This group of features extracts concepts and subject entities from the texts available in both the researcher and the asset metadata, to help determine how close the author of the asset is to the researcher in terms of subject area. This is done by grouping words semantically and creating text vectors.
Semantic Features
This group of features uses state-of-the-art Natural Language Processing algorithms to determine how close the textual data of the researcher is to that of the candidate asset. We train a neural network on over 100 million abstracts from our Central Discovery Index to create "word embeddings". These are vector representations of all words appearing in this corpus, arranged so that similar words are grouped together. These vector representations, in turn, allow us to calculate a "distance" between words and between texts in general.
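To make the idea of a "distance" between texts concrete, the sketch below averages word vectors into a text vector and compares two texts with cosine similarity. The three-dimensional `TOY_EMBEDDINGS` table is invented for this example; real embeddings are learned from a large corpus and have many more dimensions. Averaging and cosine similarity are common choices and are assumed here, since the specific method used by the algorithm is not described.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]

# Invented toy vectors; real embeddings are learned, not hand-written.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "protein": [0.9, 0.1, 0.0],
    "enzyme": [0.8, 0.2, 0.1],
    "galaxy": [0.0, 0.1, 0.9],
    "telescope": [0.1, 0.0, 0.8],
}


def text_vector(text: str, embeddings: Dict[str, Vector]) -> Optional[Vector]:
    """Average the vectors of the known words in a text."""
    words = [w for w in text.lower().split() if w in embeddings]
    if not words:
        return None
    dim = len(embeddings[words[0]])
    return [sum(embeddings[w][i] for w in words) / len(words) for i in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


researcher_vec = text_vector("protein enzyme", TOY_EMBEDDINGS)
close_asset_vec = text_vector("enzyme", TOY_EMBEDDINGS)
far_asset_vec = text_vector("galaxy telescope", TOY_EMBEDDINGS)
```

With these toy vectors, the researcher text scores much higher against the asset about enzymes than against the asset about galaxies and telescopes.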
Additional Features
The algorithm also makes use of the following features:
- Matching affiliations of author and researcher – this matching is based on the affiliation Esploro has for the researcher and any affiliation for the author in the record. In addition to specific affiliations, the countries of affiliations are taken into account.
- Co-author network – researchers tend to collaborate with each other. Esploro has created a network of co-authors for the assets that have already been associated with the researcher.
- Date matching – the algorithm matches the date of the asset with known dates for the researcher.
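The three additional features above could be expressed along the following lines. This is a sketch under stated assumptions: the function names, the score values (1.0 for a shared institution, 0.5 for a shared country only), and the two-year tolerance are hypothetical and chosen only to illustrate the ideas.

```python
from typing import List, Optional, Set, Tuple

Affiliation = Tuple[str, str]  # (institution, country), an assumed shape


def _norm(value: str) -> str:
    """Lowercase and collapse whitespace for a tolerant comparison."""
    return " ".join(value.lower().split())


def affiliation_signal(author_affs: List[Affiliation],
                       researcher_affs: List[Affiliation]) -> float:
    """1.0 if an institution is shared, 0.5 if only a country is shared."""
    if {_norm(i) for i, _ in author_affs} & {_norm(i) for i, _ in researcher_affs}:
        return 1.0
    if {_norm(c) for _, c in author_affs} & {_norm(c) for _, c in researcher_affs}:
        return 0.5
    return 0.0


def coauthor_overlap(asset_authors: List[str], known_coauthors: Set[str]) -> float:
    """Fraction of the asset's authors who are already known co-authors."""
    authors = {_norm(a) for a in asset_authors}
    if not authors:
        return 0.0
    known = {_norm(k) for k in known_coauthors}
    return len(authors & known) / len(authors)


def year_is_plausible(asset_year: Optional[int], known_years: List[int],
                      slack: int = 2) -> bool:
    """True if the asset year falls within the researcher's known years,
    extended by `slack` years on either side (assumed tolerance)."""
    if asset_year is None or not known_years:
        return False
    return min(known_years) - slack <= asset_year <= max(known_years) + slack
```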
Ranking
The algorithm takes researcher and asset data as input and runs the data through the features. At the end of this process, the pairing of the researcher and the specific asset author they were matched with is assigned a rank indicating the level of confidence in the match:
- Matched on ID
- Very strong match
- Strong match
- Uncertain match
- No match
The rank is used to determine the Smart Harvesting approval workflows.
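The five ranks can be modelled as an ordered enumeration. In the sketch below, the mapping from a numeric confidence score to a rank is purely illustrative: the `assign_rank` function, the idea of a single score between 0 and 1, and the threshold values are all assumptions, since the actual cut-offs are not documented here.

```python
from enum import IntEnum


class MatchRank(IntEnum):
    """Ranks listed above, ordered from weakest to strongest."""
    NO_MATCH = 0
    UNCERTAIN_MATCH = 1
    STRONG_MATCH = 2
    VERY_STRONG_MATCH = 3
    MATCHED_ON_ID = 4


def assign_rank(id_matched: bool, score: float) -> MatchRank:
    """Map a match to a rank. Thresholds are hypothetical placeholders."""
    if id_matched:
        # Identifier or email matches bypass the scored features.
        return MatchRank.MATCHED_ON_ID
    if score >= 0.9:
        return MatchRank.VERY_STRONG_MATCH
    if score >= 0.7:
        return MatchRank.STRONG_MATCH
    if score >= 0.4:
        return MatchRank.UNCERTAIN_MATCH
    return MatchRank.NO_MATCH
```

Because the enumeration is ordered, a workflow rule can be written as a comparison, for example `assign_rank(False, 0.95) >= MatchRank.STRONG_MATCH`.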