Author Matching Algorithm

Esploro has developed a sophisticated algorithm using machine-learning methodologies to match authors. The algorithm was developed and is being continuously improved by a dedicated team of data scientists.

All types of Smart Harvesting and Smart Expansion make use of the Author Matching (AM) algorithm. The process begins by trying to match on identifiers or email addresses, but these are often not available in the data, in which case the AM algorithm is used.
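The order of operations described above can be pictured with a minimal sketch. The function name, the dictionary keys ("ids", "emails") and the returned stage labels are assumptions made for illustration, not Esploro's actual interface.

```python
def match_stage(researcher: dict, author: dict) -> str:
    """Return which stage decides the match: 'id', 'email' or 'algorithm'.

    Hypothetical record layout: each dict may carry an 'ids' list and an
    'emails' list. Neither key is taken from the Esploro data model.
    """
    # Stage 1: a shared identifier settles the match directly.
    if set(researcher.get("ids", [])) & set(author.get("ids", [])):
        return "id"

    # Stage 2: a shared email address, compared case-insensitively.
    researcher_emails = {e.lower() for e in researcher.get("emails", [])}
    author_emails = {e.lower() for e in author.get("emails", [])}
    if researcher_emails & author_emails:
        return "email"

    # Stage 3: no identifier or email in common, so defer to the AM algorithm.
    return "algorithm"
```

In this sketch, an author record with no identifiers or emails always falls through to the third stage, which is the common case the article describes.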

The following sections describe the data used by the algorithm and some of its key features, and explain the ranking in use.

Data

The algorithm uses the following data from the Researcher record:

Name and name variants
Affiliations
Research topics
Area of expertise
Biographical info, e.g. education, honors
Metadata from the most recent assets that have already been associated with the researcher:
- Title
- Authors (to find co-authors)
- Subjects
- Abstract
- Year
- Journal title

The algorithm uses the following data from the candidate assets:

Author names
Title
Subjects
Abstract
Year
Journal title
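As a rough illustration, the two inputs listed above could be modeled as the following records. The class and field names are assumptions chosen to mirror the lists; they are not the Esploro schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Asset:
    """Metadata of an asset (hypothetical field names)."""
    title: str
    authors: List[str]
    subjects: List[str] = field(default_factory=list)
    abstract: str = ""
    year: Optional[int] = None
    journal_title: str = ""


@dataclass
class Researcher:
    """Data taken from the researcher record (hypothetical field names)."""
    name: str
    name_variants: List[str] = field(default_factory=list)
    affiliations: List[str] = field(default_factory=list)
    research_topics: List[str] = field(default_factory=list)
    areas_of_expertise: List[str] = field(default_factory=list)
    biography: str = ""
    # Most recent assets already associated with the researcher.
    recent_assets: List[Asset] = field(default_factory=list)

    def co_authors(self) -> Set[str]:
        """Author names on the recent assets, excluding the researcher's own names."""
        own_names = {self.name, *self.name_variants}
        return {
            author
            for asset in self.recent_assets
            for author in asset.authors
            if author not in own_names
        }
```

The co_authors helper shows why the author list of already-associated assets is part of the researcher data: it is the source of the co-author information.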

Features

The algorithm uses multiple "features". A feature is an individual measurable property or characteristic of the data. This section outlines the main features in use.

Name Features

The algorithm uses several features that match names considering name similarity, name variants, and name frequencies.
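The three ingredients named above can be sketched with the Python standard library. This is only an illustration: the normalization rules, the use of difflib for string similarity and the rarity weighting are all assumptions, not the features Esploro actually computes.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable


def normalize(name: str) -> str:
    """Lower-case the name and drop periods, commas and extra spaces."""
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())


def name_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_variant_similarity(author_name: str, variants: Iterable[str]) -> float:
    """Highest similarity between the author name and any known name variant."""
    return max((name_similarity(author_name, v) for v in variants), default=0.0)


def rarity_weight(surname: str, surname_counts: Dict[str, int], total: int) -> float:
    """Hypothetical frequency feature: a rare surname is stronger evidence.

    surname_counts maps lower-cased surnames to how often they occur in some
    reference population of size total.
    """
    if total <= 0:
        return 0.0
    return 1.0 - surname_counts.get(surname.lower(), 0) / total
```

The idea behind the frequency weight is that a match on a very common name says less about identity than a match on an unusual one.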

Embedded Text Features

This group of features extracts concepts and subject entities from the various texts available in both the researcher and asset metadata, to help determine how close the author of the asset is to the researcher in terms of subject area. This is done by grouping words semantically and by creating and using text vectors.

Semantic Features

This group of features uses state-of-the-art Natural Language Processing algorithms to determine how close the textual data of the researcher is to that of the candidate asset. We train a neural network on over 100 million abstracts from our Central Discovery Index to create "word embeddings". These are vector representations of all the words appearing in this corpus, arranged so that similar words are grouped together. These vector representations, in turn, allow us to calculate a "distance" between words and, more generally, between texts.
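To make the notion of a "distance" between texts concrete, here is a minimal sketch that averages word vectors and compares them with cosine similarity. The three-dimensional vectors are invented toy values, and averaging plus cosine distance is one common technique chosen for illustration; the article does not state which distance measure the algorithm uses.

```python
import math
from typing import List

# Toy embeddings, made up for illustration. Real embeddings are learned
# from a large corpus and have far more dimensions.
EMBEDDINGS = {
    "protein": [0.9, 0.1, 0.0],
    "enzyme": [0.8, 0.2, 0.1],
    "galaxy": [0.0, 0.1, 0.9],
    "telescope": [0.1, 0.0, 0.8],
}


def text_vector(text: str) -> List[float]:
    """Average the vectors of the known words in the text."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors, or 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def text_distance(a: str, b: str) -> float:
    """Cosine distance between two texts: near 0 means similar, near 1 unrelated."""
    return 1.0 - cosine_similarity(text_vector(a), text_vector(b))
```

With these toy values, a researcher text about proteins and enzymes ends up much closer to an asset about enzymes than to one about galaxies and telescopes.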

Additional Features

The algorithm also makes use of the following features:

Matching affiliations of author and researcher – this matching is based on the affiliation Esploro has for the researcher and any affiliation for the author in the record. In addition to specific affiliations, the countries of affiliations are taken into account.
Co-author network – researchers tend to collaborate repeatedly with the same colleagues. Esploro has created a network of co-authors from the assets that have already been associated with the researcher.
Date matching – the algorithm matches the date of the asset with known dates for the researcher.
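The three additional features can be sketched as simple scoring functions. The score values (1.0, 0.5, 0.0), the tuple layout for affiliations and the five-year slack are illustrative assumptions only.

```python
from typing import Iterable, List, Tuple

Affiliation = Tuple[str, str]  # (institution, country), a hypothetical layout


def affiliation_score(researcher_affils: Iterable[Affiliation],
                      author_affils: Iterable[Affiliation]) -> float:
    """1.0 for a shared institution, 0.5 for a shared country only, else 0.0."""
    researcher_affils = list(researcher_affils)
    author_affils = list(author_affils)
    if {i.lower() for i, _ in researcher_affils} & {i.lower() for i, _ in author_affils}:
        return 1.0
    if {c.lower() for _, c in researcher_affils} & {c.lower() for _, c in author_affils}:
        return 0.5
    return 0.0


def coauthor_overlap(known_coauthors: Iterable[str], other_authors: List[str]) -> float:
    """Fraction of the asset's other authors already in the researcher's network."""
    if not other_authors:
        return 0.0
    known = {name.lower() for name in known_coauthors}
    hits = sum(1 for name in other_authors if name.lower() in known)
    return hits / len(other_authors)


def date_plausible(asset_year: int, known_years: List[int], slack: int = 5) -> bool:
    """True if the asset year lies within `slack` years of the known date range."""
    if not known_years:
        return False  # no known dates, so no supporting evidence
    return min(known_years) - slack <= asset_year <= max(known_years) + slack
```

Each function returns a small, independent signal; how such signals are weighted and combined is not described in this article.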

Ranking

The algorithm takes researcher and asset data as input and runs the data through the features. At the end of this process, the pairing of the researcher with the specific asset author they were matched against is assigned a rank indicating the level of confidence in the match:

Matched on ID
Very strong match
Strong Match
Uncertain Match
No match

The rank is used to determine the Smart Harvesting approval workflows.
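As a sketch of how a confidence value might be turned into one of the ranks listed above, assume the features are combined into a single score between 0 and 1. The score itself and the thresholds below are invented for illustration; the article does not describe how the ranks are actually derived.

```python
from enum import Enum


class Rank(Enum):
    MATCHED_ON_ID = "Matched on ID"
    VERY_STRONG = "Very strong match"
    STRONG = "Strong Match"
    UNCERTAIN = "Uncertain Match"
    NO_MATCH = "No match"


def assign_rank(id_match: bool, score: float) -> Rank:
    """Map an identifier match or a combined score to a rank.

    The thresholds 0.9, 0.7 and 0.4 are hypothetical.
    """
    if id_match:
        return Rank.MATCHED_ON_ID
    if score >= 0.9:
        return Rank.VERY_STRONG
    if score >= 0.7:
        return Rank.STRONG
    if score >= 0.4:
        return Rank.UNCERTAIN
    return Rank.NO_MATCH
```

An approval workflow could then branch on the returned rank, for example by treating higher ranks differently from uncertain ones.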