CDI

Central Discovery Index (CDI)

The Central Discovery Index (CDI) is used in Esploro as the source for Smart Harvesting and for auto-population of asset metadata during a manual deposit. CDI contains billions of records and adds more daily from multiple sources: publishers, aggregators, and repositories of various kinds. CDI is inclusive and harvests records from all subject domains. There are over 30,000 sources.

Some numbers:

750+ million Journal Articles
730+ million Books / eBooks / Book chapters
110+ million Patents
9+ million Datasets
50+ million Conference Proceedings

See CDI Record Summary and Sources for an A-Z list of all the publishers, aggregators, and other content contributors that provide content to CDI.

How often is CDI updated?

CDI harvests records from its sources on an ongoing basis. The harvesting frequency varies by source, from daily to monthly. Some key sources, such as Crossref, are harvested daily. Keep in mind that CDI is indexed twice a week, so even when a source is harvested daily, its records are added to the index only after the next twice-weekly indexing run.
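The effect of twice-weekly indexing on record availability can be illustrated with a minimal sketch. The indexing days used here (Monday and Thursday) and the function name `next_indexing_date` are assumptions made for illustration only; this page does not state the actual schedule. The sketch also assumes that a record harvested on an indexing day waits for the following run.

```python
from datetime import date, timedelta

# Assumption: indexing runs on Monday (0) and Thursday (3).
# The real CDI indexing days are not stated in this documentation.
ASSUMED_INDEXING_WEEKDAYS = (0, 3)


def next_indexing_date(harvested_on, indexing_weekdays=ASSUMED_INDEXING_WEEKDAYS):
    """Return the first assumed indexing day strictly after the harvest date."""
    for offset in range(1, 8):
        candidate = harvested_on + timedelta(days=offset)
        if candidate.weekday() in indexing_weekdays:
            return candidate
    raise ValueError("indexing_weekdays must contain at least one weekday (0-6)")


# A record harvested on a Tuesday becomes searchable after Thursday's run.
harvested = date(2024, 1, 2)  # a Tuesday
available = next_indexing_date(harvested)
print(f"Harvested {harvested}, indexed on or after {available}")
```

Under these assumptions, a daily-harvested record waits between one and four days before it appears in the index.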

Using CDI records for Esploro Assets

As noted above, CDI gets records from many different data providers. Esploro selects a “preferred” record from all sources Esploro can legally use. The following guidelines are used to select the “preferred” record:

Records from Scopus cannot be used and are filtered out.
Preference is then given to records in the following order:
1. Records with well-formatted and rich author information. This means that the first name and last name are split, and the author has affiliation, ORCID, or email information. Author information is critical to the author-matching process. Generally speaking, records with good author metadata are good in other areas as well.
2. Records with DOIs.
3. Records that have subjects and/or abstracts.
4. Records from Web of Science.
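The guidelines above can be read as a filter followed by a priority ordering. The following is a minimal sketch of one possible interpretation, not Esploro's actual implementation. The `Author` and `CandidateRecord` classes, their field names, and the source labels `"Scopus"` and `"Web of Science"` are hypothetical. The sketch assumes that the criteria are applied lexicographically (an earlier criterion always outranks a later one) and that "rich author information" must hold for every author on the record.

```python
from dataclasses import dataclass, field


@dataclass
class Author:
    # Hypothetical author model; field names are assumptions.
    first_name: str = ""
    last_name: str = ""
    affiliation: str = ""
    orcid: str = ""
    email: str = ""


@dataclass
class CandidateRecord:
    # Hypothetical record model; field names are assumptions.
    source: str
    authors: list = field(default_factory=list)
    doi: str = ""
    subjects: list = field(default_factory=list)
    abstract: str = ""


def has_rich_authors(record):
    """True if every author has split names plus affiliation, ORCID, or email."""
    return bool(record.authors) and all(
        a.first_name and a.last_name and (a.affiliation or a.orcid or a.email)
        for a in record.authors
    )


def preference_key(record):
    """Tuple compared left to right, so earlier criteria dominate later ones."""
    return (
        has_rich_authors(record),                   # 1. rich author information
        bool(record.doi),                           # 2. has a DOI
        bool(record.subjects or record.abstract),   # 3. subjects and/or abstract
        record.source == "Web of Science",          # 4. from Web of Science
    )


def select_preferred(records):
    """Drop Scopus records, then return the highest-ranked remaining record."""
    usable = [r for r in records if r.source != "Scopus"]
    return max(usable, key=preference_key, default=None)


candidates = [
    CandidateRecord("Scopus", [Author("Ada", "Lovelace", orcid="x")], doi="10.1/a"),
    CandidateRecord("Crossref", [Author("Ada", "Lovelace", email="a@example.org")]),
    CandidateRecord("Web of Science", [], doi="10.1/a", abstract="Some text"),
]
preferred = select_preferred(candidates)
print(preferred.source if preferred else "no usable record")
```

In this sketch the Crossref record is selected: the Scopus record is filtered out first, and rich author information outranks having a DOI or coming from Web of Science.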

Record Metadata

With so many sources, the quality of the records in CDI can vary; even records from a single source will vary. Most of the records in CDI have high-quality metadata, but some are missing data or contain errors. This can happen even with records from very trustworthy sources, including publishers.

Note the following known issues:

There can be occasional duplicates.
Author affiliations are very often missing or messy.
Author names are sometimes inverted.
Corporate authors are sometimes incorrect.
Article numbers are sometimes missing or recorded as issue numbers.

The Esploro team is working on improving the results where possible.

Additional References

General Overview of Smart Harvesting Framework