AI Bibliographic Records Enrichment


Within the Alma Community Zone, bibliographic records undergo enrichment from various sources, predominantly content providers' metadata. However, since certain book bibliographic records lack comprehensive MARC feeds, Ex Libris is exploring alternative methods for enrichment. Artificial intelligence presents an opportunity to enrich a greater number of records at scale by leveraging the full text of the book.

How does it work?

The AI-driven enrichment process uses the full text of a book to generate detailed metadata about the publication, extracting relevant information to populate MARC fields in the descriptive bibliographic record.

As of February 2024, metadata generated by AI is added to Alma Community Zone records, focusing on three specific MARC fields: Language (041 and 008), Summary (520), and LC subject headings (650). The primary focus has been on enriching Ebook Central books, which were lacking these specific metadata elements.

We are working to add more fields, including Dewey and LC Classification (082, 050), publishing information (260/264), additional authors and contributors (100/700), table of contents (505) and more.

What will the records look like?

MARC records enriched by AI will be explicitly marked as follows:

588$a Part of the metadata in this record was created by AI, based on the text of the resource.

The enhanced fields will contain a new subfield, $7, indicating that the content was generated by AI.

Because the 588 field is not indexed in Alma, an additional 035 field with the prefix (Exl-AI) and the record number will be included temporarily to make these CZ records searchable.
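
The markers described above can be checked programmatically. The following is a minimal sketch, not an Alma API: it assumes a record has already been parsed into a simple list of (tag, subfields) pairs, that the 035 identifier is carried in subfield $a, and that the $7 value contains the text "Generated by AI" (as in the 041 example later on this page). The record number in the sample is a placeholder.

```python
from typing import List, Tuple

Subfield = Tuple[str, str]          # (subfield code, value)
Field = Tuple[str, List[Subfield]]  # (MARC tag, subfields)

AI_NOTE_PREFIX = "Part of the metadata in this record was created by AI"
AI_ID_PREFIX = "(Exl-AI)"


def is_ai_enriched(record: List[Field]) -> bool:
    """Return True if the record has the 588 AI note or an (Exl-AI) 035 field."""
    for tag, subfields in record:
        for code, value in subfields:
            if tag == "588" and code == "a" and value.startswith(AI_NOTE_PREFIX):
                return True
            # Assumption: the prefixed identifier sits in 035 $a.
            if tag == "035" and code == "a" and value.startswith(AI_ID_PREFIX):
                return True
    return False


def ai_generated_tags(record: List[Field]) -> List[str]:
    """Return the tags of fields whose $7 subfield flags AI-generated content."""
    tags = []
    for tag, subfields in record:
        # Assumption: the $7 value contains "Generated by AI".
        if any(code == "7" and "generated by ai" in value.casefold()
               for code, value in subfields):
            tags.append(tag)
    return tags


# Hypothetical record used only for illustration.
sample_record: List[Field] = [
    ("035", [("a", "(Exl-AI)0000000000")]),  # placeholder record number
    ("041", [("a", "eng"), ("7", "Generated by AI")]),
    ("245", [("a", "An example title")]),
    ("520", [("a", "A short summary."), ("7", "Generated by AI")]),
    ("588", [("a", "Part of the metadata in this record was created by AI, "
                   "based on the text of the resource.")]),
]

print(is_ai_enriched(sample_record))
print(ai_generated_tags(sample_record))
```

Checking the $7 value rather than its mere presence is a deliberate choice in this sketch, since subfield $7 may carry other meanings in other MARC fields.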

While more indications and features on the Alma UI and analytics may be forthcoming, the indication of AI enrichment is currently confined to the Bibliographic record itself.

For example, MMS ID 993780000000425506:

More on the process for these specific fields:

The Language is determined by analyzing the language or languages of the text and converting the result to the language code used in fields 041 and 008 (35-37). Most CZ records, even brief ones, already have a language code in 008, but it is sometimes incorrect. The AI-generated language code will replace the one in 008 (35-37) and will also be added as field 041$a with a "$7Generated by AI" indication, so the source of the language code will be clear.

The Summary is produced from the book's text, aiming to provide a concise and detailed summary of the book’s main subjects.
The Subject Headings are also generated from the book's entire text and are formalized to align with the LCSH vocabulary. The new subjects may be partial and describe only some of the book's possible subjects. While in some cases the subject heading will contain subdivisions (subfields $v, $x, $y, $z), in most cases it will only have the topical term (subfield $a). Efforts are underway to enhance the subject functionality.
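
The language step described above can be illustrated with a short sketch. It is not the production code: the function names are hypothetical, the 008 field is assumed to be the standard 40-character fixed field, and when several languages are detected the first code is assumed to be the predominant one that goes into 008.

```python
from typing import List, Tuple

Subfield = Tuple[str, str]          # (subfield code, value)
Field = Tuple[str, List[Subfield]]  # (MARC tag, subfields)


def set_008_language(field_008: str, lang_code: str) -> str:
    """Replace character positions 35-37 of an 008 field with a language code."""
    if len(field_008) != 40:
        raise ValueError("008 must be exactly 40 characters long")
    if len(lang_code) != 3:
        raise ValueError("MARC language codes are three characters long")
    # Positions 35-37 (zero-based) correspond to the slice [35:38].
    return field_008[:35] + lang_code + field_008[38:]


def build_041(lang_codes: List[str]) -> Field:
    """Build a 041 field with one $a per language plus the AI indication in $7."""
    subfields: List[Subfield] = [("a", code) for code in lang_codes]
    subfields.append(("7", "Generated by AI"))
    return ("041", subfields)


# Hypothetical input: an 008 whose language code (fre) is wrong for an English book.
old_008 = "x" * 35 + "fre" + "  "
detected = ["eng"]  # assumed output of the language analysis

new_008 = set_008_language(old_008, detected[0])
field_041 = build_041(detected)
print(new_008[35:38], field_041)
```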

As we are generating only these three fields in this phase, records may still lack other important metadata such as additional subjects, authors, publishing information, classification and more. We are working to generate these fields using AI and create richer and fuller records.

The AI Bib enrichment project does not replace the existing enrichment process we have with providers and other metadata sources like the Library of Congress. In addition, we will not override existing rich metadata that came from a MARC source with AI-generated metadata unless the data is incorrect and should be removed.

For any questions regarding this project, please write to: AI.enriched@clarivate.com

AI Bibliographic Records Enrichment Webinar

Following the webinar given on March 27, 2024, we received many questions, which we answer in the following section.

Use of Full text:

The AI metadata generator for the Alma CZ generates the data from the book’s full text. We are working with publishers to cooperate on this project, as we believe it presents an opportunity for them too. We started this project with ProQuest’s Ebook Central books and plan to expand it to more providers and more collections in the Alma Community Zone.

Types of books used in this project:

The Alma Community Zone contains books in many genres. The AI tool currently works better with non-fiction books, so for this phase we will prioritize this genre. In the future, we will adjust our priorities for using the AI tool based on availability of full text and on customer usage. Based on our experience, the prompt itself and the post-processing need to be adjusted to optimize results for different types of books. For example, while reading the first 30 pages of a scholarly book can provide a good summary of its content thanks to the preface and table of contents, in fiction books the first 30 pages would generate a partial and even misleading summary.

Metadata generated:

The three fields we have chosen to focus on in this release are:

Language (041)
Description (520)
LC Subject Headings (650)

We are now focusing on improving the quality of these metadata fields and planning to enhance the tool to generate additional metadata fields. The fields we will focus on in the future are Classification (050, 082), Table of Contents (505), additional Authors, and identifiers. Some of these metadata elements can be found in the books themselves, and the challenge is to locate and extract them. In discussion with the working groups, we decided to begin with the 520 (summary) rather than the 505 (table of contents), since the 520 is generated by AI, which is the tool we are focusing on at this stage of the project, while the 505 can be extracted using non-AI tools.

LCSH: As explained in the webinar, we decided to start with LCSH because they are essential for many libraries and library patrons. The first batch of AI-generated LCSH subjects consists primarily of 650$a main headings, and only a small number of the AI-generated subjects have subdivisions. Assigning the correct subdivisions to subjects is an interesting challenge, and more work lies ahead before we can add subdivisions for most of the subjects. At the current stage, general LCSH terms in 650$a are improving record quality and contributing to discoverability. Currently, we are focusing on 650, 651 and 655, and in the future we hope to add 600/610 (from LC Names) and also consider working with additional vocabularies, such as FAST.

Summary: The summary's quality and tone are a result of what we ask for in the AI prompt. Currently, we have achieved informative and accurate summaries for non-fiction books. As we expand to other genres, we will adjust the prompt and post-processing for different types of books.

The tool:

The AI engines we mostly use are GPT-3.5 and GPT-4 (by OpenAI), but we are also able to test other LLMs, and we will consider working with them or with a combination of models in the future, based on quality and cost.
Part of the solution is to process the results generated by the AI tool and improve them. Sometimes we use AI again to improve and verify the results, and sometimes the results are adjusted and processed by non-AI tools, for example to confirm that the subject headings are valid LCSH.
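
As an illustration of the non-AI validation step mentioned above, the sketch below filters candidate subject headings against a list of authorized headings. It is an assumption-laden example: the small in-memory set stands in for the real LCSH vocabulary, the normalization rules are hypothetical, and this page does not describe how the actual check is implemented.

```python
from typing import Iterable, List


def normalize(heading: str) -> str:
    """Normalize a heading for comparison: trim, drop a final period, fold case."""
    return " ".join(heading.strip().rstrip(".").split()).casefold()


def filter_valid_headings(candidates: Iterable[str],
                          authorized: Iterable[str]) -> List[str]:
    """Keep only candidates that match an authorized heading, in authorized form."""
    index = {normalize(h): h for h in authorized}
    valid: List[str] = []
    seen = set()
    for candidate in candidates:
        key = normalize(candidate)
        if key in index and key not in seen:
            valid.append(index[key])  # emit the authorized spelling
            seen.add(key)             # avoid duplicate headings
    return valid


# Hypothetical stand-in for the LCSH vocabulary.
authorized_headings = {"Machine learning", "Libraries", "Cataloging"}

# Hypothetical AI output, including an invalid term and a duplicate.
ai_candidates = ["machine learning.", "AI stuff", "Libraries", "libraries "]

valid_headings = filter_valid_headings(ai_candidates, authorized_headings)
print(valid_headings)
```

Returning the authorized form of each heading, rather than the AI's own spelling, keeps the output consistent with the controlled vocabulary.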

Text formats:

To date we have tested PDF, EPUB and TXT formats. Overall, we see good results in how the AI can process the text. We are working internally on optimization to ensure cost and effort efficiencies where possible.

What the enriched records look like in Alma:

The records are clearly marked as enriched by AI (field 588), and in the specific fields (with subfield 7 – see screenshot). We will not detail which AI engine exactly was used to generate the data, as we work with several LLMs and tools to achieve the best quality. Also, we are not planning to indicate if the metadata element was reviewed by humans. Human review is an important part of our early work on this project, but we are working on automated quality assurance (QA) processes which will guarantee sufficient quality and will be optimized to the number of records we hope to process. As the tool develops with better quality and more fields, we may revisit some of the titles we have already enriched if necessary.

Ex Libris is taking steps to ensure the AI-generated metadata will interact well with other sources of record metadata in the CZ. In the Alma CZ we have one Bibliographic record for each resource, and when its MARC is sufficiently high quality, we will choose not to update it with additional changes. We can block or allow changes based on a set of rules to achieve this. We are currently working on these rules to allow keeping authorized fields (metadata created by content providers or other cataloging sources such as Library of Congress) rather than overriding them with AI generated fields, as well as replacing AI generated fields with future better metadata provided by authorized sources.
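
The intent of these rules can be sketched as follows. Since the rules are still being defined, this is only a hypothetical illustration of the stated principle (authorized fields win over AI-generated fields with the same tag), not the actual Alma CZ logic. Fields are modeled as simple (tag, subfields) pairs, and an AI-generated field is assumed to be recognizable by a $7 value containing "Generated by AI".

```python
from typing import Dict, List, Tuple

Subfield = Tuple[str, str]          # (subfield code, value)
Field = Tuple[str, List[Subfield]]  # (MARC tag, subfields)


def is_ai_field(field: Field) -> bool:
    """Assumption: AI-generated fields carry a $7 containing 'Generated by AI'."""
    _, subfields = field
    return any(code == "7" and "generated by ai" in value.casefold()
               for code, value in subfields)


def prefer_authorized(candidates: List[Field]) -> List[Field]:
    """If any authorized field exists for a tag, drop the AI-generated ones."""
    authorized = [f for f in candidates if not is_ai_field(f)]
    return authorized if authorized else list(candidates)


def merge_fields(existing: List[Field], incoming: List[Field]) -> List[Field]:
    """Merge two field lists tag by tag, keeping authorized metadata over AI."""
    by_tag: Dict[str, List[Field]] = {}
    for field in existing + incoming:
        bucket = by_tag.setdefault(field[0], [])
        if field not in bucket:  # skip exact duplicates
            bucket.append(field)
    merged: List[Field] = []
    for tag in sorted(by_tag):
        merged.extend(prefer_authorized(by_tag[tag]))
    return merged


# Hypothetical example: a provider later supplies a 520, but no subjects.
existing_fields: List[Field] = [
    ("520", [("a", "AI summary"), ("7", "Generated by AI")]),
    ("650", [("a", "Libraries"), ("7", "Generated by AI")]),
]
incoming_fields: List[Field] = [("520", [("a", "Publisher summary")])]

merged_fields = merge_fields(existing_fields, incoming_fields)
print(merged_fields)
```

In this example the provider's 520 replaces the AI-generated summary, while the AI-generated 650 stays because no authorized subject heading is available.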

Copyright:
Copyright is an important issue that needs to be considered in any use of LLMs. We are working under the guidance of legal advisors, and we take action to ensure we do not risk using copyrighted material for any unauthorized purpose.

Working with the community:

As we work to improve the quality of existing fields and add more fields to be generated by AI, the results are reviewed by an internal group of librarians from different teams, as well as by community members. We are happy to receive your feedback on the records we have already released (and answer additional questions) via the email ai.enriched@clarivate.com.

Additional uses suggested during the webinar:

We had several questions and interesting ideas about additional features and uses that AI and specifically this metadata generator can help with. For example, enriching metadata in CDI, enriching local bib data via a tool in the Alma MD Editor, improving the way search is done in Alma and in other discovery products, and more. There are indeed many future possibilities for implementing AI in our products, and Clarivate as a company is focused very much on responsibly researching and developing in this area.