METS encoded as UTF-8 with a byte order mark may cause the ingest to fail
- Product: Rosetta
- Product Version: v5
- Relevant for Installation Type: Local
Description
When creating METS and other metadata files for ingest, it is important to know whether the UTF-8 encoding in use includes a byte order mark (BOM).
Resolution
METS files encoded as UTF-8 with a byte order mark (BOM) may cause the METS ingest to fail with an "unexpected CDATA" error.
The presence of the extra BOM character can be detected by viewing the ie.xml with the 'less' command, which reveals the following string preceding the header:
<U+FEFF><mets:mets xmlns:mets="http://www.loc.gov/METS/">
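The same check can be scripted when many files need to be inspected before ingest. The following is a minimal sketch in Python, not part of Rosetta; the function names are hypothetical. It relies on the fact that U+FEFF is stored in UTF-8 as the three bytes EF BB BF, which the standard library exposes as codecs.BOM_UTF8.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string begins with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def file_has_utf8_bom(path: str) -> bool:
    """Read only the first three bytes of a file and check them for the BOM."""
    with open(path, "rb") as handle:
        return starts_with_utf8_bom(handle.read(len(codecs.BOM_UTF8)))
```

For example, file_has_utf8_bom("ie.xml") would return True for a file that 'less' displays with a leading <U+FEFF>.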
Therefore, all metadata files (e.g. METS, CSV, dc.xml) must be encoded as UTF-8 without a BOM.
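If a file has already been saved with a BOM, the leading bytes can be removed before ingest. The sketch below is one possible approach in Python and is not a Rosetta utility; the function names are hypothetical, and it assumes the file is small enough to read into memory and that it is safe to overwrite it in place.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM; other bytes are unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def rewrite_without_bom(path: str) -> bool:
    """Rewrite the file in place without its BOM. Return True if it was changed."""
    with open(path, "rb") as handle:
        original = handle.read()
    cleaned = strip_utf8_bom(original)
    if cleaned == original:
        return False  # no BOM found, leave the file untouched
    with open(path, "wb") as handle:
        handle.write(cleaned)
    return True
```

Working on raw bytes rather than decoded text leaves the rest of the file byte-for-byte identical, so only the three BOM bytes are dropped.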
Refer to the Additional Information section below for more information about BOM.
Additional Information
The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK (BOM), whose appearance as a magic number at the start of a text stream can signal several things to a program consuming the text:
- What byte order, or endianness, the text stream is stored in;
- The fact that the text stream is Unicode, to a high level of confidence;
- Which of several Unicode encodings that text stream is encoded as.
Use of a BOM is optional. If one is used, it appears at the start of the text stream.
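To illustrate how a program can use the BOM to tell encodings apart, the sketch below maps the leading bytes of a stream to an encoding name using the BOM constants in Python's standard codecs module. The function name and the table are hypothetical, and the result is only a heuristic: a file with no BOM returns None, and the UTF-32 entries must be tested before UTF-16 because the UTF-32 little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).

```python
import codecs

# Longer signatures first, so UTF-32 LE is not mistaken for UTF-16 LE.
BOM_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def encoding_from_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOM_SIGNATURES:
        if data.startswith(bom):
            return name
    return None
```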
- Article last edited: 03-January-2017