# Handling of UTF-8 characters not representable in MARC-8

• Article Type: General
• Product: Aleph
• Product Version: 19.01

Description:
In Unicode-to-MARC-8 conversion, when a Unicode character is not in the conversion table (because the character has no MARC-8 representation), Aleph generated the sequence \U+nnnn\ [where "nnnn" is the hexadecimal representation of the character].

Instead of this, Aleph should follow the "Lossless conversion to MARC-8 encoding" convention described at
http://www.loc.gov/marc/specifications/speccharconversion.html#specissues

Lossless conversion to MARC-8 encoding

In the lossless conversion method, a Unicode character that is not in the MARC-8 repertoire is replaced by a hexadecimal Numeric Character Reference (NCR) identifying the specific unconvertible Unicode code point. This method preserves precisely the information content of the Unicode record, although it may result in a cryptic display, and additional conversion techniques will be required to reconstruct the record exactly in Unicode. The Numeric Character Reference consists only of ASCII characters and thus can be carried into the MARC-8 target record.

The structure of the NCR is &#xXXXX; where:
• & and ; (the ampersand and semicolon) surround the Reference data
• #x designates that the value expressed is in hexadecimal notation
• XXXX is the hexadecimal representation of the code point for the Unicode character expressed in hex digits 0123456789ABCDEF. Some characters, primarily infrequently encountered CJK ideographs, may require more than four hexadecimal digits. The NCR can contain more than four digits if they are needed.

It is not correct to represent a non-ASCII character in an NCR by its UTF-8 octets; only the scalar value of the code point is allowed.

<end www.loc.gov document>
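
The NCR structure described above can be illustrated with a minimal Python sketch. This is not Aleph's implementation: the function names are hypothetical, and the `is_representable` predicate is only a stand-in for a real MARC-8 conversion table (here it simply treats ASCII as representable, which is far narrower than the actual MARC-8 repertoire).

```python
def to_ncr(ch: str) -> str:
    """Return the hexadecimal NCR (&#xXXXX;) for one Unicode character.

    ord() yields the scalar value of the code point, not its UTF-8 octets.
    The value is padded to at least four uppercase hex digits and grows
    beyond four digits when the code point requires it.
    """
    return f"&#x{ord(ch):04X};"


def is_ascii(ch: str) -> bool:
    # ASSUMPTION: placeholder for a lookup in a real MARC-8 conversion table.
    return ord(ch) < 0x80


def lossless_encode(text: str, is_representable=is_ascii) -> str:
    """Replace every character the target cannot represent with its NCR."""
    return "".join(ch if is_representable(ch) else to_ncr(ch) for ch in text)


if __name__ == "__main__":
    print(lossless_encode("a\u200fb"))      # U+200F has no ASCII form
    print(lossless_encode("\U00020000"))    # code point needing five hex digits
```

Because the output contains only ASCII characters, it can be carried unchanged into a MARC-8 record, which is the property the convention relies on.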

Resolution:
Corrected by rep_changes:

v19 - rep_change 752
v20 - rep_change 2245

Implementation Notes:

1. To enable conversion to NCR (format &#xXXXX;) for Unicode characters that do not exist in MARC-8, set col.6 in tab_character_conversion_line for the routine UTF_TO_MARC8 to Y.

2. Run UTIL/H/3 to synchronize the header.

An additional correction was made in v20 rep_change 2426.

v20 rep_change 2426: Description: OCLC (MARC-8 encoding): When importing records that contained character references of the form &#xXXXX;, where XXXX is the hexadecimal Unicode code point (e.g. &#x200F; = RIGHT-TO-LEFT MARK), the reference was imported into Aleph as is, as the literal text &#xXXXX;, instead of being replaced by the Unicode character it denotes.

Solution: This has been corrected.
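
The import-side behavior, replacing each &#xXXXX; reference with the character it denotes, can be sketched in Python as follows. This is an illustrative sketch rather than Aleph's code: `decode_ncrs` is a hypothetical name, and accepting lowercase hex digits and leaving invalid references untouched are assumptions made here, not documented Aleph behavior.

```python
import re

# Four to six hex digits between "&#x" and ";".
# ASSUMPTION: lowercase hex digits are accepted as well as uppercase.
NCR_PATTERN = re.compile(r"&#x([0-9A-Fa-f]{4,6});")


def decode_ncrs(text: str) -> str:
    """Replace hexadecimal NCRs with the Unicode characters they denote."""

    def replace(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        # Leave references that are not valid Unicode scalar values as is
        # (out of range, or in the surrogate block).
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return NCR_PATTERN.sub(replace, text)


if __name__ == "__main__":
    decoded = decode_ncrs("a&#x200F;b")
    print([f"U+{ord(c):04X}" for c in decoded])
```

Text that merely contains an ampersand, such as "AT&T", does not match the pattern and passes through unchanged.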

• Article last edited: 10/8/2013