Brief overview of the UTF8to16() function in Voyager Prepackaged Access Reports

Last updated
Save as PDF
Share
1. Share
2. Tweet
3. Share

Product: Voyager
Relevant for Installation Type: Multi-Tenant Direct, Dedicated-Direct, Local, TotalCare

Question

What does the UTF8to16() function do in Voyager Prepackaged Access Reports?

Answer

Data in the Oracle database is stored using the US7ASCII character set, but it's not actually US7ASCII. The idea of US7ASCII is that only 7 of the 8 bits in a byte are significant. The high-order bit should always be zero and should be ignored. So in binary, a blank (" ") is 00100000. A lower-case "g" is 01100111. The very highest ASCII character is 01111111, which is a {DEL}. As you can see, in all cases, the high-order bit is 0. There are extensions of ASCII that turn on the high-order bit to get twice as many characters, but US7ASCII is supposed to disallow or ignore those distinctions.

However, in a Voyager database, although it says it's US7ASCII, the characters are actually UNICODE. In UNICODE, all bits are significant. The first 128 characters of UNICODE are the standard ASCII characters. From then on, you have to use multiple bytes to express a character. So, you turn on the high-order bit in the first byte, which gives you a bunch more possibilities, such as 11000010 10101110. That two-byte value boils down to a single character ((r)). UNICODE characters come in different flavors. For the full set, you need four bytes. But to save space, the normal characters are stored in a single byte, then you add only the number of bytes you need. So, you need to look at the high-order bits in order to tell how many bytes to use. UTF8 means the shortest character is 8 bits or one byte in length. If it needs more, it will expand out to as many as four bytes per character, but any character can be as short as one byte.

When you hook up the ODBC drivers to a Voyager database, it looks at the character set in the Oracle database and finds that it is US7ASCII. When it brings characters over from Oracle to the PC, it assumes that it can ignore the high-order bit, so it turns it off or it tries to translate the character and finds that these funky things are not listed in its US7ASCII character set translation tables. So, that makes a mess. We solve that mess by changing the NLS_LANG setting on the PC to US7ASCII. Now that both sides are the same, the drivers stop translating. The characters still display as garbage, because Access doesn't know they're UTF8. So, you have to run the data through a function to translate the characters from what is actually UTF8 but displays as if it's ASCII with bad characters to true UNICODE. That's what the UTF8to16() function does. Ex Libris wrote it and included it in the REPORTS.MDB to solve this exact problem.

Additional Information

Usage Example:

SELECT DISTINCT BIB_TEXT.BIB_ID, utf8to16([bib_text].[TITLE_BRIEF]) AS Title_Brief, utf8to16([bib_text].[AUTHOR]) AS Author, utf8to16([bib_text].[IMPRINT]) AS Imprint
FROM BIB_TEXT
GROUP BY BIB_TEXT.BIB_ID, utf8to16([bib_text].[TITLE_BRIEF]), utf8to16([bib_text].[AUTHOR]), utf8to16([bib_text].[IMPRINT]);

Constructing custom SQL queries by request and troubleshooting unexpected results from customer-created SQL queries falls outside the scope of Support. The above has been posted for informational purposes. Voyager-L and Developer Network are useful resources for finding helpful custom SQL or obtaining assistance from peers in troubleshooting custom queries.

Article last edited: 27-Oct-2021