UTF-EBCDIC


UTF-EBCDIC is a character encoding used to represent Unicode characters. It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. Details on UTF-EBCDIC are defined in Unicode Technical Report #16.
To produce the UTF-EBCDIC encoded version of a series of Unicode code points, an encoding based on UTF-8 is applied first. The main difference between this encoding and UTF-8 is that it allows Unicode code points U+0080 through U+009F to be represented as a single byte and therefore later mapped to corresponding EBCDIC control codes. In order to achieve this, UTF-8-Mod uses 101XXXXX instead of 10XXXXXX as the format for trailing bytes in a multi-byte sequence. As this can only hold 5 bits rather than 6, the UTF-8-Mod encoding of codepoints above U+009F is generally larger than the UTF-8 encoding.
The UTF-8-Mod transformation leaves the data in an ASCII-based format, so each byte is fed through a reversible lookup table to produce the final UTF-EBCDIC encoding. For example, 01000001 in this table maps to 11000001; thus the UTF-EBCDIC encoding of U+0041 is 0xC1.
This encoding form is rarely used, even on the EBCDIC-based mainframes for which it was designed. IBM EBCDIC-based mainframe operating systems, such as z/OS, usually use UTF-16 for complete Unicode support. For example, DB2 UDB, COBOL, PL/I, Java and the IBM XML toolkit support UTF-16 on IBM mainframes.

Codepage layout

There are 160 characters with single-byte encodings in UTF-EBCDIC. As can be seen, the single-byte portion is similar to IBM-1047 instead of IBM-37 due to the location of the square brackets. CCSID 37 has at hex BA and BB instead of at hex AD and BD respectively.
Blue cells containing a large single-digit number are the start bytes for a sequence of that many bytes. The unbolded hexadecimal code point number shown in the cell is the lowest character value encoded using that start byte. This value can be greater than the value which would be obtained by following the start byte with continuation bytes which are all 65, if this would result in an invalid overlong form.
Orange cells with one dot are continuation bytes. The hexadecimal number shown after a "+" plus sign is the value of the 5 bits they add.
Red cells indicate start bytes which can never appear in properly encoded UTF-EBCDIC text, because any possible continuation would result in an invalid overlong form. For example, 0x76 is marked in red because even 0x76 0x73 would merely be an overlong encoding of U+005F.

Oracle UTFE

Oracle UTFE is a Unicode 3.0 UTF-8 Oracle database variation, similar to the CESU-8 variant of UTF-8, where supplementary characters are encoded as two 4-byte characters rather than a single 4- or 5-byte character. It is used only on EBCDIC platforms.
Advantages:
Disadvantages: