MARC-8


MARC-8 is the character encoding defined by the MARC standards and used in MARC-21 library records. The MARC formats are standards for the representation and communication of bibliographic and related information in machine-readable form, and they are widely used in library database systems. The character encoding now known as MARC-8 was introduced in 1968 as part of the MARC format. It originally covered the Latin alphabet; from 1979 to 1983 the JACKPHY initiative expanded the repertoire to include Japanese, Arabic, Chinese, and Hebrew characters, and Cyrillic and Greek scripts were added later. If a MARC-21 record contains a character that MARC-8 cannot represent, the record must be encoded in UTF-8 instead. UTF-8 supports many more characters than MARC-8, and MARC-8 itself is rarely used outside library data.

Technical details

MARC-8 uses a variant of the ISO-2022 encoding. It uses escape sequences to switch between character sets and so reach characters beyond the 7-bit ASCII range.
It generally uses the same logical BiDi ordering as Unicode.
Combining characters and base characters are stored in a different order than in Unicode: in MARC-8 the combining characters precede the base character, whereas in Unicode they follow it. The combining characters are not always stored in the exact reverse of the order produced by Unicode normalization. The MARC-21 standard describes the MARC-8 to Unicode conversion issues in more detail. The following are some examples.
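| Displayed character | Unicode NFD | MARC-8 |
| --- | --- | --- |
| á | a ́ (U+0061 U+0301) | ́ a (U+0301 U+0061) |
| ậ | a ̣ ̂ (U+0061 U+0323 U+0302) | ̂ ̣ a (U+0302 U+0323 U+0061) |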
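The reordering described above can be sketched in Python. This is a minimal illustration, not a full converter: it assumes the MARC-8 bytes have already been mapped to Unicode code points but are still in MARC-8 order (combining marks before their base character). The function name `marc8_order_to_unicode` is hypothetical. Unicode normalization then puts the marks into canonical order, so the sketch does not need to reverse them itself.

```python
import unicodedata


def marc8_order_to_unicode(chars: str) -> str:
    """Move combining marks that precede a base character to follow it.

    Assumption: `chars` holds Unicode code points in MARC-8 order,
    i.e. each run of combining marks comes before its base character.
    """
    out = []
    pending = []  # combining marks waiting for their base character
    for ch in chars:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    # Marks with no following base character are kept at the end.
    out.extend(pending)
    # NFC applies canonical reordering and then composes where possible.
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    print(marc8_order_to_unicode("\u0301a"))
    print(marc8_order_to_unicode("\u0302\u0323a"))
```

Applied to the two table rows, the sketch yields the precomposed characters U+00E1 and U+1EAD.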

Code structure

The ISO/IEC 2022 coding specifies a two-layer mapping between character codes and displayed characters. In MARC-8, character codes from the 7-bit ASCII graphic range are referred to as "G0" codes, while codes from the "high ASCII" (8-bit) range are referred to as "G1" codes. Graphic character sets are designated and invoked by means of a multi-byte escape sequence consisting of the escape character, an Intermediate character sequence, and a Final character, in the form ESC I F.
The following table shows the intermediate byte after the ESC byte, and the corresponding ASCII characters.
The following table shows the final bytes in hexadecimal and the corresponding ASCII characters after the intermediate bytes.
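| Bytes (hex) | Characters | Name | Type | Comment |
| --- | --- | --- | --- | --- |
| 31 | 1 | Chinese, Japanese, Korean | MBCS | |
| 32 | 2 | Basic Hebrew | SBCS | |
| 33 | 3 | Basic Arabic | SBCS | |
| 34 | 4 | Extended Arabic | SBCS | |
| 42 | B | Basic Latin | SBCS | |
| 21 45 | !E | Extended Latin | SBCS | The 21 is technically a second byte of the Intermediate segment of this escape sequence. |
| 4E | N | Basic Cyrillic | SBCS | |
| 51 | Q | Extended Cyrillic | SBCS | |
| 53 | S | Basic Greek | SBCS | |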
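As a rough sketch of how the ESC I F form and the final-byte table fit together, the following Python function reads one designation sequence from a byte string. It relies on the general ISO/IEC 2022 convention that intermediate bytes lie in the range 0x20 to 0x2F and final bytes in 0x30 to 0x7E. The function name `read_designation` and the `FINAL_BYTES` dictionary are hypothetical, and the sketch does not interpret which intermediate byte selects G0 or G1.

```python
ESC = 0x1B

# Final bytes from the table above: byte -> (name, type).
FINAL_BYTES = {
    0x31: ("Chinese, Japanese, Korean", "MBCS"),
    0x32: ("Basic Hebrew", "SBCS"),
    0x33: ("Basic Arabic", "SBCS"),
    0x34: ("Extended Arabic", "SBCS"),
    0x42: ("Basic Latin", "SBCS"),
    0x45: ("Extended Latin", "SBCS"),  # preceded by the extra 0x21 byte
    0x4E: ("Basic Cyrillic", "SBCS"),
    0x51: ("Extended Cyrillic", "SBCS"),
    0x53: ("Basic Greek", "SBCS"),
}


def read_designation(data: bytes, pos: int = 0):
    """Parse one ESC I F sequence starting at data[pos].

    Returns (intermediate_bytes, set_name, set_type, next_pos).
    """
    if pos >= len(data) or data[pos] != ESC:
        raise ValueError("expected ESC at position %d" % pos)
    i = pos + 1
    # Intermediate bytes occupy 0x20-0x2F in ISO/IEC 2022.
    while i < len(data) and 0x20 <= data[i] <= 0x2F:
        i += 1
    if i >= len(data):
        raise ValueError("truncated escape sequence")
    final = data[i]
    if final not in FINAL_BYTES:
        raise ValueError("unknown final byte 0x%02X" % final)
    name, kind = FINAL_BYTES[final]
    return data[pos + 1:i], name, kind, i + 1


if __name__ == "__main__":
    print(read_designation(b"\x1b\x24\x31\x21\x30\x64"))
```

The returned `next_pos` marks where the character data begins, so a caller can continue decoding from there with the newly designated set.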

EACC (East Asian Character Code) is the only multibyte character set in MARC-8; it encodes each CJK character as three bytes from the ASCII range.
For example, the CJK character U+4EBA (人) is encoded as the following bytes:
\x1B\x24\x31\x21\x30\x64
The prefix \x1B\x24\x31 (ESC $ 1) switches to EACC/CJK, and \x21\x30\x64 corresponds to U+4EBA.
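The three-byte grouping can be illustrated with a short Python sketch. The lookup table here contains only the single mapping given in the example above; a real converter would need the complete EACC to Unicode mapping, which this sketch does not attempt to provide. The names `EACC_TO_UNICODE` and `decode_eacc` are hypothetical.

```python
EACC_ESCAPE = b"\x1b\x24\x31"  # ESC $ 1, as in the example above

# Illustrative table with only the one mapping from the text.
EACC_TO_UNICODE = {
    b"\x21\x30\x64": "\u4eba",
}


def decode_eacc(data: bytes) -> str:
    """Decode a byte string consisting of the EACC escape plus 3-byte codes."""
    if not data.startswith(EACC_ESCAPE):
        raise ValueError("data does not start with the EACC escape sequence")
    payload = data[len(EACC_ESCAPE):]
    if len(payload) % 3 != 0:
        raise ValueError("EACC payload length must be a multiple of 3")
    chars = []
    for i in range(0, len(payload), 3):
        code = payload[i:i + 3]
        if code not in EACC_TO_UNICODE:
            raise ValueError("no mapping for EACC code %s" % code.hex())
        chars.append(EACC_TO_UNICODE[code])
    return "".join(chars)


if __name__ == "__main__":
    print(decode_eacc(b"\x1b\x24\x31\x21\x30\x64"))
```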

Custom set extension

In addition to the ISO-2022 character sets, MARC-8 defines the following custom sets. Each is designated by a single byte that directly follows the escape byte; there is no intermediate byte.
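| Bytes (hex) | Characters | Name | Type | Comment |
| --- | --- | --- | --- | --- |
| 62 | b | Subscript set | SBCS | |
| 67 | g | Greek Symbol set | SBCS | The alpha, beta, and gamma characters normally do not round-trip when mapped to Unicode. |
| 70 | p | Superscript set | SBCS | |
| 73 | s | Basic Latin | SBCS | |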
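A minimal Python sketch of switching between two of these custom sets follows. It handles only ESC b (subscript) and ESC s (back to Basic Latin), and it assumes that the subscript set places the digits at their ASCII positions and that they map to the Unicode subscript digits U+2080 to U+2089; that mapping is an assumption of this sketch, not something stated in the table. The function name `decode_with_subscripts` is hypothetical.

```python
ESC = 0x1B

# Assumption: subscript digits sit at the ASCII digit positions and
# map to the Unicode subscript digits U+2080..U+2089.
SUBSCRIPT_DIGITS = {0x30 + d: chr(0x2080 + d) for d in range(10)}


def decode_with_subscripts(data: bytes) -> str:
    """Decode ASCII text with ESC b / ESC s switching to and from subscripts."""
    out = []
    in_subscript = False
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESC:
            if i + 1 >= len(data):
                raise ValueError("truncated escape sequence")
            final = data[i + 1]
            if final == 0x62:    # 'b': subscript set
                in_subscript = True
            elif final == 0x73:  # 's': Basic Latin
                in_subscript = False
            else:
                raise ValueError("unsupported escape byte 0x%02X" % final)
            i += 2
            continue
        if in_subscript:
            if byte not in SUBSCRIPT_DIGITS:
                raise ValueError("no subscript mapping for 0x%02X" % byte)
            out.append(SUBSCRIPT_DIGITS[byte])
        elif byte < 0x80:
            out.append(chr(byte))
        else:
            raise ValueError("non-ASCII byte 0x%02X not handled" % byte)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    print(decode_with_subscripts(b"H\x1bb2\x1bsO"))
```

The two-byte escape is consumed as a unit because, as noted above, these custom sets have no intermediate byte.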