Unicode equivalence


Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with preexisting standard character sets, which often included similar or identical characters.
Unicode provides two such notions, canonical equivalence and compatibility. Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E followed by U+0303 is defined by Unicode to be canonically equivalent to the single code point U+00F1. Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other. Similarly, each Hangul syllable block that is encoded as a single character may be equivalently encoded as a combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.
Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. Thus, for example, the code point U+FB00 is defined to be compatible—but not canonically equivalent—to the sequence U+0066 U+0066. Compatible sequences may be treated the same way in some applications, but not in others; and may be substituted for each other in some situations, but not in others. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.
The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text. For each of the two equivalence notions, Unicode defines two normal forms, one fully composed, and one fully decomposed.

Character duplication

For compatibility or other reasons, Unicode sometimes assigns two different code points to entities that are essentially the same character. For example, the character "Å" can be encoded as U+00C5 or as U+212B. Yet the symbol for angstrom is defined to be that Swedish letter, and most other symbols that are letters do not have a separate code point for each usage. In general, the code points of truly identical characters are defined to be canonically equivalent.

Combining and precomposed characters

For consistency with some older standards, Unicode provides single code points for many characters that could be viewed as modified forms of other characters or as combinations of two or more characters
For consistency with other standards, and for greater flexibility, Unicode also provides codes for many elements that are not used on their own, but are meant instead to modify or combine with a preceding base character. Examples of these combining characters are the combining tilde and the Japanese diacritic dakuten.
In the context of Unicode, character composition is the process of replacing the code points of a base letter followed by one or more combining characters into a single precomposed character; and character decomposition is the opposite process.
In general, precomposed characters are defined to be canonically equivalent to the sequence of their base letter and subsequent combining diacritic marks, in whatever order these may occur.

Example

Typographical non-interaction

Some scripts regularly use multiple combining marks that do not, in general, interact typographically, and do not have precomposed characters for the combinations. Pairs of such non-interacting marks can be stored in either order. These alternative sequences are in general canonically equivalent. The rules that define their sequencing in the canonical form also define whether they are considered to interact.

Typographic conventions

Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons, or to add new semantics without losing the original one. Such a sequence is considered compatible with the sequence of original characters, for the benefit of applications where the appearance and added semantics are not relevant. However the two sequences are not declared canonically equivalent, since the distinction has some semantic value and affects the rendering of the text.

Encoding errors

and UTF-16 do not allow all possible sequences of code units. Different software will convert invalid sequences into Unicode characters using varying rules, some of which are very lossy. This can be considered a form of normalization and can lead to the same difficulties as others.

Normalization

The implementation of Unicode string searches and comparisons in text processing software must take into account the presence of equivalent code points. In the absence of this feature, users searching for a particular code point sequence would be unable to find other visually indistinguishable glyphs that have a different, but canonically equivalent, code point representation.
Unicode provides standard normalization algorithms that produce a unique code point sequence for all sequences that are equivalent; the equivalence criteria can be either canonical or compatibility. Since one can arbitrarily choose the representative element of an equivalence class, multiple canonical forms are possible for each equivalence criterion. Unicode provides two normal forms that are semantically meaningful for each of the two compatibility criteria: the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a canonical ordering on the code point sequence, which is necessary for the normal forms to be unique.
In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance some typographic ligatures like U+FB03, Roman numerals like U+2168 and even subscripts and superscripts, e.g. U+2075 have their own Unicode code points. Canonical normalization does not affect any of these, but compatibility normalization will decompose the ffi ligature into the constituent letters, so a search for U+0066 as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03. Likewise when searching for the Latin letter I in the precomposed Roman numeral Ⅸ. Similarly the superscript "⁵" is transformed to "5" by compatibility mapping.
Transforming superscripts into baseline equivalents may not be appropriate however for rich text software, because the superscript information is lost in the process. To allow for this distinction, the Unicode character database contains compatibility formatting tags that provide additional details on the compatibility transformation. In the case of typographic ligatures, this tag is simply , while for the superscript it is . Rich text standards like HTML take into account the compatibility tags. For instance HTML uses its own markup to position a U+0035 in a superscript position.

Normal forms

The four Unicode normalization forms and the algorithms for obtaining them are listed in the table below.
NFD
Normalization Form Canonical Decomposition
Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
NFC
Normalization Form Canonical Composition
Characters are decomposed and then recomposed by canonical equivalence.
NFKD
Normalization Form Compatibility Decomposition
Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.
NFKC
Normalization Form Compatibility Composition
Characters are decomposed by compatibility, then recomposed by canonical equivalence.

All these algorithms are idempotent transformations, meaning that a string that is already in one of these normalized forms will not be modified if processed again by the same algorithm.
The normal forms are not closed under string concatenation. For defective Unicode strings starting with a Hangul vowel or trailing conjoining jamo, concatenation can break Composition.
However, they are not injective and thus also not bijective. For example, the distinct Unicode strings "U+212B" and "U+00C5" are both expanded by NFD into the sequence "U+0041 U+030A" which is then reduced by NFC to "U+00C5".
A single character that will get replaced by another under normalization can be identified in the Unicode tables for having a non-empty compatibility field but lacking a compatibility tag.

Canonical ordering

The canonical ordering is mainly concerned with the ordering of a sequence of combining characters. For the examples in this section we assume these characters to be diacritics, even though in general some diacritics are not combining characters, and some combining characters are not diacritics.
Unicode assigns each character a combining class, which is identified by a numerical value. Non-combining characters have class number 0, while combining characters have a positive combining class value. To obtain the canonical ordering, every substring of characters having non-zero combining class value must be sorted by the combining class value using a stable sorting algorithm. Stable sorting is required because combining characters with the same class value are assumed to interact typographically, thus the two possible orders are not considered equivalent.
For example, the character U+1EBF, used in Vietnamese, has both an acute and a circumflex accent. Its canonical decomposition is the three-character sequence U+0065 U+0302 U+0301. The combining classes for the two accents are both 230, thus U+1EBF is not equivalent to U+0065 U+0301 U+0302.
Since not all combining sequences have a precomposed equivalent, even the normal form NFC is affected by combining characters' behavior.

Errors due to normalization differences

When two applications share Unicode data, but normalize them differently, errors and data loss can result. In one specific instance, OS X normalized Unicode filenames sent from the Samba file- and printer-sharing software. Samba did not recognize the altered filenames as equivalent to the original, leading to data loss. Resolving such an issue is non-trivial, as normalization is not losslessly invertible.