Binary Ordered Compression for Unicode

Binary Ordered Compression for Unicode is a MIME compatible Unicode compression scheme. BOCU-1 combines the wide applicability of UTF-8 with the compactness of Standard Compression Scheme for Unicode. This Unicode encoding is designed to be useful for compressing short strings, and maintains code point order. BOCU-1 is specified in a Unicode Technical Note.
For comparison SCSU was adopted as standard Unicode compression scheme with a byte/code point ratio similar to language-specific code pages. SCSU has not been widely adopted, as it is not suitable for MIME “text” media types. For example, SCSU cannot be used directly in emails and similar protocols. SCSU requires a complicated encoder design for good performance. Usually, the zip, bzip2, and other industry standard algorithms compact larger amounts of Unicode text more efficiently.
Both SCSU and BOCU-1 are IANA registered charsets.

Details

All numbers in this section are hexadecimal, and all ranges are inclusive.
Code points from U+0000 to U+0020 are encoded in BOCU-1 as the corresponding byte value. All other code points are encoded as a difference between the code point and a normalized version of the most recently encoded code point that was not an ASCII space. The initial state is U+0040. The normalization mapping is as follows:

Code range	Normalized code point	Notes
`U+3040` to `U+309F`	`U+3070`	Hiragana
`U+4E00` to `U+9FA5`	`U+7711`	Unihan
`U+AC00` to `U+D7A3`	`U+C1D1`	Hangul
`U+0020`	encoder state kept as is	Space
`U+hhhh00` to `U+hhhh7F`	`U+hhhh40`	middle of 128
`U+hhhh80` to `U+hhhhFF`	`U+hhhhC0`	middle of 128

The difference between the current code point and the normalized previous code point is encoded as follows:

Difference range	Byte sequence range
`-10FF9F` to `-2DD0D`	`21` `F0` `58` `D9` to `21` `FF` `FF` `FF`
`-2DD0C` to `-2912`	`22` `01` `01` to `24` `FF` `FF`
`-2911` to `-41`	`25` `01` to `4F` `FF`
`-40` to `3F`	`50` to `CF`
`40` to `2910`	`D0` `01` to `FA` `FF`
`2911` to `2DD0B`	`FB` `01` `01` to `FD` `FF` `FF`
`2DD0C` to `10FFBF`	`FE` `01` `01` `01` to `FE` `19` `B4` `54`

Each byte range is lexicographically ordered with the following thirteen byte values excluded: 00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20. For example, the byte sequence FC 06 FF, coding for a difference of 1156B, is immediately followed by the byte sequence FC 10 01, coding for a difference of 1156C.
Any ASCII input U+0000 to U+007F excluding space U+0020 resets the encoder to U+0040. Because the above-mentioned values cover line end code points U+000D and U+000A as is, the encoder is in a known state at the begin of each line. The corruption of a single byte therefore affects at most one line. For comparison, the corruption of a single byte in UTF-8 affects at most one code point, for SCSU it can affect the entire document.
BOCU-1 offers a similar robustness also for input texts without the above-mentioned values with the special reset code 0xFF. When a decoder finds this octet it resets its state to U+0040 as for a line end. The use of 0xFF reset bytes is not recommended in the BOCU-1 specification, because it conflicts with other BOCU-1 design goals, notably the binary order.
The optional use of a signature U+FEFF at the begin of BOCU-1 encoded texts, i.e. the BOCU-1 byte sequence FB EE 28, changes the initial state U+0040 to U+FEC0. In other words, the signature cannot simply be stripped as in most other Unicode encoding schemes. Adding a reset byte after the signature could avoid this effect, but the BOCU-1 specification does not recommend this practice.
In theory UTF-1 and UTF-8 could encode the original UCS-4 set with 31 bits up to 7FFFFFFF. BOCU-1 and UTF-16 can encode
the modern Unicode set from U+0000 to U+10FFFF. Excluding the thirteen protected code points encoded as single octets BOCU-1 can use octets in multi-byte encodings. BOCU-1 needs at most four bytes consisting of a lead byte and one to three trail bytes. The trail bytes encode a remaining "modulo 243" difference, the lead byte determines the number of trail bytes and an initial difference.
Note that the reset byte 0xFF is not protected and can occur as trail byte.

Patent

The general BOCU algorithm is covered by United States Patent #6,737,994, which also mentions the specific BOCU-1 implementation. IBM, which employed both of the inventors of BOCU-1 at the time it was created, states in the Unicode Technical Note that implementers of a "fully compliant version of BOCU-1" must contact IBM to request a royalty-free license. BOCU-1 is the only Unicode compression scheme described on the Unicode Web site that is known to be encumbered with intellectual property restrictions.
By contrast, IBM also filed for a patent on UTF-EBCDIC, but it chose in that case to make the documentation and encoding scheme “freely available to anyone concerned towards making the transformation format as part of the UCS standards,” instead of requiring implementers to request a license.

In HTML

Supporting BOCU-1 in HTML documents is prohibited by the W3C and WHATWG HTML standards, as it would present a cross-site scripting vulnerability.

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...