UTF-7

UTF-7 is a variable-length character encoding for representing Unicode text using a stream of ASCII characters. It was originally intended to provide a means of encoding Unicode text for use in Internet E-mail messages that was more efficient than the combination of UTF-8 with quoted-printable.

Motivation

MIME, the modern standard of e-mail format, forbids encoding of headers using byte values above the ASCII range. Although MIME allows encoding the message body in various character sets, the underlying transmission infrastructure is still not guaranteed to be 8-bit clean. Therefore, a non-trivial content transfer encoding has to be applied in case of doubt. Unfortunately, base64 has the disadvantage of making even US-ASCII characters unreadable in non-MIME clients. On the other hand, UTF-8 combined with quoted-printable produces a very size-inefficient format, requiring 6 to 9 bytes for non-ASCII characters from the BMP and 12 bytes for characters outside the BMP.
Provided certain rules are followed during encoding, UTF-7 can be sent in e-mail without using an underlying MIME transfer encoding, but still must be explicitly identified as the text character set. In addition, if used within e-mail headers such as "Subject:", UTF-7 must be contained in MIME encoded words identifying the character set. Since encoded words force use of either quoted-printable or base64, UTF-7 was designed to avoid using the = sign as an escape character to avoid double escaping when it is combined with quoted-printable.
UTF-7 is generally not used as a native representation within applications as it is very awkward to process. Despite its size advantage over the combination of UTF-8 with either quoted-printable or base64, the now defunct Internet Mail Consortium recommended against its use.
8BITMIME has also been introduced, which reduces the need to encode message bodies in a 7-bit format.
A modified form of UTF-7 is currently used in the IMAP e-mail retrieval protocol for mailbox names.

Description

UTF-7 was first proposed as an experimental protocol in RFC 1642, A Mail-Safe Transformation Format of Unicode. This RFC has been made obsolete by RFC 2152, an informational RFC which never became a standard. As RFC 2152 clearly states, the RFC "does not specify an Internet standard of any kind". Despite this, RFC 2152 is quoted as the definition of UTF-7 in the IANA's list of charsets. Neither is UTF-7 a Unicode Standard. The Unicode Standard 5.0 only lists UTF-8, UTF-16 and UTF-32.
There is also a modified version, specified in RFC 2060, which is sometimes identified as UTF-7.
Some characters can be represented directly as single ASCII bytes. The first group is known as "direct characters" and contains 62 alphanumeric characters and 9 symbols: ' ( ) , - . / : ?. The direct characters are safe to include literally. The other main group, known as "optional direct characters", contains all other printable characters in the range U+0020 to U+007E except ~ \ + and space. Using the optional direct characters reduces size and enhances human readability, but it also increases the chance of breakage by things like badly designed mail gateways and may require extra escaping when used in encoded words for header fields.
Space, tab, carriage return and line feed may also be represented directly as single ASCII bytes. However, if the encoded text is to be used in e-mail, care is needed to ensure that these characters are used in ways that do not require further content transfer encoding to be suitable for e-mail. The plus sign may be encoded as +-.
Other characters must be encoded in UTF-16, big-endian, and then in modified Base64. The start of these blocks of modified Base64 encoded UTF-16 is indicated by a + sign. The end is indicated by any character not in the modified Base64 set. If the character after the modified Base64 is a - then it is consumed by the decoder and decoding resumes with the next character. Otherwise decoding resumes with the character after the base64.
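
The delimiting rules above can be illustrated with a short Python sketch that splits UTF-7 text into directly encoded runs and modified Base64 blocks. The function name split_utf7 and the chunk labels are hypothetical, chosen for illustration only, and the sketch does not reject ill-formed input.

```python
import string

# Modified Base64 alphabet: A-Z, a-z, 0-9, '+' and '/' (no '=' padding).
B64_CHARS = set(string.ascii_letters + string.digits + "+/")


def split_utf7(text):
    """Split UTF-7 text into ('direct', s) and ('base64', s) chunks."""
    chunks = []
    i = 0
    while i < len(text):
        if text[i] != "+":
            # Direct run: everything up to the next '+' (or the end).
            j = text.find("+", i)
            if j == -1:
                j = len(text)
            chunks.append(("direct", text[i:j]))
            i = j
            continue
        # A '+' opens a block; collect modified Base64 characters.
        j = i + 1
        while j < len(text) and text[j] in B64_CHARS:
            j += 1
        body = text[i + 1:j]
        if body:
            chunks.append(("base64", body))
        elif j < len(text) and text[j] == "-":
            # '+-' is the escape for a literal plus sign.
            chunks.append(("direct", "+"))
        # A '-' immediately after the block is consumed by the decoder.
        if j < len(text) and text[j] == "-":
            j += 1
        i = j
    return chunks
```

For example, split_utf7("+AKM-1") yields one Base64 block "AKM" followed by the direct run "1"; the terminating - does not appear in the output.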

Examples

"Hello, World!" is encoded as "Hello, World+ACE-"
"1 + 1 = 2" is encoded as "1 +- 1 +AD0- 2"
"£1" is encoded as "+AKM-1". The Unicode code point for the pound sign is U+00A3, or 0000 0000 1010 0011 in binary. Regrouped into six-bit units this gives 000000 001010 0011; the last group is two bits short and is padded with zeros to 001100. The resulting values 0, 10 and 12 correspond to the modified Base64 characters A, K and M.

Algorithm for encoding and decoding

Encoding

First, an encoder must decide which characters to represent directly in ASCII form, which + signs have to be escaped as +-, and which characters to place in blocks of Unicode characters. A simple encoder may directly encode every character it considers safe for direct encoding. However, the cost of ending a Unicode sequence, outputting a single character directly in ASCII and then starting another Unicode sequence is 3 to 3⅔ bytes. This is more than the 2⅔ bytes needed to represent the character as part of a Unicode sequence. Each Unicode sequence must be encoded using the following procedure, then surrounded by the appropriate delimiters.
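Using the £† character sequence as an example:

Express the characters' Unicode numbers (UTF-16) in binary:

0x00A3 → 0000 0000 1010 0011
0x2020 → 0010 0000 0010 0000

Concatenate the binary sequences:

0000 0000 1010 0011 and 0010 0000 0010 0000 → 0000000010100011 0010000000100000

Regroup the binary into groups of six bits, starting from the left:

00000000101000110010000000100000 → 000000 001010 001100 100000 001000 00

If the last group has fewer than six bits, add trailing zeros:

000000 001010 001100 100000 001000 00 → 000000 001010 001100 100000 001000 000000

Replace each group of six bits with the corresponding Base64 code:

000000 001010 001100 100000 001000 000000 → AKMgIA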
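
The encoding procedure for a single Unicode block can be sketched in Python as follows. This is a minimal illustration working on bit strings for readability; the names B64_ALPHABET and encode_block are hypothetical, and the function returns only the block body without the surrounding + and - delimiters.

```python
import string

# Index-to-character table for Base64 (A-Z, a-z, 0-9, '+', '/').
B64_ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
                + string.digits + "+/")


def encode_block(chars):
    """Return the modified Base64 body for a run of characters."""
    # Big-endian UTF-16 code units of the characters.
    units = chars.encode("utf-16-be")
    # Concatenate all bits into one string.
    bits = "".join(f"{byte:08b}" for byte in units)
    # Pad the last group with trailing zeros to a multiple of six bits.
    bits += "0" * (-len(bits) % 6)
    # Regroup into six-bit groups and map each one to its Base64 code.
    groups = [bits[i:i + 6] for i in range(0, len(bits), 6)]
    return "".join(B64_ALPHABET[int(group, 2)] for group in groups)
```

Under these assumptions, encode_block("£†") returns "AKMgIA", so the full encoded form of the sequence would be "+AKMgIA-".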

Decoding

First, the encoded data must be separated into plain ASCII text chunks and nonempty Unicode blocks, as mentioned in the description section. Once this is done, each Unicode block must be decoded with the following procedure:

Express each Base64 code as the bit sequence it represents:

AKMgIA → 000000 001010 001100 100000 001000 000000

Regroup the binary into groups of sixteen bits, starting from the left:

000000 001010 001100 100000 001000 000000 → 0000000010100011 0010000000100000 0000

If there is an incomplete group at the end containing only zeros, discard it:

0000000010100011 0010000000100000

Each group of 16 bits is a character's Unicode number and can be expressed in other forms:

0000 0000 1010 0011 ≡ 0x00A3 ≡ 163₁₀
0010 0000 0010 0000 ≡ 0x2020 ≡ 8224₁₀
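
The decoding steps for one Unicode block can be sketched in Python in the same spirit. The names B64_ALPHABET and decode_block are hypothetical, and raising an error when the leftover bits are not all zero is an assumption of this sketch rather than something the text above prescribes.

```python
import string

# Index-to-character table for Base64 (A-Z, a-z, 0-9, '+', '/').
B64_ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
                + string.digits + "+/")


def decode_block(body):
    """Decode the modified Base64 body of one UTF-7 block."""
    # Express each Base64 code as the six bits it represents.
    bits = "".join(f"{B64_ALPHABET.index(code):06b}" for code in body)
    # Keep only complete groups of sixteen bits, counted from the left.
    complete = len(bits) - len(bits) % 16
    # Assumption: leftover bits must all be zero, otherwise reject.
    if "1" in bits[complete:]:
        raise ValueError("non-zero padding bits in UTF-7 block")
    # Turn the remaining bits back into bytes and read them as UTF-16BE.
    units = bytes(int(bits[i:i + 8], 2) for i in range(0, complete, 8))
    return units.decode("utf-16-be")
```

With the example above, decode_block("AKMgIA") gives back the two characters £ and †.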

Unicode signature

A Unicode signature is an optional special byte sequence at the very start of a stream or file that, without being data itself, indicates the encoding used for the data that follows; a signature is used in the absence of metadata that denotes the encoding. For a given encoding scheme, the signature is that scheme's representation of Unicode code point U+FEFF, the so-called byte order mark (BOM).
While a Unicode signature is typically a single, fixed byte sequence, the nature of UTF-7 necessitates 5 variations: The last 2 bits of the 4th byte of the UTF-7 encoding of U+FEFF belong to the following character, resulting in 4 possible bit patterns and therefore 4 different possible bytes in the 4th position. The 5th variation is needed to disambiguate the case where no characters at all follow the signature. See the UTF-7 entry in the table of Unicode signatures.
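
The five variations can be derived mechanically from the bits of U+FEFF. The following Python sketch does so; the function name utf7_signatures is hypothetical, and the fifth variation assumes the block is padded with zero bits and closed with an explicit - when nothing follows.

```python
import string

# Index-to-character table for Base64 (A-Z, a-z, 0-9, '+', '/').
B64_ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
                + string.digits + "+/")


def utf7_signatures():
    """List the possible UTF-7 encodings of a leading U+FEFF."""
    bom_bits = f"{0xFEFF:016b}"  # sixteen bits: 6 + 6 + 4
    head = ("+"
            + B64_ALPHABET[int(bom_bits[0:6], 2)]
            + B64_ALPHABET[int(bom_bits[6:12], 2)])
    # The 4th byte holds the last four BOM bits plus the first two
    # bits of whatever character follows: four possible values.
    fourth = [B64_ALPHABET[int(bom_bits[12:16] + f"{n:02b}", 2)]
              for n in range(4)]
    variants = [head + code for code in fourth]
    # Nothing follows: zero padding, then the block terminator.
    variants.append(head + fourth[0] + "-")
    return variants
```

Called with no arguments, this returns +/v8, +/v9, +/v+, +/v/ and +/v8-.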

Use on the web

In December 2018, UTF-7 was estimated to be used by less than 0.003% of sites on the World Wide Web, where UTF-8 has since 2009 been the dominant character encoding.

Security

UTF-7 allows multiple representations of the same source string. In particular, ASCII characters can be represented as part of Unicode blocks. As such, if standard ASCII-based escaping or validation processes are used on strings that may be later interpreted as UTF-7, then Unicode blocks may be used to slip malicious strings past them. To mitigate this problem, systems should perform decoding before validation and should avoid attempting to autodetect UTF-7.
Older versions of Internet Explorer can be tricked into interpreting the page as UTF-7. This can be used for a cross-site scripting attack as the < and > marks can be encoded as +ADw- and +AD4- in UTF-7, which most validators let through as simple text.
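
The problem can be reproduced with Python's built-in utf-7 codec. In this sketch, naive_is_safe is a hypothetical ASCII-level filter standing in for any validator that looks for angle brackets before decoding; it is not taken from a real library.

```python
# Markup hidden inside UTF-7 Unicode blocks: no '<' or '>' bytes appear.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"


def naive_is_safe(data):
    """Hypothetical filter: reject raw angle brackets only."""
    return b"<" not in data and b">" not in data


# The filter sees plain ASCII text and lets the payload through.
passed_filter = naive_is_safe(payload)

# A consumer that later interprets the bytes as UTF-7 recovers the markup.
decoded = payload.decode("utf-7")
```

Here passed_filter is True, yet decoded is the string <script>alert(1)</script>, which is why validation should happen after decoding.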
