Cangjie input method
The Cangjie input method is a system by which Chinese characters may be entered into a computer using a standard keyboard. Invented in 1976 by Chu Bong-Foo, the method is named after Cangjie, the mythological inventor of the Chinese writing system; the name was suggested by Chiang Wei-kuo, then Defence Minister of Taiwan. Although the input method was initially based upon traditional Chinese characters, it has since been revamped so that Cangjie and the simplified Chinese character set can interact.
Municipal Services Building, Hong Kong with Cangjie hints printed on the lower-left corners of the keys.
Cangjie is the first Chinese input method to use the QWERTY keyboard. Earlier methods used large keyboards with 40 to 2400 keys, except the Four-Corner Method, which uses only the number keys. Chu saw that the QWERTY keyboard had become an international standard, and therefore believed that Chinese-language input had to be based on it.
Chu Bong-Foo released the patent of Cangjie in 1982, as he thought that it should belong to the Chinese cultural heritage. Cangjie has therefore become open-source software, free for anyone to use and modify, and it is available on virtually every computer system that supports traditional Chinese.
In filenames and elsewhere, the name Cangjie is sometimes abbreviated as cj.
Unlike pinyin, Cangjie is based on the graphological aspect of the characters: each basic, graphical unit is represented by a basic character component, 24 in all, each mapped to a particular letter key on a standard QWERTY keyboard. An additional "difficult character" function is mapped to the X key. Within the keystroke-to-character representations, there are four subsections of characters: the Philosophical Set, the Strokes Set, the Body-Related Set, and the Shapes Set.
The basic character components in Cangjie are usually called "radicals"; nevertheless, Cangjie decomposition is not based on traditional Kangxi radicals, nor is it based on standard stroke order; it is in fact a simple geometric decomposition.
Overview of the input method
The keys and "radicals"
The basic character components in Cangjie are called "radicals" or "letters". There are 24 radicals but 26 keys; the 24 radicals are associated with roughly 76 auxiliary shapes, which in many cases are either rotated or transposed versions of components of the basic shapes. For instance, the letter A can represent either itself, the slightly wider 曰, or a 90° rotation of itself.
Group | Key | Name | Primary meaning |
Philosophical group | A | 日 sun | 日, 曰, 90° rotated 日 |
Philosophical group | B | 月 moon | the top four strokes of 目, 冂, 爫, 冖, the top and top-left part of 炙, 然, and 祭, the top-left four strokes of 豹 and 貓, and the top four strokes of 骨 |
Philosophical group | C | 金 gold | itself, 丷, 八, and the penultimate two strokes of 四 and 匹 |
Philosophical group | D | 木 wood | itself, the first two strokes of 寸 and 才, the first two strokes of 也 and 皮 |
Philosophical group | E | 水 water | 氵, the last five strokes of 暴 and 康, 又 |
Philosophical group | F | 火 fire | the shape 小, 灬, the first three strokes in 當 and 光 |
Philosophical group | G | 土 land | itself, or 士 for soldier |
Stroke group | H | 竹 bamboo | The slant and short slant, the Kangxi radical 竹, namely the upper parts in 笨 and 節 |
Stroke group | I | 戈 dagger axe | The dot, the first three strokes in 床 and 庫, and the shape 厶 |
Stroke group | J | 十 ten | The cross shape and the shape 宀 |
Stroke group | K | 大 big | The X shape, including 乂 and the first two strokes of 右, as well as 疒 |
Stroke group | L | 中 centre | The vertical stroke, as well as 衤 and the first four strokes of 書 and 盡 |
Stroke group | M | 一 one | The horizontal stroke, as well as the final stroke of 孑 and 刁, the shape 厂, and the shape 工 |
Stroke group | N | 弓 bow | The crossbow and the hook |
Body parts group | O | 人 person | The dismemberment, the Kangxi radical 人, the first two strokes of 丘 and 乓, the first two strokes of 知, 攻, and 氣, and the final two strokes of 兆 |
Body parts group | P | 心 heart | The Kangxi radical 忄, the second stroke in 心, the last four strokes in 恭, 慕, and 忝, the shape 匕, the shape 七, the penultimate two strokes in 代, and the shape 勹 |
Body parts group | Q | 手 hand | The Kangxi radical 手 |
Body parts group | R | 口 mouth | The Kangxi radical 口 |
Character shapes group | S | 尸 corpse | 匚, the first two strokes of 己, the first stroke of 司 and 刀, the third stroke of 成 and 豕, the first four strokes of 長 and 髟 |
Character shapes group | T | 廿 twenty | Two vertical strokes connected by a horizontal stroke; the Kangxi radical 艸 when written as 艹 |
Character shapes group | U | 山 mountain | Three-sided enclosure with an opening on the top |
Character shapes group | V | 女 woman | A hook to the right, a V shape, the last three strokes in 艮, 衣, and 長 |
Character shapes group | W | 田 field | Itself, as well as any four-sided enclosure with something inside it, including the first two strokes in 母 and 毋 |
Character shapes group | Y | 卜 fortune telling | The 卜 shape and rotated forms, the shape 辶, the first two strokes in 斗 |
Collision/Difficult key* | X | 重/難 collision/difficult | disambiguation of Cangjie code decomposition collisions, code for a "difficult-to-decompose" part |
Special character key* | Z | | Auxiliary code used for entering special characters. In most cases, this key combined with other keys produces Chinese punctuation marks. Note: Some variants use Z as a collision key instead of X. In those systems, Z has the name "collision" and X has the name "difficult"; however, the use of Z as a collision key is neither part of the original Cangjie nor found in current mainstream implementations. In other variants, Z may have the name "user-defined" or some other name. |
Wildcard | Shift + 8 | Wildcard | It can replace any key from the 2nd to the 5th position and returns a list of characters matching the combination. It is useful when the typist is sure of the first and last codes but not of those in between. |
The auxiliary shapes of each Cangjie radical have changed slightly between different versions of the Cangjie method; this is one reason that different versions of the Cangjie method are not completely compatible.
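The key assignments in the table above can be captured in a small lookup table. The following Python sketch is illustrative only: the names `KEY_TO_RADICAL`, `keys_to_radicals`, and `radicals_to_keys` are hypothetical, and the X key is represented by 難 alone although the table lists it as 重/難. Auxiliary shapes are not modelled.

```python
# Key-to-radical names, copied from the table above.
# Z (special characters) and the wildcard are deliberately left out.
KEY_TO_RADICAL = {
    "A": "日", "B": "月", "C": "金", "D": "木", "E": "水", "F": "火", "G": "土",  # philosophical
    "H": "竹", "I": "戈", "J": "十", "K": "大", "L": "中", "M": "一", "N": "弓",  # strokes
    "O": "人", "P": "心", "Q": "手", "R": "口",                                  # body parts
    "S": "尸", "T": "廿", "U": "山", "V": "女", "W": "田", "Y": "卜",            # character shapes
    "X": "難",  # collision/difficult key (assumption: shown as 難 only)
}

# Reverse mapping, from radical name back to the letter key.
RADICAL_TO_KEY = {radical: key for key, radical in KEY_TO_RADICAL.items()}


def keys_to_radicals(code: str) -> str:
    """Convert a letter code such as 'JWJ' into space-separated radical names."""
    return " ".join(KEY_TO_RADICAL[key] for key in code.upper())


def radicals_to_keys(radicals: str) -> str:
    """Convert space-separated radical names back into a letter code."""
    return "".join(RADICAL_TO_KEY[name] for name in radicals.split())


if __name__ == "__main__":
    print(keys_to_radicals("JWJ"))
    print(radicals_to_keys("卜 口 竹 竹 戈"))
```

Keeping both directions in plain dictionaries makes it easy to swap in the slightly different assignments used by other Cangjie versions.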
Chu Bong-Foo has provided alternative names for some letters according to their characteristics; for example, H is also called 斜, which means slant. The names form a rhyme that helps learners memorize the letters, with each group on one line.
Keyboard layout
The basic rules
The typist must be familiar with several decomposition rules (拆字規則) that define how to analyse a character to arrive at a Cangjie code.
- Direction of decomposition: left to right, top to bottom, and outside to inside
- Geometrically connected forms: take 4 Cangjie codes, namely the first, second, third, and last codes
- Geometrically unconnected forms that can be broken into two subforms: identify the two geometrically connected subforms according to the direction of decomposition rules, then take the first and last codes of the first subform and the first, second, and last code of the second subform.
- Geometrically unconnected forms that can be broken into multiple subforms: identify the first geometrically connected subform according to the direction of decomposition rules and take the first and last codes of that form. Next, break the remainder into subforms and take the first and last codes of the first subform and the last code of the last subform.
- Conciseness — if two decompositions are possible, the shorter decomposition is correct
- Completeness — if two decompositions of the same length are possible, the one that identifies a more complex form first is the correct decomposition
- Reflection of the form of the radical — the decomposition should reflect the shape of the radical, meaning using the same code twice or more should be avoided if possible, and the shape of the character should not be "cut" at a corner in the form
- Omission of codes
- * Partial omission — when the number of codes in a complete decomposition would exceed the permitted number of codes, the extra codes are ignored
- * Omission in enclosed forms — when a part of the character being decomposed is an enclosed form, only the shape of the enclosure is decomposed; the enclosed forms are omitted
Examples
- 車
- * This character is geometrically connected, consisting of one part with a vertical structure, so we take the first, second, and last Cangjie codes from top to bottom.
- * The Cangjie code is thus 十 田 十 (JWJ); in this example each code corresponds to its basic shape.
- 謝
- * This character consists of geometrically unconnected parts arranged horizontally. For the initial decomposition, we treat it as two parts, 言 and 射.
- * The first part, 言, is geometrically unconnected from top to bottom; we take the first and last parts and arrive at 卜 口.
- * The second part is again geometrically unconnected, arranged horizontally. The two parts are 身 and 寸.
- ** For the first part of this second part, 身, we take the first and last codes. Both are slants and therefore H; the first and last codes are thus 竹 竹.
- ** For the second part of the original second part, 寸, we take only the last part. Because this is geometrically unconnected and consists of two parts, the first part is the outer form while the second part is the dot in the middle. The dot is I, and therefore the last code is 戈.
- * The Cangjie code is thus 卜 口 竹 竹 戈, or YRHHI.
- 谢
- * This example is identical to the above, except that the first part is 讠; the first and last codes are 戈 and 女.
- * Repeating the same steps as in the above example, we get 戈 女 竹 竹 戈, or IVHHI.
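The code-selection rules above can be sketched in Python. This is a minimal illustration, not the logic of any actual IME: the function names are hypothetical, and the caller is assumed to have already split the character into geometrically connected subforms and to supply the complete code string of each. The full codes used for 言 (YMMR) and 寸 (DI) are assumptions made for the demonstration.

```python
def first_last(codes: str) -> str:
    """First and last codes of a subform (a single code is taken only once)."""
    return codes if len(codes) <= 2 else codes[0] + codes[-1]


def first_second_last(codes: str) -> str:
    """First, second, and last codes of a subform."""
    return codes if len(codes) <= 3 else codes[:2] + codes[-1]


def first_three_last(codes: str) -> str:
    """First, second, third, and last codes of a connected form."""
    return codes if len(codes) <= 4 else codes[:3] + codes[-1]


def select_codes(subforms: list[str]) -> str:
    """Pick at most five codes from the full codes of each connected subform."""
    if not subforms:
        raise ValueError("at least one subform is required")
    if len(subforms) == 1:
        # Geometrically connected form: first, second, third, and last codes.
        return first_three_last(subforms[0])
    head, rest = subforms[0], subforms[1:]
    if len(rest) == 1:
        # Two subforms: first and last of the first; first, second, last of the second.
        return first_last(head) + first_second_last(rest[0])
    # Multiple subforms: first and last of the first subform, first and last of
    # the next subform, and the last code of the last subform.
    return first_last(head) + first_last(rest[0]) + rest[-1][-1]


if __name__ == "__main__":
    print(select_codes(["JWJ"]))                # 車
    print(select_codes(["YMMR", "HXH", "DI"]))  # 謝 as 言 + 身 + 寸
    print(select_codes(["IV", "HXH", "DI"]))    # 谢 as 讠 + 身 + 寸
```

The conciseness, completeness, and enclosed-form rules govern how the subforms are identified in the first place, so they are outside the scope of this sketch.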
The short list of exceptions
Some forms are always decomposed in the same way, whether the rules say they should be decomposed this way or not.
The number of such exceptions is small.
Some forms cannot be decomposed. They are represented by an X.
Form | Fixed decomposition |
臼 | HX |
與 | HXYC |
興 | HXBC |
盥 | HXBT |
姊 | VLXH |
齊 | YX |
兼 | TXC |
鹿 | IXP |
身 | HXH |
卍 | NX |
黽 | RXU |
龜 | NXU |
廌 | IXF |
慶 | IXE |
淵 | ELXL |
肅 | LX |
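Because these decompositions are fixed, an implementation would plausibly consult them before applying the general rules. The sketch below assumes exactly that ordering; `FIXED_CODES` copies the table above, while `decompose` and its `rule_based` callback are hypothetical names introduced for illustration.

```python
from typing import Callable

# Fixed decompositions, copied from the table above.
FIXED_CODES = {
    "臼": "HX", "與": "HXYC", "興": "HXBC", "盥": "HXBT",
    "姊": "VLXH", "齊": "YX", "兼": "TXC", "鹿": "IXP",
    "身": "HXH", "卍": "NX", "黽": "RXU", "龜": "NXU",
    "廌": "IXF", "慶": "IXE", "淵": "ELXL", "肅": "LX",
}


def decompose(form: str, rule_based: Callable[[str], str]) -> str:
    """Return the fixed code if one exists, else fall back to the rules."""
    fixed = FIXED_CODES.get(form)
    if fixed is not None:
        return fixed
    return rule_based(form)


if __name__ == "__main__":
    # A stand-in rule engine that knows only one character.
    toy_rules = {"車": "JWJ"}
    print(decompose("身", toy_rules.__getitem__))
    print(decompose("車", toy_rules.__getitem__))
```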
Early Cangjie system
In the beginning, the Cangjie input method was not a way to produce a character in any character set. It was, instead, an integrated system consisting of the Cangjie input rules and a Cangjie controller board. The controller board contains character generator firmware, which dynamically generates Chinese characters from Cangjie codes when characters are output, using the hi-res graphics mode of an Apple II computer.
In the preface of the Cangjie user's manual, Chu Bong-Foo wrote in 1982:
In this early system, when the user types "yk " to get the Chinese character 文, the Cangjie codes are not converted to any character encoding; the actual string "yk " is stored. In a very real sense, the Cangjie code of each character is the encoding of that particular character. Other characters are never recorded.
A particular "feature" of this early system is that if you send random lowercase words to the character generator, it will attempt to construct Chinese characters according to the Cangjie decomposition rules, sometimes causing strange, unknown characters to appear. This unusual feature, "automatic generation of characters", is actually described in the manual and is responsible for producing more than 10,000 of the about 15,000 characters that the system can handle. The name Cangjie, evocative of creation of new characters, was actually very apt for this early version of Cangjie.
The presence of the integrated character generator also explains the historical necessity for the existence of the "X" key used for disambiguation of decomposition collisions: because characters are "chosen" when the codes are output, every character that can be displayed must in fact have a unique Cangjie decomposition. It would not make sense—nor would it be practical—for the system to provide a choice of candidate characters when some random text file is displayed; the user would not know which of the candidates are correct.
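The idea of storing the codes themselves and choosing glyphs only at output time can be sketched as follows. This is a rough illustration under stated assumptions: the `GLYPHS` dictionary merely stands in for the character generator firmware (which constructed glyphs dynamically rather than looking them up), the `render` function is hypothetical, and the `?` placeholder for unknown codes is an invention of this sketch.

```python
# Stand-in for the character generator: one code, exactly one character.
GLYPHS = {
    "yk": "文",
    "jwj": "車",
    "ab": "明",
}


def render(stored: str) -> str:
    """Turn stored text such as 'yk jwj ' into displayable characters.

    Each character is stored as its lowercase Cangjie code followed by a
    space, so splitting on spaces recovers the individual codes.
    """
    output = []
    for code in stored.split(" "):
        if not code:
            continue  # skip the empty piece after the final space
        output.append(GLYPHS.get(code, "?"))  # '?' is a placeholder only
    return "".join(output)


if __name__ == "__main__":
    print(render("yk jwj "))
```

Because rendering is a plain lookup from code to glyph, two characters sharing one code could not be told apart at display time, which is the uniqueness requirement described above.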
Issues
Cangjie was designed to be an easy-to-use system to help promote the use of Chinese computing; nevertheless, many users find Cangjie to be a difficult method. Many of the perceived difficulties arise from poor instruction.
Perceived difficulties
- In order to input using Cangjie, one must know not only the names of the radicals, but also all their auxiliary shapes. It is common to find tables of the Cangjie radicals with their auxiliary shapes taped onto the monitors of casual computer users.
- One must also be familiar with the decomposition rules; a lot of casual computer users are not even aware of the existence of decomposition rules but rather type by guessing. This makes Cangjie difficult for them.
A typist with sufficient practice in Cangjie touch-types, much like one typing English. It is entirely possible for a touch-typist to type at 25 Chinese characters per minute or better in Cangjie, yet have difficulty remembering the list of auxiliary shapes or even the decomposition rules. Experienced Cangjie typists can reportedly attain typing speeds ranging from 60 characters per minute (cpm) to over 200 cpm.
There are also difficulties arising from current implementations of Cangjie:
- The decomposition of a character depends on a predefined set of "standard shapes"; however, because Cangjie is used in many different countries, the standard shape of a certain character in Cangjie is not always the standard shape the user has learnt for that character. Learning Cangjie then entails learning not only Cangjie itself, but also unfamiliar standard shapes for some characters. The Cangjie IME does not handle mistakes in decomposition except by telling the user that there is a mistake.
- Punctuation marks are not geometrically decomposed, but rather given random-looking codes that begin with ZX followed by a string of three letters related to the ordering of the characters in the Big5 code. Typing punctuation marks in Cangjie thus becomes a frustrating exercise in either memorization or pick-and-peck. On modern systems, however, this is solved by using an on-screen virtual keyboard.
Actual difficulties
- Cangjie is error-intolerant: Cangjie does not include any commonly made errors as alternative codes. For example, if one does not decompose 方 from top to bottom into YHS, but instead types YSH following the stroke order, Cangjie does not return the character 方 as a choice.
- The user cannot type a character which they have forgotten how to write. This, of course, is not a specific problem with Cangjie but rather a problem universal to all non-phonetic input methods.
Versions of Cangjie
The Cangjie input method is commonly said to have gone through five generations, each of which is slightly incompatible with the others. Currently, version 3 is the most common; it is the version of Cangjie supported natively by Microsoft Windows. Version 5, supported by the Free Cangjie IME and previously the only Cangjie supported by SCIM, represents a significant minority method and is also supported by iOS.
The early Cangjie system supported by the Zero One card on the Apple II was Version 2; Version 1 was never released.
The Cangjie input method supported on the classic Mac OS is somewhat like Version 3 and somewhat like Version 5.
Version 5, like the original Cangjie input method, was created directly by Chu, the inventor. Chu had hoped that the release of Version 5, originally slated to be Version 6, would bring an end to the “more than ten versions of Cangjie input method”.
Version 6 has not yet been released to the public, but is being used to create a database which can accurately store every historical Chinese text.
Variants of Cangjie
Most modern implementations of Cangjie IMEs provide various convenient features:
- Some IMEs list all characters beginning with the code typed so far; for example, if the user types A, the system lists all characters whose Cangjie code begins with A, so that the correct character can be selected if it is on the screen; if the user types another A, the list is shortened to all characters whose code begins with AA. Examples of such implementations include the IME in Mac OS X, and SCIM.
- Some IMEs provide one or more wildcard keys, usually but not always * and/or ?, that allow the user to omit part of the Cangjie code; the system then displays a list of matching characters for the user to choose from. Examples include xcin, SCIM, and the IME in the Founder typesetting systems. Microsoft Windows's "standard" Changjie IME allows * to substitute for in-between characters, while New Changjie IME allows * as a wildcard anywhere except for the first character.
- Some IMEs provide an "abbreviation" feature, where impossible Cangjie codes are interpreted as abbreviations for the Cangjie codes of more than one character. This allows more characters to be input with fewer keys. An example is SCIM.
- Some IMEs provide an "association" feature, where the system anticipates what the user is going to type next and provides a list of characters or even phrases associated with what the user has typed. An example is the Microsoft Changjie IME.
- Some IMEs present the list of candidate characters differently depending on the frequency of character use. An example is the Cangjie IME in NJStar.
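The prefix-listing and wildcard features described above amount to simple filters over a code table. The sketch below is illustrative only: `CODE_TABLE` is a tiny hand-made sample with one character per code (real tables are far larger and may map one code to several characters), the function names are hypothetical, and Python's standard `fnmatch` pattern syntax stands in for whatever matching a given IME actually performs.

```python
from fnmatch import fnmatchcase

# Tiny sample table: Cangjie code -> character (an assumption for the demo).
CODE_TABLE = {
    "A": "日", "AA": "昌", "AAA": "晶", "AB": "明",
    "JWJ": "車", "YHS": "方", "YK": "文", "YRHHI": "謝",
}


def candidates_by_prefix(prefix: str) -> list[str]:
    """List every character whose code begins with the keys typed so far."""
    return [char for code, char in sorted(CODE_TABLE.items())
            if code.startswith(prefix)]


def candidates_by_wildcard(pattern: str) -> list[str]:
    """List every character whose code matches a pattern such as 'Y*I'.

    '*' matches any run of codes (including none) and '?' exactly one code.
    """
    return [char for code, char in sorted(CODE_TABLE.items())
            if fnmatchcase(code, pattern)]


if __name__ == "__main__":
    print(candidates_by_prefix("A"))
    print(candidates_by_prefix("AA"))
    print(candidates_by_wildcard("Y*I"))
```

Sorting by code keeps the candidate order stable; an IME that ranks by frequency of use would sort on a usage count instead.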
There have also been various attempts to "simplify" Cangjie one way or another:
- Simplified Cangjie has the same radicals, auxiliary shapes, decomposition rules, and short list of exceptions as Cangjie, but only the first and last codes are used if more than two codes are required in Cangjie.
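The first-and-last rule of Simplified Cangjie is easy to express in code. In this hedged sketch, `quick_code` and `build_quick_table` are hypothetical names and `SAMPLE_CODES` is a small hand-picked set of full Cangjie codes; it also shows why shortened codes need a candidate list, since several characters can collapse onto the same two keys.

```python
from collections import defaultdict


def quick_code(full_code: str) -> str:
    """Keep only the first and last codes when there are more than two."""
    if len(full_code) <= 2:
        return full_code
    return full_code[0] + full_code[-1]


# Sample of full Cangjie codes (an assumption made for the demonstration).
SAMPLE_CODES = {"昌": "AA", "晶": "AAA", "明": "AB", "謝": "YRHHI"}


def build_quick_table(codes: dict[str, str]) -> dict[str, list[str]]:
    """Group characters by their shortened code."""
    table = defaultdict(list)
    for char, full_code in codes.items():
        table[quick_code(full_code)].append(char)
    return dict(table)


if __name__ == "__main__":
    print(quick_code("YRHHI"))
    print(build_quick_table(SAMPLE_CODES))
```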
Applications
One direct application of decomposed characters is computing the similarity between written Chinese characters, and the Cangjie input method offers a good starting point for this. By relaxing the limit of five codes per character and adopting more detailed Cangjie codes, visually similar characters can be identified computationally. Integrating this with pronunciation information enables computer-assisted learning of Chinese characters.
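As a rough illustration of the idea, the sketch below scores two characters by how much of their code sequences they share. It is not the method used in any published system: the similarity measure (the ratio from Python's `difflib.SequenceMatcher`) is swapped in for convenience, and the unabbreviated codes in `DETAILED_CODES` are assumptions formed by concatenating the codes of each character's parts.

```python
from difflib import SequenceMatcher

# Assumed "detailed" codes with the five-code limit relaxed.
DETAILED_CODES = {
    "謝": "YMMRHXHDI",  # 言 + 身 + 寸
    "谢": "IVHXHDI",    # 讠 + 身 + 寸
    "車": "JWJ",
}


def code_similarity(char_a: str, char_b: str) -> float:
    """Return a score in [0, 1]; higher means more shared components."""
    code_a = DETAILED_CODES[char_a]
    code_b = DETAILED_CODES[char_b]
    return SequenceMatcher(None, code_a, code_b).ratio()


def most_similar(char: str) -> str:
    """Find the other character in the table with the highest score."""
    others = [other for other in DETAILED_CODES if other != char]
    return max(others, key=lambda other: code_similarity(char, other))


if __name__ == "__main__":
    print(most_similar("謝"))
```

A practical system would likely weight components by their position and size, which a flat string comparison ignores.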