Extended Unix Code

Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese.

The structure of EUC is based on the ISO-2022 standard, which specifies a way to represent character sets containing a maximum of 94 characters, or 8836 (94²) characters, or 830584 (94³) characters, as sequences of 7-bit codes. Only ISO-2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme. G0 is almost always an ISO-646 compliant coded character set (e.g. US-ASCII/KS X 1003/ISO 646:KR in EUC-KR and US-ASCII/the "lower half" of JIS X 0201 in EUC-JP) that is invoked on GL (i.e. with the most significant bit cleared).

To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code.
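The bit-setting step can be illustrated with a minimal Python sketch. The function names `iso2022_to_euc` and `is_g0_byte` are hypothetical, and the sketch assumes the input is the 7-bit code of a single character whose bytes all lie in 0x21–0x7E. The sample value is the 7-bit GB2312 code of 中 (0x56 0x50); Python's built-in `gb2312` codec is used only to check the result.

```python
def iso2022_to_euc(seven_bit: bytes) -> bytes:
    """Set the most significant bit of every byte of a 7-bit ISO 2022 code."""
    for b in seven_bit:
        if not 0x21 <= b <= 0x7E:
            raise ValueError(f"not a 7-bit graphic byte: {b:#04x}")
    # OR-ing with 0x80 is the same as adding 128 to a 7-bit value.
    return bytes(b | 0x80 for b in seven_bit)


def is_g0_byte(b: int) -> bool:
    """True when the byte belongs to the one-byte ISO 646 set (MSB clear)."""
    return b < 0x80


# GB2312 assigns the 7-bit code 0x56 0x50 to the character 中.
euc = iso2022_to_euc(b"\x56\x50")
print(euc.hex())             # d6d0
print(euc.decode("gb2312"))  # 中
```

Because every byte of a multibyte character has its high bit set, a single comparison such as `is_g0_byte` is enough to tell ISO 646 bytes apart from the rest.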

The most commonly used EUC codes are variable-width encodings in which a character belonging to G0 (an ISO-646 compliant coded character set) takes one byte and a character belonging to G1 (occupied by a 94×94 coded character set) is represented in two bytes. The EUC-CN form of GB2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes, whereas a single character in EUC-TW can take up to four bytes.

EUC-CN

EUC-CN is the usual way to use the GB2312 standard for simplified Chinese characters. Unlike the case of Japanese, the ISO-2022 form of GB2312 is not normally used, though a variant form called HZ was sometimes used on USENET.
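The relationship between EUC-CN and HZ can be sketched in Python as follows. This is an illustration under stated assumptions: the `~{` and `~}` delimiters come from the HZ specification (RFC 1843), not from the text above, and Python's built-in `gb2312` and `hz` codecs stand in for a real implementation.

```python
text = "中文"

# Python's "gb2312" codec produces the EUC-CN form: every byte has its MSB set.
euc_cn = text.encode("gb2312")

# Clearing the MSB recovers the 7-bit GB2312 codes, which HZ wraps in
# the ASCII delimiters "~{" and "~}" (an assumption taken from RFC 1843).
seven_bit = bytes(b & 0x7F for b in euc_cn)
hz = b"~{" + seven_bit + b"~}"

print(euc_cn.hex())     # d6d0cec4
print(hz)               # b'~{VPND~}'
print(hz.decode("hz"))  # 中文
```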

EUC-CN can also be used to encode the Unicode-based GB18030 character encoding, which includes traditional characters, although GB18030 is more frequently used without EUC encoding, since GB18030 is already a Unicode encoding. However, GB18030 encoded in EUC-CN is a variable-width encoding, because GB18030 contains more than 8836 (94×94) characters.

Related encoding systems

An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB2312, but is not ISO 2022–compliant and therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is therefore more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting.

EUC-JP

EUC-JP is a variable-width encoding used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212, and JIS X 0201.

* A character from JIS X 0208 (code set 1) is represented by two bytes, both in the range 0xA1 – 0xFE.
* A character from JIS X 0212 (code set 3) is represented by three bytes, the first being 0x8F, the following two in the range 0xA1 – 0xFE.
* A character from the "upper half" of JIS X 0201 (half-width kana, code set 2) is represented by two bytes, the first being 0x8E, the second in the range 0xA1 – 0xDF.
* A character from the "lower half" of JIS X 0201 (ASCII, code set 0) is represented by one byte, in the range 0x21 – 0x7E.
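The four rules above can be sketched as a small Python scanner that splits an EUC-JP byte string into characters and labels each with its code set. The function name `split_euc_jp` is hypothetical. As a simplifying assumption, every byte below 0x80 (including space and control characters, which the list does not cover) is treated as a one-byte code set 0 character.

```python
def split_euc_jp(data: bytes):
    """Return a list of (code_set, raw_bytes) pairs for an EUC-JP byte string."""
    result = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:              # code set 0: one byte (assumption: any 7-bit byte)
            code_set, size = 0, 1
        elif lead == 0x8E:           # code set 2: half-width kana, two bytes
            code_set, size = 2, 2
        elif lead == 0x8F:           # code set 3: JIS X 0212, three bytes
            code_set, size = 3, 3
        elif 0xA1 <= lead <= 0xFE:   # code set 1: JIS X 0208, two bytes
            code_set, size = 1, 2
        else:
            raise ValueError(f"invalid lead byte {lead:#04x} at offset {i}")

        chunk = data[i:i + size]
        if len(chunk) != size:
            raise ValueError(f"truncated character at offset {i}")

        # Bytes after the lead byte must be in GR; kana stop at 0xDF.
        upper = 0xDF if code_set == 2 else 0xFE
        if any(not 0xA1 <= b <= upper for b in chunk[1:]):
            raise ValueError(f"invalid trailing byte at offset {i}")

        result.append((code_set, chunk))
        i += size
    return result


sample = b"A\xa4\xa2\x8e\xb1"          # "A", "あ", half-width "ｱ"
for code_set, raw in split_euc_jp(sample):
    print(code_set, raw.hex(), raw.decode("euc_jp"))
```

The lead byte alone determines the length of each character, which is why no escape sequences or shift state need to be tracked.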

This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape characters employed by ISO-2022-JP, which is based on the same character set standards.

In Japan, the EUC-JP encoding is heavily used by Unix or Unix-like operating systems (except for HP-UX), while Shift_JIS or its extensions (Windows code page 932 and MacJapanese) are used on other platforms. Therefore, whether Japanese web sites use EUC-JP or Shift_JIS often depends on what OS the author uses.

EUC-JISX0213 is similar to EUC-JP but differs in that the two planes of JIS X 0213 take the place of JIS X 0208 and JIS X 0212. There is a similar relationship between Shift_JIS and Shift-JISX0213.

EUC-KR

EUC-KR is a variable-width encoding to represent Korean text using two coded character sets, KS X 1001 (formerly KS C 5601) [http://examples.oreilly.com/cjkvinfo/AppL/ksx1001.pdf KS X 1001:1992] [http://www.itscj.ipsj.or.jp/ISO-IR/149.pdf KS C 5601:1987, 1988-10-01] and KS X 1003 (formerly KS C 5636)/ISO 646:KR/US-ASCII. KS X 2901 (formerly KS C 5861) stipulates the encoding and RFC 1557 dubbed it EUC-KR. A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1-0xFE) and a character from KS X 1003/US-ASCII (G0, code set 0) takes one byte in GL (0x21-0x7E).
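As a rough illustration, the following Python sketch splits an EUC-KR byte string into one-byte and two-byte characters and recovers the position of each two-byte character in the 94×94 grid of KS X 1001. The names `split_euc_kr` and `row_cell` are hypothetical, and treating every byte below 0x80 (including space) as a one-byte character is a simplifying assumption.

```python
def split_euc_kr(data: bytes):
    """Split an EUC-KR byte string into per-character byte chunks."""
    chunks = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:                     # G0: one byte, MSB clear
            size = 1
        elif 0xA1 <= data[i] <= 0xFE:          # G1: two bytes in GR
            size = 2
            if i + 1 >= len(data) or not 0xA1 <= data[i + 1] <= 0xFE:
                raise ValueError(f"invalid two-byte character at offset {i}")
        else:
            raise ValueError(f"invalid lead byte {data[i]:#04x} at offset {i}")
        chunks.append(data[i:i + size])
        i += size
    return chunks


def row_cell(chunk: bytes):
    """Return the 1-based (row, cell) of a two-byte character in the 94x94 grid."""
    return chunk[0] - 0xA0, chunk[1] - 0xA0


sample = b"\xc7\xd1\xb1\xdb OK"                # "한글 OK"
for chunk in split_euc_kr(sample):
    print(chunk.hex(), chunk.decode("euc_kr"))
```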

It is the most widely used legacy character encoding in Korea on all three major platforms (Unix-like operating systems, Windows and Mac), but its use has been slowly decreasing as UTF-8 gains popularity, especially on Linux and Mac OS X. It is usually referred to as Wansung (완성) in South Korea. The default Korean code page for Windows (code page 949) is a proprietary but upward-compatible extension of EUC-KR referred to as Unified Hangeul Code (통합 완성형, Tonghab Wansunghyung). The Mac Korean encoding used in classic Mac OS is also compatible with EUC-KR.

EUC-TW

EUC-TW is a variable-width encoding that supports US-ASCII and 16 planes of CNS 11643, each of which is 94×94. It is a rarely used encoding for traditional Chinese characters as used in Taiwan; Big5 is much more common. A character in US-ASCII (G0, code set 0) is encoded as a single byte in GL (0x21-0x7E) and a character in CNS 11643 plane 1 (code set 1) is encoded as two bytes in GR (0xA1-0xFE). A character in planes 1 through 16 of CNS 11643 (code set 2) is encoded as four bytes, with the first byte always being 0x8E (Single Shift 2) and the second byte indicating the plane (the plane number is obtained by subtracting 0xA0 from the second byte). The third and fourth bytes are in GR (0xA1-0xFE). Note that plane 1 of CNS 11643 is therefore encoded twice: once as code set 1 and once as part of code set 2.
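The byte layout described above can be sketched as a small Python parser. The name `parse_euc_tw_char` and its return shape are hypothetical. It returns the CNS 11643 plane (or `None` for a one-byte character), the 7-bit code within that plane, and the number of bytes consumed. Treating every byte below 0x80 as a one-byte character is a simplifying assumption.

```python
SS2 = 0x8E  # Single Shift 2, the lead byte of every four-byte character


def _in_gr(b: int) -> bool:
    return 0xA1 <= b <= 0xFE


def parse_euc_tw_char(data: bytes, pos: int = 0):
    """Return (plane, seven_bit_code, size) for the character starting at pos."""
    lead = data[pos]
    if lead < 0x80:                                 # code set 0: US-ASCII
        return None, data[pos:pos + 1], 1
    if lead == SS2:                                 # code set 2: planes 1 to 16
        chunk = data[pos:pos + 4]
        if len(chunk) != 4:
            raise ValueError("truncated four-byte character")
        plane = chunk[1] - 0xA0                     # second byte selects the plane
        if not 1 <= plane <= 16 or not (_in_gr(chunk[2]) and _in_gr(chunk[3])):
            raise ValueError("invalid four-byte character")
        return plane, bytes(b & 0x7F for b in chunk[2:]), 4
    chunk = data[pos:pos + 2]                       # code set 1: plane 1
    if len(chunk) == 2 and _in_gr(chunk[0]) and _in_gr(chunk[1]):
        return 1, bytes(b & 0x7F for b in chunk), 2
    raise ValueError("invalid byte sequence")


# The same plane 1 character in its two-byte and four-byte forms.
short_form = parse_euc_tw_char(b"\xa4\xa1")
long_form = parse_euc_tw_char(b"\x8e\xa1\xa4\xa1")
print(short_form[:2] == long_form[:2])              # True
```

The final comparison shows the double encoding of plane 1: both byte sequences resolve to the same plane and the same 7-bit code, so a consumer that compares raw bytes may want to normalize to one form first.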

See also

*CJK
*Japanese language and computers
*Korean language and computers
*Chinese character encoding

External links

* [http://www.rikai.com/library/kanjitables/kanji_codes.euc.shtml EUC-JP codeset table] (non-ascii part)
* [http://developers.sun.com/dev/gadc/technicalpublications/articles/gb18030.html GB18030-2000 — The New Chinese National Standard]
* [http://www.jagat.or.jp/asia/report/China3.htm The New Generation of Pre-Press Software in China] — mentions the 748 code
* [http://www.cns11643.gov.tw/web/word.jsp#euc Description of the EUC-TW code] (in Chinese)
* [http://search.cpan.org/~dankogai/Encode-JIS2K-0.02/JIS2K.pm Manual page of EUC-JISX0213] in Perl Encode module
* [http://www.opengroup.or.jp/jvc/cde/euc-e.html EUC-JP code range chart] at Opengroup Japan
* [http://www.itscj.ipsj.or.jp/ISO-IR/2-4.htm International Register of Coded Character Sets] — The coded character sets of China, Japan, South Korea, North Korea and Taiwan (ISO/IEC)
* [http://examples.oreilly.com/cjkvinfo/doc/cjk.inf Chinese, Japanese, and Korean character set standards and encoding systems]


Wikimedia Foundation. 2010.
