Wide character

A wide character is a computer character datatype that generally has a size greater than the traditional 8-bit character. The increased datatype size allows for the use of larger coded character sets.

History

During the 1960s, mainframe and minicomputer manufacturers began to standardize around the 8-bit byte as their smallest datatype. Meanwhile, the 7-bit ASCII character set became the industry-standard method for encoding alphanumeric characters for teletype machines and computer terminals. As a result, the 8-bit byte became the de facto datatype for computer systems storing ASCII characters in memory.

Later, computer manufacturers began to make use of the spare bit to extend the ASCII character set beyond its limited set of English alphabet characters. 8-bit extensions such as IBM code page 37, PETSCII and ISO 8859 became commonplace, offering terminal support for Greek, Cyrillic, and many other scripts. However, such extensions were still limited in that they were region-specific and often could not be used in tandem. Special conversion routines had to be used to convert from one character set to another, often resulting in destructive translation when no equivalent character existed in the target set.

In 1989, the International Organization for Standardization began work on the Universal Character Set (UCS), a multilingual character set that could be encoded using either a 16-bit (2-byte) or 32-bit (4-byte) value. These larger values required a datatype larger than 8 bits to store the new character values in memory. Thus the term wide character was used to differentiate them from traditional 8-bit character datatypes.

Relation to UCS and Unicode

The term wide character refers only to the size of the datatype in memory. It does not state what character each stored value represents. That mapping is defined by a character set, and UCS and Unicode are simply two common character sets that contain more characters than an 8-bit value can represent.

Relation to multibyte characters

Just as earlier data transmission systems suffered from the lack of an 8-bit clean data path, modern transmission systems often lack support for 16-bit or 32-bit data paths for character data. This has led to character encoding systems such as UTF-8 that can use multiple bytes to encode a value that is too large for a single 8-bit symbol.

Size of a wide character

The Microsoft Windows application programming interfaces Win32 and Win64, as well as the Java and .NET Framework platforms, define wide characters as 16-bit values encoded using UTF-16 (a legacy of their earlier use of UCS-2), while modern Unix-like systems generally use 32-bit values encoded using UTF-32.

Programming specifics

C/C++

The C and C++ programming languages generally use the datatype wchar_t for wide characters. The ISO C standard leaves the width of wchar_t to the implementation: compilers targeting Windows use 16-bit values, while compilers for Unix-like systems generally use 32-bit values. The Unicode Standard, version 4.0 (which corresponds to ISO/IEC 10646:2003), says that:

"ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension."

and that

"The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers."

In the standard C library, the header files wchar.h and wctype.h declare the functions that deal with wide characters. Additional functions can also be found in stdio.h and stdlib.h.

Wide character and wide string literals must use the prefix L when defined in quotes. Some examples are:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
 
int main(void) {
  setlocale(LC_ALL, "");  // use the environment's locale so non-ASCII output can be converted
  wchar_t myChar1 = L'Ω';
  wchar_t myChar2 = 0x2126;  // code point U+2126 (OHM SIGN), given in hexadecimal
  wchar_t myString1[] = L"♠♣♥♦";
  wchar_t myString2[] = { 0x2660, 0x2663, 0x2665, 0x2666, 0x0000 };
  // null-terminated string ♠♣♥♦ given as hexadecimal code points
 
  wprintf(L"This is char: %lc \n", myChar1);
  wprintf(L"This is char: %lc \n", myChar2);
  wprintf(L"This is a wide string: %ls \n", myString1);
  wprintf(L"This is a wide string: %ls \n", myString2);
  return 0;
}

Python

According to Python's documentation, the language sometimes uses wchar_t as the basis for its character type Py_UNICODE. It depends on whether wchar_t is "compatible with the chosen Python Unicode build variant" on that system.[1]

Wikimedia Foundation. 2010.
