Unicode and HTML

Web pages authored using hypertext markup language (HTML) may contain multilingual text represented with the Unicode universal character set.

The relationship between Unicode and HTML tends to be a difficult topic for many computer professionals, document authors, and web users alike. The accurate representation of text in web pages from different natural languages and writing systems is complicated by the details of character encoding, markup language syntax, fonts, and varying levels of support by web browsers.

HTML document characters

Web pages are typically HTML or XHTML documents. Both types of documents consist, at a fundamental level, of characters, which are graphemes and grapheme-like units, independent of how they manifest in computer storage systems and networks.

An HTML document is a sequence of Unicode characters. More specifically, HTML 4.0 documents are required to consist of characters in the HTML "document character set": a character repertoire wherein each character is assigned a unique, non-negative integer "code point". This set is defined in the HTML 4.0 DTD, which also establishes the syntax (allowable sequences of characters) that can produce a valid HTML document. The HTML document character set for HTML 4.0 consists of most, but not all, of the characters jointly defined by Unicode and ISO/IEC 10646: the Universal Character Set (UCS).

Like an HTML document, an XHTML document is a sequence of Unicode characters. However, an XHTML document is an XML document, which, while not having an explicit "document character" layer of abstraction, nevertheless relies upon a similar definition of permissible characters that covers most, but not all, of the Unicode/UCS character definitions. The sets used by HTML and XHTML/XML are slightly different, but these differences have little effect on the average document author.

Regardless of whether the document is HTML or XHTML, when stored on a file system or transmitted over a network, the document's characters are "encoded" as a sequence of bit octets ("bytes") according to a particular character encoding. This encoding may either be a Unicode Transformation Format, like UTF-8, that can directly encode any Unicode character, or a legacy encoding, like Windows-1252, that can't.
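The difference between the two kinds of encoding can be sketched in a few lines of Python. This is only an illustration: the sample strings and variable names are assumptions, and "cp1252" is Python's codec name for Windows-1252.

```python
# Sample text: "café" followed by the Chinese character U+53F6.
text = "caf\u00e9 \u53f6"

# A Unicode Transformation Format can encode every character in this text.
utf8_bytes = text.encode("utf-8")

# A legacy encoding covers only its own repertoire, so U+53F6 fails.
try:
    text.encode("cp1252")
    legacy_ok = True
except UnicodeEncodeError:
    legacy_ok = False

# The same character can map to different byte sequences per encoding.
e_acute_utf8 = "\u00e9".encode("utf-8")     # two bytes
e_acute_legacy = "\u00e9".encode("cp1252")  # one byte

print(utf8_bytes, legacy_ok, e_acute_utf8, e_acute_legacy)
```

The characters are the same in both cases; only their byte-level representation (or the ability to represent them at all) changes.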

Numeric character references

In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a numeric character reference: a sequence of characters that explicitly spells out the Unicode code point of the character being represented. A character reference takes the form &#N;, where N is either a decimal number for the Unicode code point, or a hexadecimal number, in which case it must be prefixed by x (giving &#xN;). The characters that make up a numeric character reference are representable in every encoding approved for use on the Internet.

For example, a Unicode code point like U+53F6, which corresponds to a particular Chinese character, has to be converted to a decimal number (21494), preceded by &# and followed by ;, like this: &#21494;, which produces this: 叶.

The support for hexadecimal in this context is more recent, so older browsers might have problems displaying characters referenced with hexadecimal numbers, but they will probably have a problem displaying Unicode characters above code point 255 anyway. To ensure better compatibility with older browsers, it is still a common practice to convert the hexadecimal code point into a decimal value (for example &#21494; instead of &#x53F6;).
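As a rough sketch of how numeric character references relate to code points, the following Python builds both forms for U+53F6 and decodes them again. The helper name to_ncr is hypothetical; html.unescape and the "xmlcharrefreplace" error handler are part of the Python standard library.

```python
import html

def to_ncr(ch, hexadecimal=False):
    """Return a numeric character reference for one character (illustrative helper)."""
    code_point = ord(ch)
    return f"&#x{code_point:X};" if hexadecimal else f"&#{code_point};"

decimal_ref = to_ncr("\u53f6")                # decimal form
hex_ref = to_ncr("\u53f6", hexadecimal=True)  # hexadecimal form, prefixed by x

# Both references decode to the same Unicode character.
decoded_decimal = html.unescape(decimal_ref)
decoded_hex = html.unescape(hex_ref)

# Encoding to a legacy repertoire can fall back to references automatically.
ascii_safe = "\u53f6".encode("ascii", "xmlcharrefreplace")

print(decimal_ref, hex_ref, ascii_safe)
```

Because the reference itself uses only ASCII characters, the resulting bytes survive in any ASCII-compatible encoding.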

Named character entities

In HTML there is a standard set of 252 named "character entities" for characters — some common, some obscure — that are either not found in certain character encodings or are markup sensitive in some contexts (for example angle brackets and quotation marks). Although any Unicode character can be referenced by its numeric code point, some HTML document authors prefer to use these named entities instead, where possible, as they are less cryptic and were better supported by early browsers.

Character entities can be included in an HTML document via the use of "entity references", which take the form &EntityName;, where EntityName is the name of the entity. For example, &mdash;, much like &#8212; or &#x2014;, represents U+2014: the em dash character — like this — even if the character encoding used doesn't contain that character.
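The equivalence of the three forms can be checked with Python's standard html module. This is a minimal sketch; the variable names are illustrative, and name2codepoint is the standard library's table of HTML 4 entity names.

```python
import html
from html.entities import name2codepoint

# Look up the code point behind the named entity "mdash".
mdash_code_point = name2codepoint["mdash"]

# Named, decimal, and hexadecimal references for the same character.
forms = ["&mdash;", "&#8212;", "&#x2014;"]
decoded = [html.unescape(form) for form in forms]

print(mdash_code_point, decoded)
```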

Character encoding determination

In order to correctly process HTML, a web browser must ascertain which Unicode characters are represented by the encoded form of an HTML document. To do this, the web browser must know what encoding was used. When a document is transmitted via a MIME message or a transport that uses MIME content types, such as an HTTP response, the message may signal the encoding via a Content-Type header, such as Content-Type: text/html; charset=ISO-8859-1. Other external means of declaring encoding are permitted but rarely used. The encoding may also be declared within the document itself, in the form of a META element, like <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">. An in-document declaration only works if the encoding is an extension of ASCII, like UTF-8, because the browser has to read the declaration before it knows which encoding is in use. When there is no encoding declaration, the default varies depending on the localisation of the browser.

For a system set up mainly for Western European languages, it will generally be ISO-8859-1 or its close relation Windows-1252. For a browser from a location where multibyte character encodings are the norm, some form of autodetection is likely to be applied.
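The order of checks described above might be sketched as follows. This is a simplified illustration, not how any particular browser works: the function names, the regular expression, the 1024-byte scan window, and the windows-1252 fallback are all assumptions chosen for the example. Message.get_content_charset is a real method of Python's email.message module.

```python
import re
from email.message import Message

def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent

# Works only because the declaration itself is readable as ASCII bytes.
META_RE = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
                     re.IGNORECASE)

def charset_from_meta(raw):
    """Look for a META declaration near the start of the raw document bytes."""
    match = META_RE.search(raw[:1024])
    return match.group(1).decode("ascii").lower() if match else None

def pick_encoding(content_type, raw, default="windows-1252"):
    """Prefer the external header, then the META element, then a locale default."""
    from_header = charset_from_header(content_type) if content_type else None
    return from_header or charset_from_meta(raw) or default

page = b'<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">'
print(pick_encoding("text/html; charset=UTF-8", page))  # header wins
print(pick_encoding("text/html", page))                 # falls back to META
print(pick_encoding(None, b"<p>no declaration</p>"))    # falls back to default
```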

Because of the legacy of 8-bit text representations in programming languages and operating systems, and the desire to avoid burdening users with the need to understand the nuances of encoding, many text editors used by HTML authors are unable or unwilling to offer a choice of encodings when saving files to disk, and often do not even allow input of characters beyond a very limited range. Consequently, many HTML authors are unaware of encoding issues and may not have any idea what encoding their documents actually use. It is also a common misunderstanding that the encoding declaration effects a change in the actual encoding, whereas it is actually just a label that could be inaccurate.
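A small Python sketch shows why an inaccurate label matters: the bytes on disk do not change, but decoding them with the wrong encoding yields the wrong characters. The sample word is an assumption chosen for illustration.

```python
# The file's actual bytes: "café" saved as UTF-8.
raw = "caf\u00e9".encode("utf-8")

# A correct label recovers the original text.
correct = raw.decode("utf-8")

# A wrong label (ISO-8859-1) does not convert anything; it just
# misreads the two UTF-8 bytes of "é" as two separate characters.
mislabeled = raw.decode("iso-8859-1")

print(correct, mislabeled)
```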

Many HTML documents are served with inaccurate encoding declarations, or no declarations at all. In order to determine the encoding in such cases, many browsers allow the user to manually select one from a list. They may also employ an encoding autodetection algorithm that works in concert with the manual override. The manual override may apply to all documents, or only those for which the encoding cannot be ascertained by looking at declarations and/or byte patterns. The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist. This has been addressed somewhat by XHTML, which, being XML, requires that encoding declarations be accurate and that no workarounds be employed when they're found to be inaccurate.
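The stricter XML behaviour can be observed with Python's standard XML parser, used here only as a stand-in for an XML-conforming processor. The document string and the helper name parses are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# A document whose declaration claims UTF-8.
document = '<?xml version="1.0" encoding="UTF-8"?><p>caf\u00e9</p>'

accurate = document.encode("utf-8")         # bytes match the declaration
inaccurate = document.encode("iso-8859-1")  # bytes contradict the declaration

def parses(data):
    """Return True if the XML parser accepts the bytes, False on a parse error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

print(parses(accurate), parses(inaccurate))
```

Rather than guessing at the intended encoding, the parser rejects the mislabeled document outright.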

Web browser support

Many browsers are only capable of displaying a small subset of the full Unicode repertoire. Here is how your browser displays various Unicode code points:

Some web browsers, such as Mozilla Firefox, Opera, and Safari, are able to display multilingual web pages by intelligently choosing a font to display each individual character on the page. They will correctly display any mix of Unicode blocks, as long as appropriate fonts are present in the operating system.

Internet Explorer version 6 for Windows is capable of displaying the full range of Unicode characters, but characters that are not present in the first available font specified in the web page will display only if they are present in the designated fallback font for the current international script [Microsoft (2006), "Globalization Step-by-Step: Fonts" (http://www.microsoft.com/globaldev/getwr/steps/wrg_font.mspx) at Microsoft Global Development and Computing Portal. URL retrieved on 2006-04-26.] (for example, only the Arial font will be considered for a block beginning with Latin text, or Arial Unicode MS if it is also installed; subsequent fonts specified in a list are ignored). [Girt By Net (2005), "Internet Explorer Makes Me ☹" (http://girtby.net/archives/2005/10/07/internet-explorer-makes-me/) at girtby.net. URL retrieved on 2006-04-26.] Otherwise, Internet Explorer will display placeholder squares. For characters not present in a web page's fonts, web page authors must guess which other appropriate fonts might be present on users' systems, and manually specify them as the preferred choices for each block or range of text containing such characters; Microsoft recommends using CSS to specify a font for each block of text in a different language or script. The characters in the table above haven't been assigned specific fonts, yet most should render correctly if appropriate fonts have been installed.

Older browsers, such as Netscape Navigator 4.77, can only display text supported by the current font associated with the character encoding of the page, and may misinterpret numeric character references as being references to code values within the current character encoding, rather than references to Unicode code points. When you are using such a browser, it is unlikely that your computer has all of those fonts, or that the browser can use all available fonts on the same page. As a result, the browser will not display the text in the examples above correctly, though it may display a subset of them. Because they are encoded according to the standard, though, they "will" display correctly on any system that is compliant and does have the characters available. Further, those characters given names for use in named entity references are likely to be more commonly available than others.

For displaying characters outside the Basic Multilingual Plane, like the Gothic letter faihu in the table above, some systems (like Windows 2000) need manual adjustments of their settings.
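For reference, a character is outside the Basic Multilingual Plane when its code point is above U+FFFF. The short Python sketch below checks this for the Gothic letter faihu (U+10346) and shows its hexadecimal character reference and UTF-8 bytes; the variable names are illustrative.

```python
import unicodedata

faihu = "\U00010346"
code_point = ord(faihu)

# Code points above U+FFFF lie outside the Basic Multilingual Plane.
outside_bmp = code_point > 0xFFFF

# The character can still be written as a numeric character reference.
reference = f"&#x{code_point:X};"

# In UTF-8, such characters take four bytes.
utf8_bytes = faihu.encode("utf-8")

print(unicodedata.name(faihu), outside_bmp, reference, utf8_bytes)
```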

Frequency of usage

According to internal data from Google's web index, in December 2007 the UTF-8 Unicode encoding became the most frequently used encoding on web pages, overtaking both ASCII (US) and 8859-1/1252 (Western European). [Mark Davis, "Moving to Unicode 5.1" (http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html), Official Google blog, 5 May 2008.]


See also

* Character encodings in HTML

External links

* [http://www.w3.org/TR/unicode-xml/ Unicode in XML and other Markup Languages] - a W3C & Unicode Consortium joint publication that describes issues and provides guidelines relating to Unicode in markup languages
* [http://www.w3.org/TR/REC-html40/HTMLlat1.ent Latin-1], [http://www.w3.org/TR/REC-html40/HTMLspecial.ent "Special"], and [http://www.w3.org/TR/REC-html40/HTMLsymbol.ent Mathematical, Greek and Symbolic] named character entity definitions for HTML 4.01
* [http://www.unicodemap.org/ UnicodeMap.org] - Browse Unicode characters, ranges, and other information
* [http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id= SIL's freeware fonts, editors and documentation]
* [http://www.alanwood.net/unicode/ Alan Wood’s Unicode Resources] - Unicode fonts and information (www.alanwood.net/unicode).
* [http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm The International Phonetic Alphabet in Unicode]
* [http://www.alanwood.net/unicode/cjk_compatibility_ideographs.html CJK Compatibility Ideographs]
* [http://www.unicode.org/charts/ Unicode character charts] - hexadecimal numbers only; PDF files showing all characters independent of browser capabilities
* [http://unicode.coeurlumiere.com/ Table of Unicode characters from 1 to 65535] - shows how they look in one's browser
* [http://www.pinyin.info/tools/converter/chars2uninumbers.html Web tool that converts "special" characters (such as Chinese characters) to Unicode numeric character references]
* [http://www.hotpeachpages.net/a/characters.html Multi-lingual web pages and Unicode] - how to fix display problems

Wikimedia Foundation. 2010.
