Charset detection
Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. The technique usually involves statistical analysis of byte patterns, for example comparing the input against frequency distributions of trigraphs (three-character sequences) for various languages, as encoded in each code page to be detected. The process is not foolproof because it depends on statistical data; for example, some versions of the Windows operating system would mis-detect the ASCII phrase "Bush hid the facts" as Chinese text in UTF-16LE.
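The "Bush hid the facts" case can be reproduced by hand. The following Python sketch does not reimplement the Windows heuristic; it only reinterprets the ASCII bytes as UTF-16LE to show why the wrong guess looked plausible: every pair of ASCII bytes forms a code unit in the CJK Unified Ideographs block (U+4E00 to U+9FFF). The helper names `reinterpret_as_utf16le` and `is_cjk_ideograph` are hypothetical and chosen for illustration.

```python
def reinterpret_as_utf16le(text: str) -> str:
    """Encode text as ASCII, then decode the very same bytes as UTF-16LE."""
    raw = text.encode("ascii")
    # UTF-16LE consumes two bytes per code unit, so the byte count must be even.
    return raw.decode("utf-16-le")


def is_cjk_ideograph(ch: str) -> bool:
    """True if ch lies in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


misread = reinterpret_as_utf16le("Bush hid the facts")
# 18 ASCII bytes become 9 code units, all of them CJK ideographs.
print(len(misread), all(is_cjk_ideograph(ch) for ch in misread))  # prints: 9 True
```

A detector that only checks whether the bytes form plausible UTF-16 text has no way to tell this input apart from real Chinese text, which is why short inputs are especially prone to misdetection.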
One of the few cases where charset detection works reliably is detecting UTF-8. This is because a large percentage of possible byte sequences are invalid in UTF-8, so text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test. Unfortunately, badly written charset detection routines do not run the reliable UTF-8 test first, and may decide that UTF-8 text is in some other encoding.
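A minimal sketch of the "UTF-8 first" ordering in Python follows. It relies on the strict UTF-8 decoder built into Python as the validity test. The function names and the `fallback` parameter are assumptions made for illustration; the fallback stands in for whatever statistical detector would run next.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def guess_encoding(data: bytes, fallback: str = "windows-1252") -> str:
    """Run the reliable UTF-8 test before any heuristic guess.

    `fallback` is a placeholder for the result of a statistical detector.
    """
    return "utf-8" if is_valid_utf8(data) else fallback


print(guess_encoding("café".encode("utf-8")))    # valid UTF-8
print(guess_encoding("café".encode("latin-1")))  # lone 0xE9 byte is invalid UTF-8
```

Pure ASCII input also passes the test, which is harmless because ASCII is a subset of UTF-8.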
Because charset detection is unreliable, it is usually better to label data with its correct encoding. For example, HTML documents can declare their encoding in a meta element:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
Alternatively, when documents are conveyed through HTTP, the same metadata can be conveyed out-of-band using the Content-Type header.
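When such a label is present, a consumer can read it instead of guessing. The Python sketch below extracts the charset parameter from a Content-Type value using the standard library's `email.message.Message`, which parses MIME-style headers. The function name `charset_from_content_type` and the UTF-8 default used in the usage lines are assumptions for illustration.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


declared = charset_from_content_type("text/html;charset=UTF-8")
body = "naïve".encode("utf-8")
# Fall back to UTF-8 only when no charset was declared (an assumed default).
text = body.decode(declared or "utf-8")
print(declared, text)
```

If the header carries no charset parameter, the function returns `None`, and only then would a detection heuristic be needed.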
- International Components for Unicode - A library that can perform charset detection.