Null-terminated string

In computer programming, a null-terminated string is a character string stored as an array containing the characters and terminated with a null character ('\0', called NUL in ASCII). Alternative names are C string, which refers to the C programming language, and ASCIIZ (although C strings do not imply the use of ASCII).

The length of a C string is found by searching for the (first) NUL byte. This can be slow as it takes O(n) (linear time) with respect to the string length. It also means that a NUL cannot be inside the string, as the only NUL is the one marking the end.
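As a minimal sketch of the linear scan described above, the following hypothetical helper scan_length (not a standard library function) counts bytes until it meets the first NUL, which is essentially what strlen has to do:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the bytes that precede the first NUL.
 * The loop touches every character once, hence O(n). */
size_t scan_length(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}
```

Given the array "ab\0cd", which occupies six bytes including the final terminator, scan_length reports 2 because the scan stops at the first NUL.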

History

Null-terminated strings were produced by the .ASCIZ directive of the PDP-11 macro assembly languages and the ASCIZ directive of the MACRO-10 macro assembly language for the PDP-10. These predate the development of the C programming language, but other forms of strings were often used.

At the time C (and the languages that it was derived from) was developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (though also used by early versions of BASIC), used a leading byte to store the length of the string. This allowed the string to contain NUL and meant that finding the length needed only one memory access (O(1), constant time). But one byte limits the length to 255. This length limitation was far more restrictive than the problems with the C string, so the C string in general won out.[citation needed]

This had some influence on CPU instruction set design. Some CPUs in the 1970s and 1980s, such as the Zilog Z80 and the DEC VAX, had dedicated instructions for handling length-prefixed strings. However, as the NUL-terminated string gained traction, CPU designers began to take it into account, seen for example in IBM's decision to add the "Logical String Assist" instructions to the ES/9000 520 in 1992.

FreeBSD developer Poul-Henning Kamp, writing in ACM Queue, would later refer to the victory of the C string over use of a 2-byte length as "the most expensive one-byte mistake" ever.[1] However, there are doubts that lengths longer than one byte were ever seriously considered.

Implementations

The C programming language supports null-terminated strings as the primary string type.[2] The C standard library provides many functions for handling them.

Limitations

While simple to implement, this representation has been prone to errors and performance problems.

The NUL termination has historically created security problems.[3] A NUL byte inserted into the middle of a string will truncate it unexpectedly. A common bug was to not allocate the additional space for the NUL. Another was to not write the NUL at the end of a string, often not detected because there often happened to be a NUL already there. Due to the expense of finding the length, many programs did not bother before copying a string to a fixed-size buffer, causing a buffer overflow if it was too long.
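One way to avoid the overflow described above is to measure the source before copying and to count the terminator when checking for space. The sketch below assumes a hypothetical helper named copy_if_fits; it is an illustration, not a standard function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst only when the characters
 * AND the terminating NUL fit. dst is left untouched otherwise. */
bool copy_if_fits(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t len = strlen(src);      /* O(n) scan for the terminator */
    if (len + 1 > dst_size) {      /* +1 reserves room for the NUL */
        return false;
    }
    memcpy(dst, src, len + 1);     /* copies the NUL as well */
    return true;
}
```

With a 6-byte buffer, "hello" (5 characters plus the NUL) is accepted, while "hello!" is rejected instead of writing past the end of the buffer.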

The inability to store a NUL requires that string data and binary data be kept distinct and handled by different functions (with the latter requiring the length of the data to also be supplied). This can lead to code redundancy and errors when the wrong function is used.
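The difference between the two families of functions can be seen by comparing the same buffers both ways. In this sketch, same_as_strings and same_bytes are hypothetical wrapper names around the standard strcmp and memcmp:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* String comparison: stops at the first NUL in either argument. */
bool same_as_strings(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}

/* Binary comparison: the caller must supply the length explicitly. */
bool same_bytes(const void *a, const void *b, size_t size)
{
    assert(a != NULL && b != NULL);
    return memcmp(a, b, size) == 0;
}
```

For the buffers "ab\0cd" and "ab\0xy", same_as_strings reports equality because both look like "ab", while same_bytes over all six bytes reports a difference. Using the string function on binary data would silently ignore everything after the embedded NUL.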

The speed problems with finding the length can usually be mitigated by combining it with another operation that is O(n) anyway, such as in strlcpy. However, this does not always result in an intuitive API.
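Because strlcpy is not part of the C standard, the following sketch uses a hypothetical function copy_and_measure that imitates its commonly documented behaviour: it copies as much as fits, always terminates the destination when it has any room, and returns the full length of the source from the same pass.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strlcpy-style helper. Copies at most dst_size - 1 bytes,
 * terminates dst if dst_size > 0, and returns the length of src.
 * A return value >= dst_size signals that the copy was truncated. */
size_t copy_and_measure(char *dst, const char *src, size_t dst_size)
{
    assert(src != NULL);
    assert(dst != NULL || dst_size == 0);
    size_t i = 0;
    while (src[i] != '\0') {
        if (i + 1 < dst_size) {    /* keep one byte free for the NUL */
            dst[i] = src[i];
        }
        i++;
    }
    if (dst_size > 0) {
        dst[i < dst_size ? i : dst_size - 1] = '\0';
    }
    return i;
}
```

The less intuitive part of such an interface is the return value: the caller detects truncation by comparing it with the buffer size rather than by checking a success flag.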

Character encodings

Null-terminated strings require that the encoding does not use the 0x00 byte for any character. This allows any ASCII extension to be used. But it does not allow UTF-16 in byte strings, and therefore not directly in source code, since UTF-16 uses 2-byte integers and, for example, stores a space as 0x0020. Equivalent null-terminated strings made of wchar_t can be used, ending with a 0-valued wchar_t.
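A small sketch of both points follows. The helper names narrow_view_length and wide_length are hypothetical; the first shows what a byte-string function sees when handed UTF-16 code units, and the second scans a wide string for a 0-valued wchar_t, much as the standard wcslen does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Treats raw bytes as a C string, so it stops at the first 0x00 byte. */
size_t narrow_view_length(const void *bytes)
{
    assert(bytes != NULL);
    return strlen((const char *)bytes);
}

/* Counts wide characters up to the terminating 0-valued wchar_t. */
size_t wide_length(const wchar_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != L'\0') {
        n++;
    }
    return n;
}
```

For the little-endian UTF-16 bytes 0x61 0x00 0x20 0x00 0x00 0x00 (the text "a " plus a terminator), narrow_view_length returns 1, since the high byte of the first code unit already looks like a terminator. The wide string L"a b" has a wide_length of 3.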

UTF-8 works well, however, because it uses the 0x00 byte only for the NUL character itself. With a UTF-8 editor, one can write:

char foo[512] = "φωωβαρ";

Truncating strings using functions like strncpy or strncat can produce invalid UTF-8 characters at the end. This can be unsafe if the truncated parts are not concatenated again before being interpreted by code that assumes the input is valid.
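One possible way to avoid cutting a character in half is to back up from the byte limit until the cut no longer falls on a UTF-8 continuation byte (a byte of the form 10xxxxxx). The function utf8_truncate below is a hypothetical sketch that assumes its input is already valid UTF-8:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: shortens s in place to at most max_bytes bytes
 * without splitting a multi-byte sequence. Returns the new length.
 * Assumes s holds valid UTF-8. */
size_t utf8_truncate(char *s, size_t max_bytes)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (len <= max_bytes) {
        return len;                /* already short enough */
    }
    size_t cut = max_bytes;
    /* A continuation byte at the cut means the cut is mid-character. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80) {
        cut--;
    }
    s[cut] = '\0';
    return cut;
}
```

For the string "aφz", where φ is encoded as the two bytes 0xCF 0x86, a limit of 2 bytes yields "a" rather than "a" followed by a stray lead byte, and a limit of 3 bytes yields "aφ".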

Improvements

Many attempts have been made to make C string handling less error-prone. One strategy is to add safer and more useful functions such as strdup and strlcpy, while deprecating the use of unsafe functions such as gets. Another is to add an object-oriented wrapper around C strings so that only safe calls can be made.

On modern systems memory usage is less of a concern, so a multi-byte length is acceptable (a program with so many small strings that the space used by this length is a concern will have enough duplicates that a hash table will use even less memory). Most replacements for C strings use a 32-bit or larger length value. Examples include the C++ Standard Library std::string, the Qt QString, and the MFC CString. More complex structures, such as the rope, may also be used to store strings.
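The idea of storing a multi-byte length next to the characters can be sketched in C as follows. The type counted_string and its functions are hypothetical and far simpler than the library classes named above; the sketch only shows that the length becomes an O(1) field lookup and that embedded NUL bytes survive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: the length is stored, not searched for. */
typedef struct {
    size_t len;   /* number of bytes in data, excluding the extra NUL */
    char *data;   /* len bytes followed by one NUL for C interoperability */
} counted_string;

/* Copies len bytes; returns {0, NULL} if allocation fails. */
counted_string counted_from_bytes(const void *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);
    counted_string s = { 0, NULL };
    char *copy = malloc(len + 1);
    if (copy == NULL) {
        return s;
    }
    if (len > 0) {
        memcpy(copy, bytes, len);
    }
    copy[len] = '\0';
    s.len = len;
    s.data = copy;
    return s;
}

void counted_free(counted_string *s)
{
    assert(s != NULL);
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```

Building a counted_string from the five bytes of "ab\0cd" gives a len of 5, whereas strlen on the same data still stops at 2.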

References

  1. ^ Kamp, Poul-Henning (25 July 2011), "The Most Expensive One-byte Mistake", ACM Queue 9 (7), ISSN 1542-7730, http://queue.acm.org/detail.cfm?id=2010365, retrieved 2 August 2011
  2. ^ Ritchie, Dennis (2003). "The Development of the C Language". http://cm.bell-labs.com/cm/cs/who/dmr/chist.html. Retrieved 9 November 2011.
  3. ^ Issue 55, article 7

Wikimedia Foundation. 2010.
