Fingerprint (computing)

Fingerprint (computing)

In computer science, a fingerprinting algorithm is a procedure that maps an arbitrarily large data item (such as a computer file) to a much shorter bit string, its fingerprint, that uniquely identifies the original data for all practical purposesA. Z. Broder. Some applications of Rabin's fingerprinting method. In Sequences II: Methods in Communications, Security, and Computer Science, pages 143--152. Springer-Verlag, 1993] .

Fingerprints are typically used to avoid the comparison and transmission of bulky data. For instance, a web browser or proxy server can efficiently check whether a remote file has been modified, by fetching only its fingerprint and comparing it with that of the previously fetched copyDetecting duplicate and near-duplicate files. US Patent 6658423 Issued on December 2 2003] A. Z. Broder, "On the Resemblance and Containment of Documents," Proceedings of Compression and Complexity of Sequences 1997, pp. 21-27, IEEE Computer Society (1988)] S. Brin et al., "Copy Detection Mechanisms for Digital Documents," Proceedings of the ACM SIGMOD Annual Conference, San Jose 1995 (May 1995)] L. Fan, P. Cao, J. Almeida and A. Broder, Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol, IEEE/ACM Transactions on Networking, vol. 8, No. 3 (2000)] U. Manber, Finding Similar Files in a Large File System. Proceedings of the USENIX Winter Technical Conf. (1994)]

Fingerprint functions are related to (and sometimes confused with) checksums, hash functions, cryptographic hash functions, and digital signatures. While these concepts overlap to some extent, each has distinct uses and properties. For example, audio fingerprint algorithms are more properly called hash functions than fingerprints, since they are designed with different goals.

Fingerprint properties

Virtual uniqueness

To serve its intended purposes, a fingerprinting algorithm must be able to capture the identity of a file with virtual certainty. In other words, the probability of a collision --- two files yielding the same fingerprint --- must be negligible, compared to the probability of other unavoidable causes of fatal errors (such as the system being destroyed by war or by a meteorite): say, 10-20 or less.

This requirement is somewhat similar to that of a checksum function, but is much more stringent. To detect accidental data corruption or transmission errors, it is sufficient that the checksums of the original file and any corrupted version will differ with near certainty, given some statistical model for the errors. In typical situations, this goal is easily achieved with 16- or 32-bit checksums. In contrast, file fingerprints need to be at least 64-bit long to guarantee virtual uniqueness in large file systems (see birthday attack).

When proving the above requirement, one must take into account that files are generated by highly non-random processes that create complicated dependencies among files. For instance, in a typical business network, one usually finds many pairs or clusters of documents that differ only by minor edits or other slight modifications. A good fingerprinting algorithm must ensure that such "natural" processes generate distinct fingerprints, with the desired level of certainty.

Compounding

Computer files are often combined in various ways, such as concatenation (as in archive files) or symbolic inclusion (as with the C preprocessor's #include directive). Some fingerprinting algorithms allow the fingerprint of a composite file to be computed from the fingerprints of its constituent parts. This "compounding" property may be useful in some applications, such as detecting when a program needs to be recompiled.

Fingerprinting algorithms

Rabin's algorithm

Rabin's fingerprinting algorithm M. O. Rabin Fingerprinting by random polynomials. Center for Research in Computing Technology Harvard University Report TR-15-81 (1981)] is the prototype of the class. It is fast and easy to implement, allows compounding, and comes with a mathematically precise analysis of the probability of collision. Namely, the probability of two strings "r" and "s" yielding the same "w"-bit fingerprint does not exceed max(|"r"|,|"s"|)/2"w"-1, where |"r"| denotes the length of "r" in bits. The algorithm requires the previous choice of a "w"-bit internal "key", and this guarantee holds as long as the strings "r" and "s" are chosen without knowledge of the key.

Rabin's method is not secure against malicious attacks. An adversary agent can easily discover the key and use it to modify files without changing their fingerprint.

Cryptographic hash functions

Cryptographic grade hash functions generally serve as good fingerprint functions, with the advantage that they are believed to be safe against malicious attacks.

However, cryptographic hash algorithms such as MD5 and SHA are considerably more expensive than Rabin's fingerprints, and lack proven guarantees on the probability of collision. Some of them, notably MD5 are no longer recommended for secure fingerprinting. However they still may be useful as an error checking mechanism, where purposeful data tampering isn't a primary concern.

Application examples

NIST distributes a software reference library, the American National Software Reference Library, that uses cryptographic hash functions to fingerprint files and map them to software products. The HashKeeper database, maintained by the National Drug Intelligence Center, is a repository of fingerprints of "known to be good" and "known to be bad" computer files, for use in law enforcement applications (e.g. analyzing the contents of seized disk drives).

See also

* Error correcting code
* Randomizing function

References


Wikimedia Foundation. 2010.

Игры ⚽ Поможем сделать НИР

Look at other dictionaries:

  • Fingerprint (disambiguation) — Fingerprint is usually a human fingerprint made by the pattern of ridges on the pad of a human finger. Other types of fingerprints include,* Genetic fingerprint, distinguishing two individuals of the same species using only samples of their DNA * …   Wikipedia

  • Fingerprint Verification Competition — (FVC) is an international competition focused on fingerprint verification software assessment. A subset of fingerprint impressions acquired with various sensors was provided to registered participants, to allow them to adjust the parameters of… …   Wikipedia

  • Fingerprint — This article is about human fingerprints. For other uses, see Fingerprint (disambiguation) …   Wikipedia

  • Device fingerprint — A device fingerprint (or machine fingerprint) is a compact summary of software and hardware settings collected from a remote computing device. Basic web browser configuration information has long been collected by web analytics services in an… …   Wikipedia

  • Digital fingerprint — may refer to: Message digest The output of a one way function when applied to a stream of data. Device fingerprint A compact summary of software and hardware settings collected from a remote computing device. This disambiguation page lists… …   Wikipedia

  • Header (computing) — In information technology, header refers to supplemental data placed at the beginning of a block of data being stored or transmitted. In data transmission, the data following the header are sometimes called the payload or body .It is vital that… …   Wikipedia

  • Rootkit — A rootkit is software that enables continued privileged access to a computer while actively hiding its presence from administrators by subverting standard operating system functionality or other applications. The term rootkit is a concatenation… …   Wikipedia

  • Cellular neural network — Cellular neural networks (CNN) are a parallel computing paradigm similar to neural networks, with the difference that communication is allowed between neighbouring units only. Typical applications include image processing, analyzing 3D surfaces,… …   Wikipedia

  • police — /peuh lees /, n., v., policed, policing. n. 1. Also called police force. an organized civil force for maintaining order, preventing and detecting crime, and enforcing the laws. 2. (used with a pl. v.) members of such a force: Several police are… …   Universalium

  • Wristwatch computer — The Fossil Wrist PDA, which runs Palm OS. A wristwatch computer is a wearable computer that fits like a wristwatch. It may offer features similar to a PDA, palmtop or tablet computer. Similar terms which refer to the same concept are wrist… …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”