Nearest neighbor search

Nearest neighbor search: Nearest neighbor search (NNS), also known as proximity search, similarity search or closest point search, is an optimization problem for finding closest points in metric spaces. The problem is: given a set S of points in a metric space M and a query point q ∈ M, find the closest point in S to q. In many cases, M is taken to be d-dimensional Euclidean space and distance is measured by Euclidean distance or Manhattan distance.

Donald Knuth in vol. 3 of The Art of Computer Programming (1973) called it the post-office problem, referring to an application of assigning to a residence the nearest post office.

Contents

1 Applications

2 Methods

2.1 Linear search

2.2 Space partitioning

2.3 Locality sensitive hashing

2.4 Nearest neighbor search in spaces with small intrinsic dimension

2.5 Vector Approximation Files

2.6 Compression/Clustering Based Search

3 Variants

3.1 k-nearest neighbor

3.2 Approximate nearest neighbor

3.3 Nearest neighbor distance ratio

3.4 All nearest neighbors

4 See also

5 Notes

6 References

7 Further reading

8 External links

Applications

The nearest neighbor search problem arises in numerous fields of application, including:

Pattern recognition - in particular for optical character recognition

Statistical classification- see k-nearest neighbor algorithm

Computer vision

Databases - e.g. content-based image retrieval

Coding theory - see maximum likelihood decoding

Data compression - see MPEG-2 standard

Recommendation systems

Internet marketing - see contextual advertising and behavioral targeting

DNA sequencing

Spell checking - suggesting correct spelling

Plagiarism detection

Contact searching algorithms in FEA

Similarity scores for predicting career paths of professional athletes.

Cluster analysis - assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense, usually based on Euclidean distance

Methods

Various solutions to the NNS problem have been proposed. The quality and usefulness of the algorithms are determined by the time complexity of queries as well as the space complexity of any search data structures that must be maintained. The informal observation usually referred to as the curse of dimensionality states that there is no general-purpose exact solution for NNS in high-dimensional Euclidean space using polynomial preprocessing and polylogarithmic search time.

Linear search

The simplest solution to the NNS problem is to compute the distance from the query point to every other point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, has a running time of O(Nd) where N is the cardinality of S and d is the dimensionality of M. There are no search data structures to maintain, so linear search has no space complexity beyond the storage of the database. Surprisingly, naive search outperforms space partitioning approaches on higher dimensional spaces.^[1]

Space partitioning

Since the 1970s, branch and bound methodology has been applied to the problem. In the case of Euclidean space this approach is known as spatial index or spatial access methods. Several space-partitioning methods have been developed for solving the NNS problem. Perhaps the simplest is the k-d tree, which iteratively bisects the search space into two regions containing half of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. Depending on the distance specified in the query, neighboring branches that might contain hits may also need to be evaluated. For constant dimension query time, average complexity is O(log N) ^[2] in the case of randomly distributed points, worst case complexity analyses have been performed.^[3] Alternatively the R-tree data structure was designed to support nearest neighbor search in dynamic context, as it has efficient algorithms for insertions and deletions.

In case of general metric space branch and bound approach is known under the name of metric trees. Particular examples include vp-tree and BK-tree.

Using a set of points taken from a 3-dimensional space and put into a BSP tree, and given a query point taken from the same space, a possible solution to the problem of finding the nearest point-cloud point to the query point is given in the following description of an algorithm. (Strictly speaking, no such point may exist, because it may not be unique. But in practice, usually we only care about finding any one of the subset of all point-cloud points that exist at the shortest distance to a given query point.) The idea is, for each branching of the tree, guess that the closest point in the cloud resides in the half-space containing the query point. This may not be the case, but it is a good heuristic. After having recursively gone through all the trouble of solving the problem for the guessed half-space, now compare the distance returned by this result with the shortest distance from the query point to the partitioning plane. This latter distance is that between the query point and the closest possible point that could exist in the half-space not searched. If this distance is greater than that returned in the earlier result, then clearly there is no need to search the other half-space. If there is such a need, then you must go through the trouble of solving the problem for the other half space, and then compare its result to the former result, and then return the proper result. The performance of this algorithm is nearer to logarithmic time than linear time when the query point is near the cloud, because as the distance between the query point and the closest point-cloud point nears zero, the algorithm needs only perform a look-up using the query point as a key to get the correct result.

Locality sensitive hashing

Locality sensitive hashing (LSH) is a technique for grouping points in space into 'buckets' based on some distance metric operating on the points. Points that are close to each other under the chosen metric are mapped to the same bucket with high probability.

Nearest neighbor search in spaces with small intrinsic dimension

The cover tree has a theoretical bound that is based on the dataset's doubling constant. The bound on search time is O(c¹² log n) where c is the expansion constant of the dataset.

Vector Approximation Files

In high dimensional spaces tree indexing structures become useless because an increasing percentage of the nodes need to be examined anyway. To speed up linear search, a compressed version of the feature vectors stored in RAM is used to prefilter the datasets in a first run. The final candidates are determined in a second stage using the uncompressed data from the disk for distance calculation.^[4]

Compression/Clustering Based Search

The VA-File approach is a special case of a compression based search, where each feature component is compressed uniformly and independently. The optimal compression technique in multidimensional spaces is Vector Quantization (VQ), implemented through clustering. The database is clustered and the most "promising" clusters are retrieved. Huge-gains over VA-File, tree-based indexes and sequential scan have been observed.^[5]^[6] Also note the parallels between clustering and LSH.

Variants

There are numerous variants of the NNS problem and the two most well-known are the k-nearest neighbor search and the ε-approximate nearest neighbor search.

k-nearest neighbor

Main article: k-nearest neighbor algorithm

k-nearest neighbor search identifies the top k nearest neighbors to the query. This technique is commonly used in predictive analytics to estimate or classify a point based on the consensus of its neighbors. k-nearest neighbor graphs are graphs in which every point is connected to its k nearest neighbors.

Approximate nearest neighbor

In some applications it may be acceptable to retrieve a "good guess" of the nearest neighbor. In those cases, we can use an algorithm which doesn't guarantee to return the actual nearest neighbor in every case, in return for improved speed or memory savings. Often such an algorithm will find the nearest neighbor in a majority of cases, but this depends strongly on the dataset being queried.

Algorithms that support the approximate nearest neighbor search include locality-sensitive hashing, best bin first and balanced box-decomposition tree based search.^[7]

ε-approximate nearest neighbor search is becoming an increasingly popular tool for fighting the curse of dimensionality.^{[citation needed]}

Nearest neighbor distance ratio

Nearest neighbor distance ratio do not apply the threshold on the direct distance from the original point to the challenger neighbor but on a ratio of it depending on the distance to the previous neighbor. It is used in CBIR to retrieve pictures through a "query by example" using the similarity between local features. More generally it is involved in several matching problems.

All nearest neighbors

For some applications (e.g. entropy estimation), we may have N data-points and wish to know which is the nearest neighbor for every one of those N points. This could of course be achieved by running a nearest-neighbor search once for every point, but an improved strategy would be an algorithm that exploits the information redundancy between these N queries to produce a more efficient search. As a simple example: when we find the distance from point X to point Y, that also tells us the distance from point Y to point X, so the same calculation can be reused in two different queries.

Given a fixed dimension, a semi-definite positive norm (thereby including every L^p norm), and n points in this space, the m nearest neighbours of every point can be found in O(mn log n) time.^[8]

See also

Ball tree

Cluster analysis

Content-based image retrieval

Curse of dimensionality

Digital signal processing

Dimension reduction

Fixed-radius near neighbors

Fourier Analysis

k-nearest neighbor algorithm

Linear least squares

Locality sensitive hashing

Multidimensional analysis

Nearest-neighbor interpolation

Principal Component Analysis

Singular value decomposition

Time series

Voronoi diagram

Wavelet

Notes

^ Weber, Schek, Blott. "A quantitative analysis and performance study for similarity search methods in high dimensional spaces". http://www.vldb.org/conf/1998/p194.pdf.

^ Andrew Moore. "An introductory tutorial on KD trees". http://www.autonlab.com/autonweb/14665/version/2/part/5/data/moore-tutorial.pdf?branch=main&language=en.

^ Lee, D. T.; Wong, C. K. (1977). "Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees". Acta Informatica 9 (1): 23–29. doi:10.1007/BF00263763.

^ Weber, Blott. "An Approximation-Based Data Structure for Similarity Search".

^ Ramaswamy, Rose, ICIP 2007. "Adaptive cluster-distance bounding for similarity search in image databases".

^ Ramaswamy, Rose, TKDE 2001. "Adaptive cluster-distance bounding for high-dimensional indexing".

^ S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman and A. Wu, An optimal algorithm for approximate nearest neighbor searching, Journal of the ACM, 45(6):891-923, 1998. [1]

^ Vaidya, P. M. (1989). "An O(n log n) Algorithm for the All-Nearest-Neighbors Problem". Discrete and Computational Geometry 4 (1): 101–115. doi:10.1007/BF02187718. http://www.springerlink.com/content/p4mk2608787r7281/?p=09da9252d36e4a1b8396833710ef08cc&pi=8.

References

Andrews, L.. A template for the nearest neighbor problem. C/C++ Users Journal, vol. 19 , no 11 (November 2001), 40 - 49, 2001, ISSN:1075-2838, www.ddj.com/architect/184401449

Arya, S., D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions. Journal of the ACM, vol. 45, no. 6, pp. 891–923

Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. 1999. When is nearest neighbor meaningful? In Proceedings of the 7th ICDT, Jerusalem, Israel.

Chung-Min Chen and Yibei Ling - A Sampling-Based Estimator for Top-k Query. ICDE 2002: 617-627

Samet, H. 2006. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann. ISBN 0123694469

Zezula, P., Amato, G., Dohnal, V., and Batko, M. Similarity Search - The Metric Space Approach. Springer, 2006. ISBN 0-387-29146-6

Further reading

Shasha, Dennis (2004). High Performance Discovery in Time Series. Berlin: Springer. ISBN 0387008578.

External links

Nearest Neighbors and Similarity Search - a website dedicated to educational materials, software, literature, researchers, open problems and events related to NN searching. Maintained by Yury Lifshits.

Similarity Search Wiki a collection of links, people, ideas, keywords, papers, slides, code and data sets on nearest neighbours.

Metric Spaces Library - An open-source C-based library for metric space indexing by Karina Figueroa, Gonzalo Navarro, Edgar Chávez.

ANN - A Library for Approximate Nearest Neighbor Searching by David M. Mount and Sunil Arya.

FLANN - Fast Approximate Nearest Neighbor search library by Marius Muja and David G. Lowe

STANN - A Simple Threaded Approximate Nearest Neighbor Search Library in C++ by Michael Connor and Piyush Kumar.

MESSIF - Metric Similarity Search Implementation Framework by Michal Batko and David Novak.

OBSearch - Similarity Search engine for Java (GPL). Implementation by Arnoldo Muller, developed during Google Summer of Code 2007.

KNNLSB - K Nearest Neighbors Linear Scan Baseline (distributed, LGPL). Implementation by Georges Quénot (LIG-CNRS).

NearTree - An API for finding nearest neighbors among points in spaces of arbitrary dimensions by Lawrence C. Andrews and Herbert J. Bernstein.

Categories:
Approximation algorithms
Classification algorithms
Data mining
Discrete geometry
Geometric algorithms
Image processing
Information retrieval
Machine learning
Numerical analysis
Mathematical optimization
Searching
Search algorithms

Игры ⚽ Поможем написать реферат

Look at other dictionaries:

Nearest neighbor — may refer to: Nearest neighbor search in pattern recognition and in computational geometry Nearest neighbor interpolation for interpolating data Nearest neighbor graph in geometry The k nearest neighbor algorithm in machine learning, an… … Wikipedia
Nearest-neighbor interpolation — (blue lines) in one dimension on a (uniform) dataset (red points) … Wikipedia
Nearest-neighbor chain algorithm — In the theory of cluster analysis, the nearest neighbor chain algorithm is a method that can be used to perform several types of agglomerative hierarchical clustering, using an amount of memory that is linear in the number of points to be… … Wikipedia
k-nearest neighbor algorithm — KNN redirects here. For other uses, see KNN (disambiguation). In pattern recognition, the k nearest neighbor algorithm (k NN) is a method for classifying objects based on closest training examples in the feature space. k NN is a type of instance… … Wikipedia
K-nearest neighbor algorithm — In pattern recognition, the k nearest neighbor algorithm ( k NN) is a method for classifying objects based on closest training examples in the feature space. k NN is a type of instance based learning, or lazy learning where the function is only… … Wikipedia
Pattern search — may refer to: * Pattern recognition * Pattern mining * String searching algorithm * Fuzzy string searching * Bitap algorithm * K optimal pattern discovery * Nearest neighbor search (Nearest neighbor) * Eyeball search … Wikipedia
Tabu search — is a mathematical optimization method, belonging to the class of local search techniques. Tabu search enhances the performance of a local search method by using memory structures: once a potential solution has been determined, it is marked as… … Wikipedia
Breadth-first search — Infobox Algorithm class=Search Algorithm Order in which the nodes are expanded data=Graph time=O(|V|+|E|) = O(b^d) space=O(|V|+|E|) = O(b^d) optimal=yes (for unweighted graphs) complete=yesIn graph theory, breadth first search (BFS) is a graph… … Wikipedia
A* search algorithm — In computer science, A* (pronounced A star ) is a best first, graph search algorithm that finds the least cost path from a given initial node to one goal node (out of one or more possible goals). It uses a distance plus cost heuristic function… … Wikipedia
Scale-invariant feature transform — Feature detection Output of a typical corner detection algorithm … Wikipedia

Academic Dictionaries and Encyclopedias

Nearest neighbor search

Contents

Applications

Methods

Linear search

Space partitioning

Locality sensitive hashing

Nearest neighbor search in spaces with small intrinsic dimension

Vector Approximation Files

Compression/Clustering Based Search

Variants

k-nearest neighbor

Approximate nearest neighbor

Nearest neighbor distance ratio

All nearest neighbors

See also

Notes

References

Further reading

External links

Look at other dictionaries:

Share the article and excerpts

Academic Dictionaries and Encyclopedias

Wikipedia

Nearest neighbor search

Contents

Applications

Methods

Linear search

Space partitioning

Locality sensitive hashing

Nearest neighbor search in spaces with small intrinsic dimension

Vector Approximation Files

Compression/Clustering Based Search

Variants

k-nearest neighbor

Approximate nearest neighbor

Nearest neighbor distance ratio

All nearest neighbors

See also

Notes

References

Further reading

External links

Look at other dictionaries:

Share the article and excerpts

Direct link