 OPTICS algorithm

OPTICS ("Ordering Points To Identify the Clustering Structure") is an algorithm for finding densitybased clusters in spatial data. It was presented by Mihael Ankerst, Markus M. Breunig, HansPeter Kriegel and Jörg Sander^{[1]}. Its basic idea is similar to DBSCAN,^{[2]} but it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density. In order to do so, the points of the database are (linearly) ordered such that points which are spatially closest become neighbors in the ordering. Additionally, a special distance is stored for each point that represents the density that needs to be accepted for a cluster in order to have both points belong to the same cluster. This is represented as a dendrogram.
Contents
Basic idea
Like DBSCAN, OPTICS requires two parameters: ε, which describes the maximum distance (radius) to consider, and MinPts, describing the number of points required to form a cluster. A point p is a core point if at least MinPts points are found within its εneighborhood N_{ε}(p). Contrary to DBSCAN, OPTICS also considers points that are part of a more densely packed cluster, so each point is assigned a core distance that basically describes the distance to its MinPtsth point:
The reachabilitydistance of a point p from another point o is the distance between p and o, or the core distance of o:
Basically, if o and p are nearest neighbors, this is the ε' < ε we need to assume in order to have o and p belong to the same cluster.
Both the coredistance and the reachabilitydistance are undefined if no sufficiently dense cluster (w.r.t. ε) is available. Given a sufficiently large ε, this will never happen, but then every εneighborhood query will return the entire database, resulting in O(n^{2}) runtime. Hence, the ε parameter is required to cut off the density of clusters that is no longer considered to be interesting and to speed up the algorithm this way.
The parameter ε is strictly speaking not necessary. It can be set to a maximum value. When a spatial index is available, it does however play a practical role when it comes to complexity. It is often claimed that OPTICS abstracts from DBSCAN by removing this parameter, at least to the amount of only having to give a maximum value.
Pseudocode
The basic approach of OPTICS is similar to DBSCAN, but instead of maintaining a set of known, but so far unprocessed cluster members, a priority queue (e.g. using an indexed heap) is used.
OPTICS(DB, eps, MinPts) for each point p of DB p.reachabilitydistance = UNDEFINED for each unprocessed point p of DB N = getNeighbors(p, eps) mark p as processed output p to the ordered list Seeds = empty priority queue if (coredistance(p, eps, Minpts) != UNDEFINED) update(N, p, Seeds, eps, Minpts) for each next q in Seeds N' = getNeighbors(q, eps) mark q as processed output q to the ordered list if (coredistance(q, eps, Minpts) != UNDEFINED) update(N', q, Seeds, eps, Minpts)
In update(), the priority queue Seeds is updated with the εneighborhood of p and q, respectively:
update(N, p, Seeds, eps, Minpts) coredist = coredistance(p, eps, MinPts) for each o in N if (o is not processed) newreachdist = max(coredist, dist(p,o)) if (o.reachabilitydistance == UNDEFINED) // o is not in Seeds o.reachabilitydistance = newreachdist Seeds.insert(o, newreachdist) else // o in Seeds, check for improvement if (newreachdist < o.reachabilitydistance) o.reachabilitydistance = newreachdist Seeds.moveup(o, newreachdist)
OPTICS hence outputs the points in a particular ordering, annotated with their smallest reachability distance (in the original algorithm, the core distance is also exported, but this is not required for further processing).
Extracting the Clusters
Using a reachabilityplot (a special kind of dendrogram), the hierarchical structure of the clusters can be obtained easily. It is a 2D plot, with the ordering of the points on the xaxis and the reachability distance on the yaxis. Since points belonging to a cluster have a low reachability distance to their nearest neighbor, the clusters show up as valleys in the reachability plot. The deeper the valley, the denser the cluster.
The image on the right illustrates this concept. In its upper half, an artificial example of a database consisting of twodimensional, spatial points is shown. The lower part shows the reachability plot as computed by OPTICS. The black lines link some clusters to their respective valleys. The horizontal red line is an example on how to obtain a clustering. Each valley it crosses is made a cluster of its own. If the line was moved down, more clusters would emerge, especially for the topmost cluster, which features varying densities.
Note that deriving clusters in such a way yields the same result on core points of running DBSCAN on the data with ε set to the chosen reachabilitydistance threshold. The assignment of noncore points to neighboring clusters is nondeterministic in DBSCAN, too.
The blue points in this image are considered noise, and no valley is found in their reachability plot. This is subject to the ε parameter, which bounds the density of clusters.
A more advanced analysis does not use a specific value of ε, but instead looks for spikes that separate clusters. This can be used to obtain a hierarchical clustering that cannot be achieved by a single DBSCAN run.
Complexity
Like DBSCAN, OPTICS processes each point once, and performs one εneighborhood query during this processing. Given a spatial index that grants a neighborhood query in O(log n) runtime, an overall runtime of is obtained. The authors of the original OPTICS paper report an actual constant slowdown factor of 1.6 compared to DBSCAN. Note that the value of ε might heavily influence the cost of the algorithm, since a value too large might raise the cost of a neighborhood query to linear complexity.
In particular, choosing ε > max _{x,y}d(x,y) (larger than the maximum distance in the data set) is possible, but will obviously lead to quadratic complexity, since every neighborhood query will return the full data set. Even when no spatial index is available, this comes at additional cost in managing the heap. Therefore, ε should be chosen appropriately for the data set.
Extensions
OPTICSOF^{[3]} is an outlier detection algorithm based on OPTICS. The main use obviously is the extraction of outliers from an existing run of OPTICS at low cost compared to using a different outlier detection method.
DeLiClu^{[4]}, DensityLinkClustering combines ideas from singlelinkage clustering and OPTICS, eliminating the ε parameter and offering performance improvements over OPTICS.
HiSC^{[5]} is a hierarchical subspace clustering (axisparallel) method based on optics.
HiCO^{[6]} is a hierarchical correlation clustering algorithm based on OPTICS.
DiSH^{[7]} is an improvement over HiSC that can find more complex hierarchies.
Availability
Example implementations of OPTICS, OPTICSOF, DeLiClu, HiSC, HiCO and DiSH are available in the ELKI framework.
References
 ^ Mihael Ankerst, Markus M. Breunig, HansPeter Kriegel, Jörg Sander (1999). "OPTICS: Ordering Points To Identify the Clustering Structure". ACM SIGMOD international conference on Management of data. ACM Press. pp. 49–60. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.129.6542.
 ^ Martin Ester, HansPeter Kriegel, Jörg Sander, Xiaowei Xu (1996). "A densitybased algorithm for discovering clusters in large spatial databases with noise". In Evangelos Simoudis, Jiawei Han, Usama M. Fayyad. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD96). AAAI Press. pp. 226–231. ISBN 1577350049. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.71.1980.
 ^ Markus M. Breunig, HansPeter Kriegel, Raymond T. Ng and Jörg Sander (1999). "OPTICSOF: Identifying Local Outliers". Principles of Data Mining and Knowledge Discovery. SpringerVerlag. pp. 262–270. doi:10.1007/b72280. ISBN 9783540664901. http://springerlink.metapress.com/content/76bx6413gqb4tvta/.
 ^ Achtert, E.; Böhm, C.; Kröger, P. (2006). "DeLiClu: Boosting Robustness, Completeness, Usability, and Efficiency of Hierarchical Clustering by a Closest Pair Ranking". LNCS: Advances in Knowledge Discovery and Data Mining 3918: 119–128. doi:10.1007/11731139_16.
 ^ Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; MüllerGorman, I.; Zimek, A. (2006). "Finding Hierarchies of Subspace Clusters". LNCS: Knowledge Discovery in Databases: PKDD 2006 4213: 446–453. doi:10.1007/11871637_42.
 ^ Achtert, E.; Böhm, C.; Kröger, P.; Zimek, A. (2006). "Mining Hierarchies of Correlation Clusters". Proc. 18th International Conference on Scientific and Statistical Database Management (SSDBM): 119–128. doi:10.1109/SSDBM.2006.35.
 ^ Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; MüllerGorman, I.; Zimek, A. (2007). "Detection and Visualization of Subspace Cluster Hierarchies". LNCS: Advances in Databases: Concepts, Systems and Applications 4443: 152–163. doi:10.1007/9783540717034_15.
Categories: Data clustering algorithms
Wikimedia Foundation. 2010.
Look at other dictionaries:
OPTICS — (Ordering Points To Identify the Clustering Structure, etwa ‚Punkte ordnen um die Clusterstruktur zu identifizieren‘) ist ein dichtebasierter Algorithmus zur Clusteranalyse. Er wurde von Mihael Ankerst, Markus M. Breunig, Hans Peter Kriegel und… … Deutsch Wikipedia
Adaptiveadditive algorithm — In the studies of Fourier optics, sound synthesis, stellar interferometry, optical tweezers, and diffractive optical elements (DOEs) it is often important to know the spatial frequency phase of an observed wave source. In order to reconstruct… … Wikipedia
Gerchberg–Saxton algorithm — The Gerchberg Saxton (GS) algorithm is an iterative algorithm for retrieving the phase of a pair of light distributions (or any other mathematically valid distribution) related via a propagating function, such as the Fourier transform, if their… … Wikipedia
Nonimaging optics — (also called anidolic optics)[1][2] is the branch of optics concerned with the optimal transfer of light radiation between a source and a target. Unlike traditional imaging optics, the techniques involved do not attempt to form an image of the… … Wikipedia
Adaptive optics — (AO) is a technology used to improve the performance of optical systems by reducing the effects of rapidly changing optical distortion. It is used in astronomical telescopes and laser communication systems to remove the effects of atmospheric… … Wikipedia
Fourier optics — is the study of classical optics using techniques involving Fourier transforms and can be seen as an extension of the Huygens Fresnel principle. The underlying theorem that light waves can be described as made up of sinusoidal waves, in a manner… … Wikipedia
Deutsch–Jozsa algorithm — The Deutsch–Jozsa algorithm is a quantum algorithm, proposed by David Deutsch and Richard Jozsa in 1992[1] with improvements by Richard Cleve, Artur Ekert, Chiara Macchiavello, and Michele Mosca in 1998.[2] Although it is of little practical use … Wikipedia
Differencemap algorithm — Iterations 0, 100, 200, 300 and 400 in the difference map reconstruction of a grayscale image from its Fourier transform modulus The difference map algorithm is a search algorithm for general constraint satisfaction problems. It is a meta… … Wikipedia
Difference map algorithm — The difference map algorithm is a search algorithm for general constraint satisfaction problems. It is a meta algorithm in the sense that it is built from more basic algorithms that perform projections onto constraint sets. From a mathematical… … Wikipedia
DBSCAN — (for density based spatial clustering of applications with noise) is a data clustering algorithm proposed by Martin Ester, Hans Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.[1] It is a density based clustering algorithm because it finds a… … Wikipedia