Clustering high-dimensional data

Clustering high-dimensional data

Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional data spaces are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once, and the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the dictionary.

Contents

Problems

According to Kriegel, Kröger & Zimek (2009), four problems need to be overcome for clustering in high-dimensional data:

  • Multiple dimensions are hard to think in, impossible to visualize, and, due to the exponential growth of the number of possible values with each dimension, impossible to enumerate. This problem is known as the curse of dimensionality.
  • The concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges. The discrimination of the nearest and farthest point in particular becomes meaningless:
\lim_{d \to \infty} \frac{dist_\max - dist_\min}{dist_\min} \to 0
  • A cluster is intended to group objects that are related, based on observations of their attribute's values. However, given a large number of attributes some of the attributes will usually not be meaningful for a given cluster. For example, in newborn screening a cluster of samples might identify newborns that share similar blood values, which might lead to insights about the relevance of certain blood values for a disease. But for different diseases, different blood values might form a cluster, and other values might be uncorrelated. This is known as the local feature relevance problem: different clusters might be found in different subspaces, so a global filtering of attributes is not sufficient.
  • Given a large number of attributes, it is likely that some attributes are correlated. Hence, clusters might exist in arbitrarily oriented affine subspaces.

Recent research by Houle et al. (2010) indicates that the discrimination problems only occur when there is a high number of irrelevant dimensions, and that shared-nearest-neighbor approaches can improve results.

Approaches

Approaches towards clustering in axis-parallel or arbitrarily oriented affine subspaces differ in how they interpret the overall goal, which is finding clusters in data with high dimensionality. This distinction is proposed in Kriegel, Kröger & Zimek (2009). An overall different approach is to find clusters based on pattern in the data matrix, often referred to as biclustering, which is a technique frequently utilized in bioinformatics.

Subspace Clustering

Example 2D space with subspace clusters

Subspace clustering is the task of detecting all clusters in all subspaces. This means that a point might be a member of multiple clusters, each existing in a different subspace. Subspaces can either be axis-parallel or affine. The term is often used synonymous with general clustering in high-dimensional data.

The image on the right shows a mere two-dimensional space where a number of clusters can be identified. In the one-dimensional subspaces, the clusters ca (in subspace {x}) and cb, cc, cd (in subspace {y}) can be found. cc cannot be considered a cluster in a two-dimensional (sub-)space, since it is too sparsely distributed in the x axis. In two dimensions, the two clusters cab and cad can be identified.

The problem of subspace clustering is given by the fact that there are 2d different subspaces of a space with d dimensions. If the subspaces are not axis-parallel, an infinite number of subspaces is possible. Hence, subspace clustering algorithm utilize some kind of heuristic to remain computationally feasible, at the risk of producing inferior results. For example, the downward-closure property (cf. association rules) can be used to build higher-dimensional subspaces only by combining lower-dimensional ones, as any subspace T containing a cluster, will result in a full space S also to contain that cluster (i.e. S ⊆ T), an approach taken by most of the traditional algorithms such as CLIQUE (Agrawal et al. 2005) and SUBCLU (Kailing, Kriegel & Kröger 2004).

Projected Clustering

Projected clustering seeks to assign each point to a unique cluster, but clusters may exist in different subspaces. The general approach is to use a special distance function together with a regular clustering algorithm.

For example, the PreDeCon algorithm checks which attributes seem to support a clustering for each point, and adjusts the distance function such that dimensions with low variance are amplified in the distance function (Bohm et al. 2004). In the figure above, the cluster cc might be found using DBSCAN with a distance function that places less emphasis on the x-axis and thus exaggerates the low difference in the y-axis sufficiently enough to group the points into a cluster.

PROCLUS uses a similar approach with a k-medoid clustering (Aggarwal et al. 1999). Initial medoids are guessed, and for each medoid the subspace spanned by attributes with low variance is determined. Points are assigned to the medoid closest, considering only the subspace of that medoid in determining the distance. The algorithm then proceeds as the regular PAM algorithm.

If the distance function weights attributes differently, but never with 0 (and hence never drops irrelevant attributes), the algorithm is called a "soft"-projected clustering algorithm.

Hybrid Approaches

Not all algorithms try to either find a unique cluster assignment for each point or all clusters in all subspaces; many settle for a result in between, where a number of possibly overlapping, but not necessarily exhaustive set of clusters are found. An example is FIRES, which is from its basic approach a subspace clustering algorithm, but uses a heuristic too aggressive to credibly produce all subspace clusters (Kriegel et al. 2005).

Correlation Clustering

Another type of subspaces is considered in Correlation clustering (Data Mining).

References

  • Aggarwal, Charu C.; Wolf, Joel L.; Yu, Philip S.; Procopiuc, Cecilia; Park, Jong Soo (1999), "Fast algorithms for projected clustering", ACM SIGMOD Record (New York, NY: ACM) 28 (2): 61–72, doi:10.1145/304181.304188 
  • Agrawal, Rakesh; Gehrke, Johannes; Gunopulos, Dimitrios; Raghavan, Prabhakar (2005), "Automatic Subspace Clustering of High Dimensional Data", Data Mining and Knowledge Discovery (Springer Netherlands) 11 (1): 5–33, doi:10.1007/s10618-005-1396-1 
  • Böhm, Christian; Kailing, Karin; Kriegel, Hans-Peter; Kröger, Peer (2004), "Density Connected Clustering with Local Subspace Preferences", Data Mining, IEEE International Conference on (Los Alamitos, CA, USA: IEEE Computer Society): 24–34, doi:10.1109/ICDM.2004.10087, ISBN 0-7695-2142-8 
  • Kailing, Karin; Kriegel, Hans-Peter; Kröger, Peer (2004), "Density-Connected Subspace Clustering for High-Dimensional Data", Proceedings of the Fourth SIAM International Conference on Data Mining (SIAM): 246–257, http://www.dbs.informatik.uni-muenchen.de/Publikationen/Papers/sdm04-subclu.pdf, retrieved 2009-05-25 
  • Kriegel, Hans-Peter; Kröger, Peer; Renz, Matthias; Wurst, Sebastian (2005), "A Generic Framework for Efficient Subspace Clustering of High-Dimensional Data", Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM) (Washington, DC: IEEE Computer Society): 205–257, doi:10.1109/ICDM.2005.5, ISBN 0-7695-2278-5 
  • Kriegel, Hans-Peter; Kröger, Peer; Zimek, Arthur (2009), "Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering", ACM Transactions on Knowledge Discovery from Data (New York, NY: ACM) 3 (1): 1–58, doi:10.1145/1497577.1497578 
  • Houle, Michael E.; Kriegel, Hans-Peter; Kröger, Peer; Schubert, Erich; Zimek, Arthur (2010), "Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?", Proceedings of the 21st International Conference on Scientific and Statistical Database Management (SSDBM) (Heidelberg, Germany: Springer) 

Wikimedia Foundation. 2010.

Игры ⚽ Поможем написать реферат

Look at other dictionaries:

  • Data mining in meteorology — Meteorology is the interdisciplinary scientific study of the atmosphere. It observes the changes in temperature, air pressure, moisture and wind direction. Usually, temperature, pressure, wind measurements and humidity are the variables that are… …   Wikipedia

  • Correlation clustering — In machine learning, correlation clustering or cluster editing operates in a scenario where the relationship between the objects are known instead of the actual representation of the objects. For example, given a signed graph G = (V,E) where the… …   Wikipedia

  • Data mining — Not to be confused with analytics, information extraction, or data analysis. Data mining (the analysis step of the knowledge discovery in databases process,[1] or KDD), a relatively young and interdisciplinary field of computer science[2][3] is… …   Wikipedia

  • Determining the number of clusters in a data set — Determining the number of clusters in a data set, a quantity often labeled k as in the k means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem. For a certain …   Wikipedia

  • Oracle Data Mining — (ODM) is an option of Oracle Corporation s Relational Database Management System (RDBMS) Enterprise Edition (EE). It contains several data mining and data analysis algorithms for classification, prediction, regression, classification,… …   Wikipedia

  • Canopy clustering algorithm — The canopy clustering algorithm is an unsupervised clustering algorithm related to the K means algorithm.It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical because of the …   Wikipedia

  • Consensus clustering — Clustering is the assignment of objects into groups (called clusters) so that objects from the same cluster are more similar to each other than objects from different clusters. Often similarity is assessed according to a distance measure.… …   Wikipedia

  • Cluster analysis — The result of a cluster analysis shown as the coloring of the squares into three clusters. Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more… …   Wikipedia

  • Curse of dimensionality — The curse of dimensionality refers to various phenomena that arise when analyzing and organizing high dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low dimensional settings such as the physical space… …   Wikipedia

  • Biclustering — Biclustering, co clustering, or two mode clustering[1] is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. The term was first introduced by Mirkin[2] (recently by Cheng and Church[3] in gene… …   Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”