 Biclustering

Biclustering, coclustering, or twomode clustering^{[1]} is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. The term was first introduced by Mirkin^{[2]} (recently by Cheng and Church^{[3]} in gene expression analysis), although the technique was originally introduced much earlier^{[2]} (i.e., by J.A. Hartigan^{[4]}).
Given a set of m rows in n columns (i.e., an matrix), the biclustering algorithm generates biclusters  a subset of rows which exhibit similar behavior across a subset of columns, or vice versa.
Contents
Complexity
The complexity of the biclustering problem depends on the exact problem formulation, and particularly on the merit function used to evaluate the quality of a given bicluster. However most interesting variants of this problem are NPcomplete requiring either large computational effort or the use of lossy heuristics to shortcircuit the calculation.^{[5]}
Type of Bicluster
Different biclustering algorithms have different definitions of bicluster. ^{[5]}
They are:
 Bicluster with constant values (a),
 Bicluster with constant values on rows (b) or columns (c),
 Bicluster with coherent values (d, e).
a) Bicluster with constant values 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 b) Bicluster with constant values on rows 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 c) Bicluster with constant values on columns 1.0 2.0 3.0 4.0 5.0 1.0 2.0 3.0 4.0 5.0 1.0 2.0 3.0 4.0 5.0 1.0 2.0 3.0 4.0 5.0 1.0 2.0 3.0 4.0 5.0 d) Bicluster with coherent values (additive) 1.0 4.0 5.0 0.0 1.5 4.0 7.0 8.0 3.0 4.5 3.0 6.0 7.0 2.0 3.5 5.0 8.0 9.0 4.0 5.5 2.0 5.0 6.0 1.0 2.5 e) Bicluster with coherent values (multiplicative) 1.0 0.5 2.0 0.2 0.8 2.0 1.0 4.0 0.4 1.6 3.0 1.5 6.0 0.6 2.4 4.0 2.0 8.0 0.8 3.2 5.0 2.5 10.0 1.0 4.0
The relationship between these cluster models and other types of clustering such as correlation clustering is discussed in.^{[6]}Algorithms
There are many biclustering algorithms developed for bioinformatics, including: block clustering, CTWC (Coupled TwoWay Clustering), ITWC (Interrelated TwoWay Clustering), δbicluster, δpCluster, δpattern, FLOC, OPC, Plaid Model, OPSMs (Orderpreserving submatrixes), Gibbs, SAMBA (StatisticalAlgorithmic Method for Bicluster Analysis),^{[7]} , Robust Biclustering Algorithm (RoBA), Crossing Minimization ^{[8]} , cMonkey,^{[9]} PRMs, DCC, LEB (Localize and Extract Biclusters), QUBIC (QUalitative BIClustering), BCCA (BiCorrelation Clustering Algorithm) and FABIA (Factor Analysis for Bicluster Acquisition).^{[10]} Biclustering algorithms have also been proposed and used in other application fields under the names coclustering, bidimentional clustering, and subspace clustering.^{[5]}
Given the known importance of discovering local patterns in time series data, recent proposals have addressed the biclustering problem in the specific case of time series gene expression data. In this case, the interesting biclusters can be restricted to those with contiguous columns. This restriction leads to a tractable problem and enables the development of efficient exaustive enumeration algorithms such as CCCBiclustering ^{[11]} and eCCCBiclustering ^{[12]}. These algorithms ﬁnd and report all maximal biclusters with coherent and contiguous columns with perfect/approximate expression patterns, in time linear/polynomial in the size of the time series gene expression matrix using eﬃcient string processing techniques based on suffix trees.
Some recent algorithms have attempted to include additional support for biclustering rectangular matrices in the form of other datatypes, including cMonkey.
There is an ongoing debate about how to judge the results of these methods, as biclustering allows overlap between clusters and some algorithms allow the exclusion of hardtoreconcile columns/conditions. Not all of the available algorithms are deterministic and the analyst must pay attention to the degree to which results represent stable minima. Because this is an unsupervisedclassification problem, the lack of a gold standard makes it difficult to spot errors in the results. One approach is to utilize multiple biclustering algorithms, with majority or supermajority voting amongst them deciding the best result. Another way is to analyse the quality of shifting and scaling patterns in biclusters.^{[13]} Biclustering has been used in the domain of text mining (or classification) where it is popularly known as coclustering .^{[14]} Text corpora are represented in a vectorial form as a matrix D whose rows denote the documents and whose columns denote the words in the dictionary. Matrix elements D_{ij} denote occurrence of word j in document i. Coclustering algorithms are then applied to discover blocks in D that correspond to a group of documents (rows) characterized by a group of words(columns).
Several approaches have been proposed based on the information contents of the resulting blocks: matrixbased approaches such as SVD and BVD, and graphbased approaches. Informationtheoretic algorithms iteratively assign each row to a cluster of documents and each column to a cluster of words such that the mutual information is maximized. Matrixbased methods focus on the decomposition of matrices into blocks such that the error between the original matrix and the regenerated matrices from the decomposition is minimized. Graphbased methods tend to minimize the cuts between the clusters. Given two groups of documents d_{1} and d_{2}, the number of cuts can be measured as the number of words that occur in documents of groups d_{1} and d_{2}.
More recently (Bisson and Hussain)^{[14]} have proposed a new approach of using the similarity between words and the similarity between documents to cocluster the matrix. Their method (known as χSim, for cross similarity) is based on finding documentdocument similarity and wordword similarity, and then using classical clustering methods such as hierarchical clustering. Instead of explicitly clustering rows and columns alternately, they consider higherorder occurrences of words, inherently taking into account the documents in which they occur. Thus, the similarity between two words is calculated based on the documents in which they occur and also the documents in which "similar" words occur. The idea here is that two documents about the same topic do not necessarily use the same set of words to describe it but a subset of the words and other similar words that are characteristic of that topic. This approach of taking higherorder similarities takes the latent semantic structure of the whole corpus into consideration with the result of generating a better clustering of the documents and words.
In contrast to other approaches, FABIA is a multiplicative model that assumes realistic nonGaussian signal distributions with heavy tails. FABIA utilizes well understood model selection techniques like variational approaches and applies the Bayesian framework. The generative framework allows FABIA to determine the information content of each bicluster to separate spurious biclusters from true biclusters.
See also
 Formal concept analysis
 Biclique
 Galois connection
References
 ^ Van Mechelen I, Bock HH, De Boeck P (2004). "Twomode clustering methods:a structured overview". Statistical Methods in Medical Research 13 (5): 363–94. doi:10.1191/0962280204sm373ra. PMID 15516031.
 ^ ^{a} ^{b} Mirkin, Boris (1996). Mathematical Classification and Clustering. Kluwer Academic Publishers. ISBN 0792341597.
 ^ Cheng Y, Church GM (2000). "Biclustering of expression data". Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology: 93–103.
 ^ Hartigan JA (1972). "Direct clustering of a data matrix". Journal of the American Statistical Association (American Statistical Association) 67 (337): 123–9. doi:10.2307/2284710. JSTOR 2284710.
 ^ ^{a} ^{b} ^{c} Madeira SC, Oliveira AL (2004). "Biclustering Algorithms for Biological Data Analysis: A Survey". IEEE Transactions on Computational Biology and Bioinformatics 1 (1): 24–45. doi:10.1109/TCBB.2004.2. PMID 17048406.
 ^ Kriegel, H.P.; Kröger, P., Zimek, A. (March 2009). "Clustering High Dimensional Data: A Survey on Subspace Clustering, Patternbased Clustering, and Correlation Clustering". ACM Transactions on Knowledge Discovery from Data (TKDD) 3 (1): 1–58. doi:10.1145/1497577.1497578. http://doi.acm.org/10.1145/1497577.1497578.
 ^ Tanay A, Sharan R, Kupiec M and Shamir R (2004). "Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data". Proc Natl Acad Sci USA 101 (9): 2981–2986. doi:10.1073/pnas.0308661100. PMC 365731. PMID 14973197. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=365731.
 ^ Abdullah, Ahsan; Hussain, Amir (2006). "A new biclustering technique based on crossing minimization". Neurocomputing, vol. 69 issue 1618 69 (16–18): 1882–1896. doi:10.1016/j.neucom.2006.02.018. http://linkinghub.elsevier.com/retrieve/pii/S0925231206001615.
 ^ Reiss DJ, Baliga NS, Bonneau R (2006). "Integrated biclustering of heterogeneous genomewide datasets for the inference of global regulatory networks". BMC Bioinformatics 2: 280–302. doi:10.1186/147121057280. PMC 1502140. PMID 16749936. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=1502140.
 ^ Hochreiter S, Bodenhofer U, Heusel M, Mayr A, Mitterecker A, Kasim A, Khamiakova T, Van Sanden S, Lin D, Talloen W, Bijnens L, Gohlmann HWH, Shkedy Z, Clevert DA (2010). "FABIA: factor analysis for bicluster acquisition". Bioinformatics 26 (12): 1520–1527. doi:10.1093/bioinformatics/btq227. PMC 2881408. PMID 20418340. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=2881408.
 ^ Madeira SC, Teixeira MC, SáCorreia I, Oliveira AL (2010). "Identification of Regulatory Modules in Time Series Gene Expression Data using a Linear Time Biclustering Algorithm". IEEE Transactions on Computational Biology and Bioinformatics 1 (7): 153–165. doi:10.1109/TCBB.2008.34.
 ^ Madeira SC, Oliveira AL (2009). "A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series". Algorithms for Molecular Biology 4 (8).
 ^ AguilarRuiz JS (2005). "Shifting and scaling patterns from gene expression data". Bioinformatics 21 (10): 3840–3845. doi:10.1093/bioinformatics/bti641. PMID 16144809.
 ^ ^{a} ^{b} Bission G. and Hussain F. (2008). "ChiSim: A new similarity measure for the coclustering task". ICMLA: 211–217. doi:10.1109/ICMLA.2008.103.
Others
 A. Tanay. R. Sharan, and R. Shamir, "Biclustering Algorithms: A Survey", In Handbook of Computational Molecular Biology, Edited by Srinivas Aluru, Chapman (2004)
 Kluger Y, Basri R, Chang JT, Gerstein MB (2003). "Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions". Genome Research 13 (4): 703–716. doi:10.1101/gr.648603. PMC 430175. PMID 12671006. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=430175.
External links
Categories:
Wikimedia Foundation. 2010.
Look at other dictionaries:
biclustering — noun The simultaneous clustering of the rows and columns of a matrix See Also: bicluster … Wiktionary
Classification double — La Classification double ou « Biclustering » est une technique d exploration de données non supervisée permettant de segmenter simultanément les lignes et les colonnes d une matrice. Plus formellement[1], la définition de la… … Wikipédia en Français
Correlation clustering — In machine learning, correlation clustering or cluster editing operates in a scenario where the relationship between the objects are known instead of the actual representation of the objects. For example, given a signed graph G = (V,E) where the… … Wikipedia
Data mining in agriculture — Contents 1 Introduction 2 Applications 2.1 Prediction of problematic wine fermentations 2.2 Detection of diseases from sounds issued by animals … Wikipedia
Clique (graph theory) — A graph with 23 1 vertex cliques (its vertices), 42 2 vertex cliques (its edges), 19 3 vertex cliques (the light blue triangles), and 2 4 vertex cliques (dark blue). Six of the edges and 11 of the triangles form maximal cliques. The two dark blue … Wikipedia
Cluster analysis — The result of a cluster analysis shown as the coloring of the squares into three clusters. Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more… … Wikipedia
bicluster — noun A subset of rows or columns of a matrix produced by biclustering … Wiktionary
coclustering — noun biclustering … Wiktionary
Clustering highdimensional data — is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high dimensional data spaces are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of… … Wikipedia
Glossaire du data mining — Exploration de données Articles principaux Exploration de données Fouille de données spatiales Fouille du web Fouille de flots de données Fouille de textes … Wikipédia en Français