# Bag of words model in computer vision

This article introduces the bag-of-words model (BoW) in computer vision, especially for object categorization. From now on, "BoW" refers to the BoW model in computer vision unless explicitly stated otherwise.

Before introducing the BoW model in computer vision, the BoW in natural language processing (NLP) is briefly reviewed. The BoW in NLP is a popular method for representing documents which ignores word order. For example, "a good book" and "book good a" are the same under this model. The BoW model allows dictionary-based modeling, and each document looks like a "bag" (thus the order is not considered) containing some words from the dictionary. Computer vision researchers use a similar idea for image representation (here an image may refer to a particular object, such as an image of a car): an image can be treated as a document, and features extracted from the image are considered the "words" (usually some manipulations are needed, as described below). The BoW representation serves as the basic element for further processing, such as object categorization.

Representation based on the BoW model

Text Document Representation based on the BoW model

The text document representation based on the BoW model in NLP is reviewed first. Here are two simple text documents:
* John likes to watch movies. Mary likes too.
* John also likes to watch football games.

Based on these two text documents, a dictionary is constructed as:
* dictionary = {1:"John", 2:"likes", 3:"to", 4:"watch", 5:"movies", 6:"also", 7:"football", 8:"games", 9:"Mary", 10:"too"},

which has 10 distinct words. Using the indexes of the dictionary, each document is represented by a 10-entry vector:
* [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
* [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

where each entry counts the occurrences of the corresponding dictionary word (this is also the histogram representation). As we can see, this vector representation does not preserve the order of the words in the original sentences. This kind of representation has several successful applications, for example latent Dirichlet allocation (D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation", Journal of Machine Learning Research, 3:993–1022, 2003).
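As a minimal illustration, the two documents above can be turned into count vectors in a few lines of Python. This is a toy sketch: the tokenizer and the first-appearance index order are simplifications, so the resulting indices differ slightly from the dictionary listed above.

```python
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace, treating "." as a separator.
    return text.lower().replace(".", " ").split()

def build_dictionary(documents):
    # Assign an index to each distinct word, in order of first appearance.
    dictionary = {}
    for doc in documents:
        for word in tokenize(doc):
            if word not in dictionary:
                dictionary[word] = len(dictionary)
    return dictionary

def bow_vector(document, dictionary):
    # Count occurrences of each dictionary word; word order is discarded.
    counts = Counter(tokenize(document))
    return [counts[w] for w in dictionary]

docs = ["John likes to watch movies. Mary likes too.",
        "John also likes to watch football games."]
d = build_dictionary(docs)                        # 10 distinct words
vectors = [bow_vector(doc, d) for doc in docs]
```

Note that "likes" gets a count of 2 in the first vector, while the permuted sentence "likes John movies to watch ..." would give exactly the same vector.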

Image Representation based on the BoW model

Figure 1 shows the basic idea of the BoW model. To represent an image using the BoW model, the image is treated as a document, and "words" in the image need to be defined as well. However, a "word" in an image is not an off-the-shelf entity like a word in a text document. Defining image words usually involves three steps: feature detection, feature description and codebook generation (L. Fei-Fei and P. Perona, "A Bayesian Hierarchical Model for Learning Natural Scene Categories", Proc. of IEEE Computer Vision and Pattern Recognition, pp. 524–531, 2005). A definition of the BoW model can be the "histogram representation based on independent features" (L. Fei-Fei, R. Fergus, and A. Torralba, "Recognizing and Learning Object Categories", CVPR 2007 short course, http://people.csail.mit.edu/torralba/shortCourseRLOC/index.html).

Feature Detection

Given an image, feature detection extracts several local patches (or regions), which are considered candidates for the basic elements, "words".

Regular Grid

Figure 2 is an example of the regular grid method for feature detection. The regular grid (J. Vogel and B. Schiele, "On Performance Characterization and Optimization for Image Retrieval", Proc. of European Conference on Computer Vision, pp. 51–55, 2002) is probably the simplest yet effective method for feature detection. In this method, the image is evenly segmented by horizontal and vertical lines, yielding a set of local patches. This method shows very promising results for natural scene categorization. Its limitation is that it uses little information from the image itself.
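The regular grid can be sketched in a few lines. The non-overlapping square patches and the (top, left, bottom, right) box format are illustrative choices, not part of any particular paper:

```python
def regular_grid_patches(height, width, step):
    # Tile the image with non-overlapping step x step boxes,
    # returned as (top, left, bottom, right) tuples.
    boxes = []
    for top in range(0, height - step + 1, step):
        for left in range(0, width - step + 1, step):
            boxes.append((top, left, top + step, left + step))
    return boxes

# A 100x100 image with a 20-pixel grid yields 5 x 5 = 25 patches.
boxes = regular_grid_patches(100, 100, 20)
```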

Interest Point Detector

Interest point detectors try to detect salient patches, such as edges, corners and blobs, in an image. These salient patches, like the regions that attract human attention, are considered more important than other patches and might be more useful for object categorization. Some famous detectors are the Harris corner detector (C. Harris and M. Stephens, "A combined corner and edge detector", Proc. of the 4th Alvey Vision Conference, pp. 147–151, 1988), Lowe's DoG (Difference of Gaussians) detector (D. Lowe, "Object Recognition from Local Scale-Invariant Features", Proc. of International Conference on Computer Vision, pp. 1150–1157, 1999) and the Kadir-Brady saliency detector (T. Kadir and M. Brady, "Scale, saliency and image description", International Journal of Computer Vision, 45(2):83–105, 2001). Figure 3 shows a result of the Harris corner detector.
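To make the idea concrete, here is a bare-bones version of the Harris corner response, $R = \det(M) - k\,\mathrm{trace}(M)^2$, computed from image gradients summed over a 3x3 window. This is a toy pure-Python sketch; real implementations smooth the structure tensor with a Gaussian window and add thresholding and non-maximum suppression.

```python
def harris_response(img, k=0.04):
    # Harris response R = det(M) - k * trace(M)^2 per interior pixel,
    # where M is the structure tensor summed over a 3x3 window.
    h, w = len(img), len(img[0])
    # Central-difference gradients, clamped at the borders.
    Ix = [[(img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]) / 2.0
           for x in range(w)] for y in range(h)]
    Iy = [[(img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]) / 2.0
           for x in range(w)] for y in range(h)]
    R = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            sxx = syy = sxy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx, gy = Ix[y + dy][x + dx], Iy[y + dy][x + dx]
                    sxx += gx * gx
                    syy += gy * gy
                    sxy += gx * gy
            R[y][x] = sxx * syy - sxy * sxy - k * (sxx + syy) ** 2
    return R

# Synthetic image: a bright square whose top-left corner sits at (4, 4).
img = [[1.0 if (y >= 4 and x >= 4) else 0.0 for x in range(9)]
       for y in range(9)]
R = harris_response(img)
```

On this synthetic image, the corner pixel gets a positive response, while a pixel on a straight edge gets a negative one, which is exactly the behaviour the detector exploits.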

Other Methods

In addition, researchers also use random sampling (M. Vidal-Naquet and S. Ullman, "Object recognition with informative features and linear classification", Proc. of IEEE International Conference on Computer Vision, pp. 281–288, 2003; R. Marée, P. Geurts, J. Piater, and L. Wehenkel, "Random subwindows for robust image classification", Proc. of IEEE Computer Vision and Pattern Recognition, pp. 34–40, 2005) and segmentation methods such as Normalized Cut (K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan, "Matching words and pictures", Journal of Machine Learning Research, 3:1107–1135, 2003) for feature detection.

Feature Representation

After feature detection, each image is abstracted by several local patches. Feature representation methods deal with how to represent the patches as numerical vectors. These methods are called feature descriptors. A good descriptor should be able to handle intensity, rotation, scale and affine variations to some extent. One of the most famous descriptors is the scale-invariant feature transform (SIFT). SIFT converts each patch to a 128-dimensional vector. After this step, each image is a collection of vectors of the same dimension (128 for SIFT), where the order of the vectors is of no importance.
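The flavour of such a descriptor can be sketched with a toy, SIFT-inspired example: gradient-orientation histograms accumulated per cell and concatenated. Here 2x2 cells with 8 orientation bins give 32 dimensions; this is an illustrative simplification, as real SIFT uses a 4x4 cell grid with Gaussian weighting and trilinear interpolation, reaching 128 dimensions.

```python
import math

def patch_descriptor(patch, cells=2, bins=8):
    # Toy SIFT-flavoured descriptor: one gradient-orientation histogram
    # per cell, concatenated and L2-normalised.
    n = len(patch)
    cell = n // cells
    desc = [0.0] * (cells * cells * bins)
    for y in range(1, n - 1):
        for x in range(1, n - 1):
            gx = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            gy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            mag = math.hypot(gx, gy)
            ang = math.atan2(gy, gx) % (2 * math.pi)
            b = min(int(ang / (2 * math.pi / bins)), bins - 1)
            c = min(y // cell, cells - 1) * cells + min(x // cell, cells - 1)
            desc[c * bins + b] += mag          # vote weighted by magnitude
    # L2-normalise for robustness to intensity scaling.
    norm = math.sqrt(sum(v * v for v in desc)) or 1.0
    return [v / norm for v in desc]

# Descriptor of an 8x8 intensity ramp: a single dominant orientation.
desc = patch_descriptor([[float(x) for x in range(8)] for _ in range(8)])
```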

Codebook Generation

The final step of the BoW model is to convert the vector-represented patches to "codewords" (analogous to words in text documents), which also produces a "codebook" (analogous to a word dictionary). A codeword can be considered a representative of several similar patches. One simple method is to perform k-means clustering over all the vectors (T. Leung and J. Malik, "Representing and recognizing the visual appearance of materials using three-dimensional textons", International Journal of Computer Vision, 43(1):29–44, 2001). Codewords are then defined as the centers of the learned clusters. The number of clusters is the codebook size (analogous to the size of the word dictionary).

Thus, each patch in an image is mapped to a certain codeword through the clustering process and the image can be represented by the histogram (see Figure 4) of the codewords. Figure 5 shows several examples of codewords mapped back to image patches.
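The whole codebook step can be sketched with plain k-means and nearest-codeword assignment. This is a minimal pure-Python version for illustration; practical systems use optimized clustering libraries and far larger codebooks.

```python
import random

def nearest(v, centres):
    # Index of the closest centre under squared Euclidean distance.
    return min(range(len(centres)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centres[i])))

def kmeans(vectors, k, iters=20, seed=0):
    # Plain k-means: the k cluster centres become the codewords.
    rng = random.Random(seed)
    centres = rng.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v, centres)].append(v)
        # Recompute each centre as its cluster mean (keep empty ones).
        centres = [[sum(dim) / len(c) for dim in zip(*c)] if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres

def bow_histogram(descriptors, codebook):
    # Map each patch descriptor to its closest codeword and count.
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[nearest(d, codebook)] += 1
    return hist

# Toy 2-D "descriptors" forming two well-separated blobs.
descs = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
         [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
codebook = kmeans(descs, 2)
hist = bow_histogram(descs, codebook)
```

With a codebook of size 2, the six patches split evenly between the two codewords, and `hist` is the image's BoW representation.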

Learning and Recognition based on the BoW model

So far, an image has been represented based on the BoW model. Computer vision researchers have developed several learning methods to leverage the BoW model for image-related tasks, such as object categorization. These methods can roughly be divided into two categories: generative models and discriminative models. For the multiple-label categorization problem, the confusion matrix can be used as an evaluation metric.

Generative Models

Here are some notations for this section. Suppose the size of the codebook is $V$.
* $w$: each patch $w$ is a $V$-dimensional vector that has a single component equal to one and all other components equal to zero (in the k-means clustering setting, the single component equal to one indicates the cluster that $w$ belongs to). The $v$th codeword in the codebook can be represented as $w^v = 1$ and $w^u = 0$ for $u \neq v$.
* $\mathbf{w}$: each image is represented by $\mathbf{w} = [w_1, w_2, \cdots, w_N]$, all the patches in an image.
* $d_j$: the $j$th image in an image collection.
* $c$: category of the image.
* $z$: theme or topic of the patch.
* $\pi$: mixture proportion.

Since the BoW model is an analogy to the BoW model in NLP, generative models developed in text domains can also be adapted to computer vision. The simple Naïve Bayes model and hierarchical Bayesian models are discussed below.

Naïve Bayes

The simplest one is the Naïve Bayes classifier (G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints", Proc. of ECCV International Workshop on Statistical Learning in Computer Vision, 2004). Using the language of graphical models, the Naïve Bayes classifier is shown in Figure 6. The basic idea (or assumption) of this model is that each category has its own distribution over the codebook, and that these distributions are quite different from each other. Take a face category and a car category for example. The face category may emphasize the codewords which represent "nose", "eye" and "mouth", while the car category may emphasize the codewords which represent "wheel" and "corner". Given a collection of training examples, the classifier learns different distributions for different categories. The categorization decision is made by
* $c^* = \arg\max_c p(c|\mathbf{w}) = \arg\max_c p(c)\,p(\mathbf{w}|c) = \arg\max_c p(c)\prod_{n=1}^{N} p(w_n|c)$

Since Naïve Bayes classifier is simple yet effective, it is usually used as a baseline method for comparison.
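The decision rule above can be sketched directly over codeword histograms, working in log space with Laplace smoothing. The category names and toy histograms below are invented for illustration:

```python
import math

def train_nb(histograms, labels, V, alpha=1.0):
    # Learn log p(c) and a smoothed codeword distribution log p(w_v|c)
    # for each category c from BoW histograms over a codebook of size V.
    model = {}
    for c in sorted(set(labels)):
        rows = [h for h, l in zip(histograms, labels) if l == c]
        counts = [sum(col) for col in zip(*rows)]
        total = sum(counts) + alpha * V
        model[c] = (math.log(len(rows) / len(labels)),
                    [math.log((n + alpha) / total) for n in counts])
    return model

def classify_nb(hist, model):
    # c* = argmax_c  log p(c) + sum_v hist[v] * log p(w_v | c)
    return max(model, key=lambda c: model[c][0] +
               sum(n * lp for n, lp in zip(hist, model[c][1])))

# Toy data: category "face" favours codewords 0-1, "car" favours 1-2.
hists = [[5, 1, 0], [4, 2, 0], [0, 1, 5], [0, 2, 4]]
labels = ["face", "face", "car", "car"]
model = train_nb(hists, labels, V=3)
```

A new histogram dominated by codeword 0 is then assigned to "face", and one dominated by codeword 2 to "car".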

Hierarchical Bayesian Models

The basic assumption of the Naïve Bayes model does not always hold. For example, a natural scene image (Figure 7) may contain several different themes. Probabilistic latent semantic analysis (pLSA) (T. Hofmann, "Probabilistic Latent Semantic Analysis", Proc. of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999; J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman, "Discovering objects and their location in images", Proc. of International Conference on Computer Vision, 2005) and latent Dirichlet allocation (LDA) are two popular topic models from text domains that tackle this multiple-"theme" problem. Take LDA for example. To model natural scene images using LDA, an analogy is made like this (Figure 9):
* the image category is mapped to the document category;
* the mixture proportion of themes is mapped to the mixture proportion of topics;
* the theme index is mapped to the topic index;
* the codeword is mapped to the word.

This method shows very promising results in natural scene categorization on the [http://vision.cs.princeton.edu/resources_links.html 13 Natural Scene Categories].

Discriminative Models

Since images are represented based on the BoW model, any discriminative model suitable for text document categorization can be tried, such as the support vector machine (SVM) and AdaBoost (T. Serre, L. Wolf, and T. Poggio, "Object Recognition with Features Inspired by Visual Cortex", Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005). The kernel trick is also applicable when a kernel-based classifier such as the SVM is used. The pyramid match kernel is a newly developed kernel based on the BoW model. The local feature approach of using the BoW representation with classifiers learned under different kernels (e.g., the EMD kernel and the $\chi^2$ kernel) has been extensively tested in the area of texture and object recognition (J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid, "Local Features and Kernels for Classification of Texture and Object Categories: a Comprehensive Study", International Journal of Computer Vision, 73(2):213–238, 2007). Very promising results on a number of datasets have been reported. This approach has achieved very impressive results in the [http://www.pascal-network.org/challenges/VOC/ PASCAL Visual Object Classes Challenge].

Pyramid Match Kernel

The pyramid match kernel (K. Grauman and T. Darrell, "The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features", Proc. of IEEE International Conference on Computer Vision, 2005) is a fast kernel function (satisfying Mercer's condition) which maps the BoW features to multi-resolution histograms. One advantage of multi-resolution histograms is their ability to capture co-occurring features. The pyramid match kernel builds the multi-resolution histogram by binning data points into discrete regions of increasingly larger size. Thus, points that do not match at high resolutions have the chance to match at low resolutions (Figure 9). The pyramid match kernel performs an approximate similarity match, without explicit search, and its computation time is only linear in the number of features. Compared with other kernel approaches, the pyramid match kernel is much faster, yet provides competitively accurate results. The pyramid match kernel was applied to the [http://www.mis.informatik.tu-darmstadt.de/Research/Projects/categorization/eth80-db.html ETH-80 database] and the [http://vision.cs.princeton.edu/resources_links.html Caltech 101 database] and showed promising results.
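The core of the kernel can be sketched for 1-D features in [0, 1): histogram intersections are computed at every resolution, and matches that first appear at coarseness level $i$ are discounted by $1/2^i$. This is a simplified version of the published kernel, which handles d-dimensional feature sets and normalises the score:

```python
def histogram(points, bins, lo=0.0, hi=1.0):
    # Bin 1-D points into `bins` equal-width bins over [lo, hi).
    h = [0] * bins
    for p in points:
        h[min(int((p - lo) / (hi - lo) * bins), bins - 1)] += 1
    return h

def intersection(h1, h2):
    return sum(min(a, b) for a, b in zip(h1, h2))

def pyramid_match(X, Y, levels=3):
    # K(X, Y) = sum_i (I_i - I_{i-1}) / 2^i: matches first found at
    # coarser levels count with geometrically decreasing weight.
    k, prev = 0.0, 0
    for i in range(levels + 1):
        bins = 2 ** (levels - i)          # finest grid first
        inter = intersection(histogram(X, bins), histogram(Y, bins))
        k += (inter - prev) / 2 ** i
        prev = inter
    return k
```

Identical feature sets match fully at the finest level, while distant points only match in the coarsest bin and are heavily discounted, so the kernel behaves as a similarity measure.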

Limitations and Recent Developments

One notorious disadvantage of BoW is that it ignores the spatial relationships among the patches, which are very important in image representation. Researchers have proposed several methods to incorporate spatial information. At the feature level, correlogram features can capture spatial co-occurrences of features (S. Savarese, J. Winn, and A. Criminisi, "Discriminative Object Class Models of Appearance and Shape by Correlatons", Proc. of IEEE Computer Vision and Pattern Recognition, 2006). For generative models, relative positions of codewords are also taken into account (E. Sudderth, A. Torralba, W. Freeman, and A. Willsky, "Learning Hierarchical Models of Scenes, Objects, and Parts", Proc. of International Conference on Computer Vision, 2005; E. Sudderth, A. Torralba, W. Freeman, and A. Willsky, "Describing Visual Scenes using Transformed Dirichlet Processes", Proc. of Neural Information Processing Systems, 2005). The hierarchical shape and appearance model for human action (J. C. Niebles and L. Fei-Fei, "A hierarchical model of shape and appearance for human action classification", Proc. of IEEE Computer Vision and Pattern Recognition, 2007) introduces a new part layer (Constellation model) between the mixture proportion and the BoW features, which captures the spatial relationships among parts in the layer. For discriminative models, spatial pyramid matching (S. Lazebnik, C. Schmid, and J. Ponce, "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories", Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2006) performs pyramid matching by partitioning the image into increasingly fine sub-regions and computing histograms of local features inside each sub-region.
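The spatial pyramid representation itself is easy to sketch: BoW histograms are computed over 1x1, 2x2, 4x4, ... grids and concatenated. In this toy version each patch is an (x, y, codeword) triple with normalized coordinates; the published method additionally weights the levels as in the pyramid match kernel.

```python
def spatial_pyramid(patches, codebook_size, levels=2):
    # Concatenate BoW histograms over increasingly fine grids: level l
    # splits the unit square into 2^l x 2^l cells. Each patch is an
    # (x, y, codeword) triple with x, y in [0, 1).
    feats = []
    for l in range(levels + 1):
        g = 2 ** l
        hists = [[0] * codebook_size for _ in range(g * g)]
        for x, y, w in patches:
            cell = min(int(y * g), g - 1) * g + min(int(x * g), g - 1)
            hists[cell][w] += 1
        for h in hists:
            feats.extend(h)
    return feats

# Two patches with different codewords in opposite image corners.
f = spatial_pyramid([(0.1, 0.1, 0), (0.9, 0.9, 1)], codebook_size=2)
```

Unlike the plain BoW histogram, the finer levels record that codeword 0 sits in the top-left region and codeword 1 in the bottom-right, recovering coarse spatial layout.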

Furthermore, the BoW model has not been extensively tested for viewpoint invariance and scale invariance, and its performance in these settings is unclear. The BoW model for object segmentation and localization also lacks study.

References

* [http://people.csail.mit.edu/fergus/iccv2005/bagwords.html A demo for two bag-of-words classifiers] by L. Fei-Fei, R. Fergus, and A. Torralba.

See also

* Part-based models
* Segmentation based object categorization
