3D single object recognition


In computer vision, 3D single object recognition involves recognizing and determining the pose of a user-chosen 3D object in a photograph or range scan. Typically, an example of the object to be recognized is presented to a vision system in a controlled environment, and then, for an arbitrary input such as a video stream, the system locates the previously presented object. This can be done either off-line or in real time. The algorithms for solving this problem are specialized for locating a single pre-identified object, in contrast to algorithms that operate on general classes of objects, such as face recognition systems or 3D generic object recognition. Because photographs are cheap and easy to acquire, a significant amount of research has been devoted to 3D object recognition in photographs.

3D single object recognition in photographs

The method of recognizing a 3D object depends on the properties of the object. For simplicity, many existing algorithms have focused on recognizing rigid objects consisting of a single part, that is, objects whose spatial transformation is a Euclidean motion. Two general approaches have been taken to the problem: pattern recognition approaches use low-level image appearance information to locate an object, while feature-based geometric approaches construct a model of the object to be recognized and match the model against the photograph.

Pattern recognition approaches

These methods use appearance information gathered from pre-captured or pre-computed projections of an object to match the object in a potentially cluttered scene. However, they do not take the 3D geometric constraints of the object into consideration during matching, and they typically do not handle occlusion as well as feature-based approaches. See [Murase and Nayar 1995] and [Selinger and Nelson 1999].
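As a rough illustration of appearance-based matching, the sketch below projects pre-captured views into a low-dimensional subspace and labels a query image with its nearest stored view. The use of PCA, the function names `build_eigenspace` and `nearest_view`, and the flattened-image input format are assumptions made for this example; it is not a reproduction of either cited system.

```python
import numpy as np

def build_eigenspace(views, k):
    """Build a k-dimensional appearance subspace.

    views: (n_views, n_pixels) array of flattened training images.
    Returns the mean image, the basis (k, n_pixels), and the
    coordinates of every training view in the subspace.
    """
    mean = views.mean(axis=0)
    centered = views - mean
    # Rows of vt are the principal directions of the training views.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]
    coords = centered @ basis.T
    return mean, basis, coords

def nearest_view(image, mean, basis, coords):
    """Return (index, distance) of the closest training view in the subspace."""
    q = (image - mean) @ basis.T
    dists = np.linalg.norm(coords - q, axis=1)
    i = int(np.argmin(dists))
    return i, float(dists[i])
```

Note that nothing in this matching step uses 3D geometry: a query is compared with whole stored views, which is one reason such methods struggle with occlusion.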

Feature-based geometric approaches

Feature-based approaches work well for objects that have distinctive features. Thus far, objects with good edge features or blob features have been successfully recognized; for examples of detection algorithms, see the Harris affine region detector and SIFT, respectively. For lack of appropriate feature detectors, objects with smooth, textureless surfaces cannot currently be handled by this approach.

Feature-based object recognizers generally work by pre-capturing a number of fixed views of the object to be recognized, extracting features from these views, and then, during recognition, matching these features to the scene while enforcing geometric constraints.
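The matching step can be sketched as a nearest-neighbour search over descriptor vectors. The example below assumes descriptors are rows of NumPy arrays and filters ambiguous matches with a distance-ratio test in the style of [Lowe 2004]; the function name `match_descriptors` and the default ratio of 0.8 are illustrative choices, and geometric verification would still have to follow.

```python
import numpy as np

def match_descriptors(model_desc, scene_desc, ratio=0.8):
    """Match model descriptors to scene descriptors.

    model_desc: (m, d) array, scene_desc: (n, d) array with n >= 2.
    A match (i, j) is kept only if the best scene candidate j is
    clearly closer than the second-best candidate.
    """
    # Pairwise Euclidean distances, shape (m, n).
    dists = np.linalg.norm(
        model_desc[:, None, :] - scene_desc[None, :, :], axis=2
    )
    matches = []
    for i, row in enumerate(dists):
        order = np.argsort(row)
        best, second = order[0], order[1]
        if row[best] < ratio * row[second]:
            matches.append((i, int(best)))
    return matches
```

The surviving pairs are only candidate correspondences; enforcing geometric constraints is what removes the remaining false matches.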

As an example of a prototypical system taking this approach, we outline the method used by [Rothganger et al. 2004], with some detail elided. The method starts by assuming that objects undergo globally rigid transformations. Because smooth surfaces are locally planar, affine-invariant features are appropriate for matching: the paper detects ellipse-shaped regions of interest using both edge-like and blob-like features and, as per [Lowe 2004], finds the dominant gradient direction of the ellipse, converts the ellipse into a parallelogram, and computes a SIFT descriptor on the resulting parallelogram. Color information is also used to improve discrimination over SIFT features alone.

Next, given a number of camera views of the object (24 in the paper), the method constructs a 3D model of the object, containing the 3D spatial position and orientation of each feature. Because the number of views of the object is large, each feature is typically present in several adjacent views. The center points of such matching features correspond, and detected features are aligned along the dominant gradient direction, so the points at (1, 0) in the local coordinate system of the feature parallelogram also correspond, as do the points at (0, 1). Thus, for every pair of matching features in nearby views, three point-pair correspondences are known. Given at least two matching features, a multi-view affine structure from motion algorithm (see [Tomasi and Kanade 1992]) can be used to estimate the point positions (up to an arbitrary affine transformation). The paper of Rothganger et al. therefore selects two adjacent views, uses a RANSAC-like method to select two corresponding pairs of features, and adds new features to the partial model built by RANSAC as long as their error remains below a threshold. Thus, for any given pair of adjacent views, the algorithm creates a partial model of all features visible in both views.
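The two ingredients of this step can be sketched as follows: each parallelogram feature contributes three image points, and stacked point tracks can be factorized into affine cameras and 3D points. This is a minimal rank-3 factorization in the spirit of [Tomasi and Kanade 1992], assuming every point is visible in every view and omitting the metric constraints; the names `feature_points` and `affine_factorization` and the array layouts are assumptions of the example, not the paper's implementation.

```python
import numpy as np

def feature_points(center, axis_u, axis_v):
    """Image points at local coordinates (0,0), (1,0), (0,1) of a
    parallelogram feature with the given center and edge vectors."""
    c = np.asarray(center, dtype=float)
    return np.stack([c, c + axis_u, c + axis_v])

def affine_factorization(tracks):
    """Affine structure from motion by rank-3 factorization.

    tracks: (n_views, n_points, 2) image coordinates of the same
    points in every view. Returns cameras M (2*n_views, 3),
    points X (3, n_points), and translations t (2*n_views, 1),
    so that M @ X + t approximates the measurement matrix.
    The result is defined only up to a 3D affine transformation.
    """
    n_views, n_points, _ = tracks.shape
    # Measurement matrix: rows alternate x and y coordinates per view.
    W = tracks.transpose(0, 2, 1).reshape(2 * n_views, n_points)
    t = W.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(W - t, full_matrices=False)
    root = np.sqrt(s[:3])
    M = U[:, :3] * root          # camera rows
    X = root[:, None] * Vt[:3]   # 3D points
    return M, X, t
```

With two matched features, `feature_points` yields six tracked points per view, which is enough for the factorization as long as the points are not coplanar.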

To produce a unified model, the paper takes the largest partial model and incrementally aligns all smaller partial models to it. A global minimization then reduces the error, and a Euclidean upgrade converts the model's feature positions from 3D coordinates that are unique up to an affine transformation to 3D coordinates that are unique up to a Euclidean motion. At the end of this step, one has a model of the target object, consisting of features projected into a common 3D space.
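Because each partial model is known only up to an affine transformation, one plausible way to align a smaller model with a larger one is to fit a 3D affine map on the features they share. The sketch below does this by linear least squares; it is an assumption about how such an alignment could be done, not a description of the paper's procedure, and `fit_affine_3d` is a hypothetical helper name.

```python
import numpy as np

def fit_affine_3d(src, dst):
    """Least-squares 3D affine map from src to dst.

    src, dst: (n, 3) arrays of corresponding points, n >= 4 and
    not all coplanar. Returns A (3, 3) and b (3,) such that
    dst is approximately src @ A.T + b.
    """
    # Homogeneous coordinates let the translation be solved jointly.
    src_h = np.hstack([src, np.ones((len(src), 1))])
    params, *_ = np.linalg.lstsq(src_h, dst, rcond=None)  # shape (4, 3)
    return params[:3].T, params[3]
```

Applying the fitted map to every feature of the smaller model places it in the coordinate frame of the larger one, after which a joint refinement can reduce the accumulated error.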

To recognize an object in an arbitrary input image, the paper detects features and then uses RANSAC to find the affine projection matrix that best fits the unified object model to the 2D scene. If the RANSAC fit has sufficiently low error, the algorithm both recognizes the object and gives its pose in terms of an affine projection. Under the assumed conditions, the method typically achieves recognition rates of around 95%.
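The recognition step can be sketched as a standard RANSAC loop over candidate 3D-to-2D correspondences. The example assumes the correspondences have already been proposed by descriptor matching, fits a 2x4 affine projection matrix to four sampled pairs per iteration, and counts inliers by reprojection error; the function names, the iteration count, and the pixel tolerance are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def fit_affine_camera(X, x):
    """Least-squares 2x4 affine projection P with x ~ P @ [X; 1].

    X: (n, 3) model points, x: (n, 2) image points, n >= 4.
    """
    Xh = np.hstack([X, np.ones((len(X), 1))])
    P_t, *_ = np.linalg.lstsq(Xh, x, rcond=None)  # shape (4, 2)
    return P_t.T

def ransac_affine_pose(X, x, n_iters=200, tol=2.0, seed=0):
    """Robustly estimate an affine projection from noisy matches.

    Returns (P, inlier_mask). tol is the reprojection error, in
    image units, below which a correspondence counts as an inlier.
    """
    rng = np.random.default_rng(seed)
    Xh = np.hstack([X, np.ones((len(X), 1))])
    best_P = None
    best_inliers = np.zeros(len(X), dtype=bool)
    for _ in range(n_iters):
        # Four point pairs give 8 equations for the 8 unknowns of P.
        sample = rng.choice(len(X), size=4, replace=False)
        P = fit_affine_camera(X[sample], x[sample])
        err = np.linalg.norm(Xh @ P.T - x, axis=1)
        inliers = err < tol
        if inliers.sum() > best_inliers.sum():
            best_P, best_inliers = P, inliers
    if best_inliers.sum() >= 4:
        # Refit on all inliers of the best hypothesis.
        best_P = fit_affine_camera(X[best_inliers], x[best_inliers])
    return best_P, best_inliers
```

A caller would declare the object recognized when the inlier count and residual error pass chosen thresholds, and would read the pose from the returned projection matrix.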

References

* Murase, H. and S. K. Nayar: 1995, "Visual Learning and Recognition of 3-D Objects from Appearance". International Journal of Computer Vision 14, 5–24. [http://www.cse.unr.edu/~bebis/MathMethods/PCA/case_study_pca2.pdf]
* Selinger, A. and R. Nelson: 1999, "A Perceptual Grouping Hierarchy for Appearance-Based 3D Object Recognition." Computer Vision and Image Understanding 76(1), 83–92. [http://citeseer.ist.psu.edu/282716.html]
* Rothganger, F., S. Lazebnik, C. Schmid, and J. Ponce: 2004, "3D Object Modeling and Recognition Using Local Affine-Invariant Image Descriptors and Multi-View Spatial Constraints". ICCV. [http://www-cvr.ai.uiuc.edu/ponce_grp/publication/paper/ijcv04d.pdf]
* Lowe, D.: 2004, "Distinctive image features from scale-invariant keypoints." International Journal of Computer Vision 60(2), 91–110. [http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf]
* Tomasi, C. and T. Kanade: 1992, "Shape and Motion from Image Streams: a Factorization Method." International Journal of Computer Vision 9(2), 137–154. [http://www.cse.huji.ac.il/course/2006/compvis/lectures/tomasiTr92Text.pdf]

See also

* Object recognition
* Feature detection (computer vision)
* Feature descriptor
* RANSAC
* SIFT
* Blob detection
* Harris affine region detector
* Structure from motion


Wikimedia Foundation. 2010.
