Archive for the ‘Computer Vision’ Category

How to solve the Image Distortion Problem


      Theoretically, it is possible to define a lens that introduces no distortion. In practice, however, no lens is perfect, mainly for manufacturing reasons: it is much easier to make a “spherical” lens than a mathematically ideal “parabolic” one, and it is also difficult to mechanically align the lens and the imager exactly. Here we describe the two main lens distortions and how to model them. Radial distortions arise from the shape of the lens, whereas tangential distortions arise from the assembly process of the camera as a whole.

      We start with radial distortion. The lenses of real cameras often noticeably distort the location of pixels near the edges of the imager. This bulging phenomenon is the source of the “barrel” or “fish-eye” effect. Figure 1 gives some intuition as to why radial distortion occurs. With some lenses, rays farther from the center of the lens are bent more than those closer in. A typical inexpensive lens is, in effect, stronger than it ought to be as you get farther from the center. Barrel distortion is particularly noticeable in cheap web cameras but less apparent in high-end cameras, where a lot of effort is put into sophisticated lens systems that minimize radial distortion.

Figure 1.

        For radial distortions, the distortion is 0 at the (optical) center of the imager and increases as we move toward the periphery. In practice, this distortion is small and can be characterized by the first few terms of a Taylor series expansion around r = 0. For cheaper web cameras, we generally use the first two such terms; the first is conventionally called k1 and the second k2. For highly distorted lenses, such as fish-eye lenses, we can use a third radial distortion term, k3.

In general, the radial location of a point on the image will be rescaled according to the following equations:

x_corrected = x (1 + k1 r^2 + k2 r^4 + k3 r^6)
y_corrected = y (1 + k1 r^2 + k2 r^4 + k3 r^6)

Here, (x, y) is the original location (on the imager) of the distorted point, r is its distance from the optical center, and (x_corrected, y_corrected) is the new location as a result of the correction.

The second largest common distortion is known as tangential distortion. This distortion is due to manufacturing defects resulting from the lens not being exactly parallel to the imaging plane.

Tangential distortion is minimally characterized by two additional parameters, p1 and p2, such that:

x_corrected = x + [2 p1 x y + p2 (r^2 + 2 x^2)]
y_corrected = y + [p1 (r^2 + 2 y^2) + 2 p2 x y]

Thus, five (or in some cases six) distortion coefficients are required in total.
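Putting the radial and tangential terms together, the correction can be sketched in a few lines of Python. This is only a minimal illustration of the model above; the coefficient values in the example are made up, not taken from a real calibration:

```python
def correct_distortion(x, y, k1, k2, p1, p2, k3=0.0):
    """Apply the radial + tangential correction model to normalized
    image coordinates (x, y), measured from the optical center."""
    r2 = x ** 2 + y ** 2  # r squared
    radial = 1 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_corr = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x ** 2)
    y_corr = y * radial + p1 * (r2 + 2 * y ** 2) + 2 * p2 * x * y
    return x_corr, y_corr

# A point at the optical center is unaffected...
print(correct_distortion(0.0, 0.0, -0.2, 0.05, 0.001, 0.001))  # → (0.0, 0.0)
# ...while points farther out are rescaled by the radial polynomial:
# with k1 = -0.2 only, a point at x = 0.5 moves to 0.5 * (1 - 0.2 * 0.25) = 0.475.
print(correct_distortion(0.5, 0.0, -0.2, 0.0, 0.0, 0.0))
```

Note that the correction is 0 at r = 0 and grows with r, exactly as described above.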

Distortion example

      To illustrate these theoretical points, I am going to show a couple of images taken with the same camera. The first image shows the picture with the distortion effect, whereas the second shows the result of applying “undistortion” functions.

Figure 2. The image on the left shows a distorted image, while the right image shows an image in which the distortion has been corrected through the methods explained.


Image Local Feature Descriptors in Augmented Reality


          As computer vision experts, we deal almost daily with the information “hidden” in images in order to make it “visible” and useful for our algorithms. In this entry I want to talk about the image matching process. Image matching is the technique used in computer vision to find enough patches or strong features in two (or more) images to be able to state that one of these images is contained in the other, or that both are the same image. Several approaches have been proposed in the literature for this purpose, but we are going to focus on local feature approaches.
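To make the matching step concrete, here is a minimal sketch of descriptor matching by nearest neighbour with a ratio test, a common way to decide whether two local features correspond. The descriptor vectors below are tiny synthetic stand-ins, not the output of a real detector:

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Match each descriptor in desc_a to its nearest neighbour in desc_b,
    keeping only matches whose best distance clearly beats the second best
    (Lowe's ratio test), which discards ambiguous correspondences."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))
    return matches

# Synthetic 4-D descriptors: a[0] clearly matches b[0],
# while a[1] is ambiguous between b[1] and b[2] and is rejected.
a = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0]])
b = np.array([[1.0, 0.1, 0.0, 0.0],
              [0.5, 0.45, 0.0, 0.0],
              [0.5, 0.55, 0.0, 0.0]])
print(match_descriptors(a, b))  # → [(0, 0)]
```

Real systems apply exactly this logic, just with 64- or 128-dimensional descriptors produced by detectors such as SIFT or SURF.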

     Local feature representations of images are widely used for matching and recognition in the field of computer vision and, lately, also in Augmented Reality applications, where augmented information is added on top of the real world. Robust feature descriptors such as SIFT, SURF, FAST, Harris-Affine or GLOH (to name some examples) have become a core component of these kinds of applications. The main idea is to first detect features and then compute a set of descriptors for those features. One important thing to keep in mind is that these methods will later be ported to mobile devices, where they can lead to very heavy processing that does not reach real-time rates. For this reason, several techniques have recently been developed so that the chosen feature detection and descriptor extraction methods can be implemented on mobile devices with real-time performance. But that is another step I do not want to focus on in this entry.

Local Features

           But what is a local feature? A local feature is an image pattern that differs from its immediate neighborhood. This difference is usually associated with a change in an image property; the most commonly considered properties are texture, intensity and color.
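As a rough illustration of this idea, the Harris detector mentioned above scores each pixel by how strongly the intensity changes in two independent directions: edges change in one direction, corners in two. The following is a stripped-down NumPy sketch (simple finite differences and a flat 3x3 window instead of the usual Gaussian weighting):

```python
import numpy as np

def harris_response(img, k=0.05):
    """Harris corner response: large positive values mark corners,
    i.e. strong intensity change in two independent directions."""
    # Image gradients via central finite differences.
    Ix = np.zeros_like(img)
    Iy = np.zeros_like(img)
    Ix[:, 1:-1] = (img[:, 2:] - img[:, :-2]) / 2.0
    Iy[1:-1, :] = (img[2:, :] - img[:-2, :]) / 2.0

    # Structure tensor entries, summed over a flat 3x3 window.
    def window_sum(a):
        s = np.zeros_like(a)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                s += np.roll(np.roll(a, dy, axis=0), dx, axis=1)
        return s

    Sxx = window_sum(Ix * Ix)
    Syy = window_sum(Iy * Iy)
    Sxy = window_sum(Ix * Iy)
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - k * trace ** 2

# A white square on a black background: the response is negative along
# the edges, near zero in flat regions, and peaks at the square's corners.
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
r = harris_response(img)
```

The point of the sketch is only the intuition: a pixel is a good local feature precisely when it differs from its neighborhood in more than one direction.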


Virtual Buttons in Augmented Reality. Part II


      The next step, after users are able to interact with the augmented information through touch-screen events, is to provide them with a more realistic interaction through touch events on the real object. The idea behind this is simple: in addition to the already well-known objects placed in the real world to carry out image or object tracking, over which a video or a 3D object is usually layered, we suggest adding buttons to the scene that the user will be able to “virtually” press. Over the image to be tracked, one or more hot spots or hot areas are defined, and several actions can be associated with them.
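The exact mechanism depends on the implementation, but one common way to realize such a hot area is occlusion detection: if the camera's view of the area suddenly differs from its unoccluded reference appearance (for example, because a finger covers it), the button is considered pressed. A minimal sketch of that idea, with made-up intensity values:

```python
import numpy as np

def button_pressed(frame_region, reference_region, threshold=0.25):
    """Report a 'press' when the hot area differs enough from its
    unoccluded reference appearance (e.g. a finger covers it)."""
    diff = np.abs(frame_region.astype(float) - reference_region.astype(float))
    return bool(diff.mean() > threshold)

reference = np.full((10, 10), 0.8)   # unoccluded hot area (bright paper)
occluded = np.full((10, 10), 0.2)    # the same area covered by a finger

print(button_pressed(reference, reference))  # → False
print(button_pressed(occluded, reference))   # → True
```

A real system would compare the regions after warping the camera frame back onto the tracked target, so the hot area is always sampled in the target's own coordinates.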

      Unlike the previous entry, in which we also talked about these virtual buttons but the video showed them placed in a static way, the next video shows these virtual buttons as an addition to the 3D image tracking. With the introduction of this feature into the real scene, the user is able to press these buttons directly, as they would on a real device; the difference is that here the buttons are augmented objects layered over the real environment instead of real objects, but the behavior is the same.


Tracking the World in AR Scenarios


The general goal of AR is to integrate and supplement the real world with additional contextual information. In light of this, the capability of understanding the surroundings in an automatic fashion is a fundamental and crucial need. Among others, two main tasks can be identified in the processing pipeline enabling this capability in a machine: I) object recognition, and II) tracking object motion across successive time instants. In particular, due to the rapid evolution of AR needs, efficient and reliable tracking techniques are becoming essential.

Fig. 1 - 3D information estimation from 2D projections

Considering that the image is the 2D projection of a 3D scene, strictly speaking, tracking in image sequences is always 2-dimensional. However, injecting prior information about the 3D geometry of the surroundings can enable the estimation of 3D information from the observed 2D motion: this particular processing is usually referred to as 3D tracking. Relying on this concept, several interesting tasks can be performed, such as estimating 3D object trajectories and object pose, inferring the camera's 3D motion, or deriving the 3D structure of the scene. The theoretical complexity and computational demands of these tasks are far from trivial, and particularly sophisticated methods have to be implemented in order to ensure a good trade-off between accuracy and timing performance. In fact, in addition to “regular” 2D object/feature tracking, a further processing phase is required to derive or fit the 3D information using the prior knowledge. This is particularly true when dealing with devices with medium/low processing and memory capabilities, like mobile phones.

The exploitation of 3D information is essential in AR scenarios for different tasks involving both planar surfaces (e.g., printed pictures) and full 3D solid objects (e.g., boxes, cylinders). In particular, the former can be considered a simplified case of the latter: when tracking features over a planar target, the influence of object self-occlusions can be ignored with little impact on the overall framework performance. As an example, let's consider the tracking of a book cover. The target is planar, and its projection in the camera frame is also planar. This allows a complete modelling of the relationship between the target object and its projection in the image in terms of a simple homography matrix. Then, according to the requirements, one can extract 3D information from the homography, allowing 3D pose estimation of the target in space.
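As an illustration of this relationship, the homography can be estimated from four (or more) point correspondences with the standard direct linear transform (DLT). The sketch below uses clean synthetic points and skips the coordinate normalization and noise handling a real implementation would need:

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate H (3x3, defined up to scale) such that dst ~ H @ src,
    from >= 4 point correspondences, via the direct linear transform:
    each correspondence contributes two linear constraints on H's
    entries, and the solution is the null vector of the stacked system."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The null space of A is the last right-singular vector.
    _, _, Vt = np.linalg.svd(np.array(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]  # normalize so the bottom-right entry is 1

# Four corners of a unit square (the planar target, e.g. a book cover)
# mapped into the image by a known scale of 2 and shift of (2, 3).
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(2, 3), (4, 3), (4, 5), (2, 5)]
H = estimate_homography(src, dst)
print(np.round(H, 3))
```

Once H is known, the camera pose relative to the planar target can be decomposed from it, given the camera's intrinsic calibration.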

Fig. 2 - Iterative 3D pose estimation process


Computer Vision for Augmented Reality


Augmented Reality (AR) is an exponentially growing area within the so-called Virtual Environment (VE) field. While VE, or Virtual Reality (VR), provides a completely immersive experience in a fully synthetic scenario, where the user cannot see or interact with the real surroundings, AR allows the user to see the real world, placing virtual objects in it or superimposing virtual information on it. In other words, AR does not substitute reality, but integrates and supplements it. It takes real objects as a foundation and adds contextual information that helps the user deepen their understanding of the subject. The potential application domains of this technology are vast, including medicine, education and training, manufacturing and repair, annotation and visualization, path planning, gaming and entertainment, and the military. The related market is exploding: according to recent studies, the installed base of AR-capable mobile devices has grown from 8 million in 2009 to more than 100 million in 2010, producing a global revenue that is estimated to reach $1.5 billion by 2015.

Figure 1. Different AR scenarios

But, beyond the tangible (or visible) AR results and the market predictions, one can ask: what is the enabling technology that effectively allows augmenting our reality? In other words, beyond the virtual information rendered, how can the device, or the application, be aware of the world and select the appropriate content to present to the user?

From a generic point of view, this task is far from trivial: while a human being reaches an understanding of the surroundings almost unconsciously and easily, in fractions of a second and in almost any scenario, for a computer-based machine things are far more complicated. What is done in practice is to constrain the scenario and sense the state of the world through multi-modal sensors (i.e., images, videos, sounds, inertial sensors): the resulting information flow is then fused, merged, and processed by so-called Artificial Intelligence (AI) algorithms that try to give a plausible explanation of the provided data.

Very close to AI, often intersecting with and relying on it, and strongly related to AR application development, is Computer Vision. Since the main cue for current AR systems is artificial vision, this field has gained increasing importance in the AR context. While AI aims at understanding the surroundings from generic low-level sensor data, Computer Vision is concerned with duplicating, or emulating, the capabilities of the Human Visual System (HVS) by reconstructing and interpreting a 3D scene only from its 2D projections, or images. Although it may seem simple, any task related to Computer Vision can become arbitrarily complex: this is due to the intrinsic nature of the problem, a so-called ill-posed inverse problem. In other words, a 3D scene understanding has to be reached from its 2D projections, which have lost one spatial dimension. Furthermore, given the interactivity constraints of AR, these tasks have to be performed in real time, or near real time.

Figure 2. Typical Computer Vision processing pipeline


Markerless Augmented Reality


          Image or object recognition and 3D object tracking in Augmented Reality are not new concepts; they have so far been enabled mainly by visual markers, which have been widely used in AR applications in recent years. In most of these applications, the performance of the AR system depends heavily on the tracking method used for visual marker detection and pose estimation, according to the particular application. The design of the visual markers can differ from one system to another, but their use limits interactivity and is constrained to a range of photos or objects encapsulated within a border to create the marker. Therefore, in order to use this approach, the visual markers have to be printed beforehand and also kept for future use. Unlike marker-based augmented reality systems, markerless augmented reality systems can use any part of the real environment as a target to be tracked in order to place virtual objects.

      An example of AR using visual markers.

          With the new advances in mobile technologies, both in hardware and software, new markerless approaches, such as those based on natural features, broke into the Augmented Reality world, not only allowing real objects to be used as targets instead of these old and ugly markers, but also overcoming some of their limitations.

         In order to perform object tracking, markerless augmented reality systems rely on natural features instead of fiducial markers. Therefore, there are no intrusive markers in the environment that are not really part of it. Furthermore, markerless augmented reality can count on specialized and robust trackers that are already available. Another advantage of markerless systems is the possibility of extracting characteristics and information from the environment that may later be used. Among the disadvantages, however, is that the tracking and registration techniques become more complex.

An example of Markerless AR (MAR).