Tracking the World in AR Scenarios

The general goal of AR is to integrate and supplement the real world with additional contextual information. In light of this, the capability of understanding the surroundings automatically is a fundamental and crucial need. Among others, two main tasks can be identified in the processing pipeline that enables this capability in a machine: I) object recognition, and II) tracking object motion across successive time instants. In particular, due to the rapid evolution of AR needs, efficient and reliable tracking techniques are becoming essential.

Fig. 1 - 3D information estimation from 2D projections

Considering that an image is the 2D projection of a 3D scene, tracking in image sequences is, strictly speaking, always two-dimensional. However, injecting prior information about the 3D geometry of the surroundings enables the estimation of 3D information from the observed 2D motion: this kind of processing is usually referred to as 3D tracking. Relying on this concept, several interesting tasks can be performed, such as estimating 3D object trajectories and object pose, inferring the camera's 3D motion, or deriving the 3D structure of the scene. The theoretical complexity and the computational demands of these tasks are far from trivial, and particularly sophisticated methods have to be implemented in order to ensure a good trade-off between accuracy and timing performance. In fact, beyond “regular” 2D object/feature tracking, a further processing phase is required in order to derive/fit the 3D information using the prior knowledge. This is particularly true when dealing with devices with medium/low processing and memory capabilities, like mobile phones.
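The relationship between the 3D scene and its 2D projection can be made concrete with the pinhole camera model. The sketch below (intrinsic parameters and the point are hypothetical example values, not from this post) shows why raw image tracking is inherently 2D: every 3D point collapses onto the image plane through a perspective divide.

```python
import numpy as np

def project_point(K, R, t, X):
    """Project a 3D world point X into 2D pixel coordinates (pinhole model)."""
    x_cam = R @ X + t            # world -> camera coordinates
    x_img = K @ x_cam            # camera -> homogeneous image coordinates
    return x_img[:2] / x_img[2]  # perspective divide -> 2D pixel

# Hypothetical camera intrinsics: focal length 800 px, principal point (320, 240)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                    # camera aligned with the world axes
t = np.zeros(3)

X = np.array([0.1, -0.05, 2.0])  # a point 2 m in front of the camera
pixel = project_point(K, R, t, X)
print(pixel)                     # -> [360. 220.]
```

The depth of X is lost in the division by `x_img[2]`; recovering it is exactly the extra information that 3D tracking injects via prior knowledge of the scene geometry.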

The exploitation of 3D information is essential in AR scenarios for different tasks involving both planar surfaces (e.g., printed pictures) and full 3D solid objects (e.g., boxes, cylinders). In particular, the former can be considered a simplified case of the latter: when tracking features on a planar target, the influence of object self-occlusions can be ignored with little impact on overall framework performance. As an example, consider tracking a book cover. The target is planar, and its projection in the camera frame is also planar. This allows the relationship between the target object and its projection in the image to be completely modelled by a simple homography matrix. Then, according to the requirements, 3D information can be extracted from the homography, allowing a 3D pose estimation of the target in space.
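Extracting pose from the homography can be sketched as follows. For a plane at Z=0, the homography satisfies H ∝ K·[r1 r2 t], so with known intrinsics K the first two rotation columns and the translation can be read off K⁻¹H up to scale. This is a standard decomposition (the variable names and test values are illustrative; a production version would also handle sign ambiguity and noise more carefully):

```python
import numpy as np

def pose_from_homography(K, H):
    """Recover the pose (R, t) of a planar target from its homography H.

    For a target plane at Z=0, H is proportional to K @ [r1 r2 t], so the
    first two rotation columns and the translation come from K^-1 @ H.
    """
    A = np.linalg.inv(K) @ H
    # Normalise the scale: r1 and r2 must be unit vectors
    lam = 2.0 / (np.linalg.norm(A[:, 0]) + np.linalg.norm(A[:, 1]))
    r1, r2, t = lam * A[:, 0], lam * A[:, 1], lam * A[:, 2]
    r3 = np.cross(r1, r2)               # complete the rotation basis
    R = np.column_stack([r1, r2, r3])
    U, _, Vt = np.linalg.svd(R)         # re-orthogonalise against noise
    return U @ Vt, t

# Quick sanity check with hypothetical intrinsics and a known pose
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
theta = 0.3
R_true = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(theta), 0.0, np.cos(theta)]])
t_true = np.array([0.1, 0.2, 2.0])
H = K @ np.column_stack([R_true[:, 0], R_true[:, 1], t_true])

R_est, t_est = pose_from_homography(K, H)
```

In practice H itself would come from matched features between the reference image of the cover and the live camera frame, and the recovered (R, t) directly gives the 3D pose of the book in camera space.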

Fig. 2 - Iterative 3D pose estimation process

On the contrary, 3D objects moving freely in space significantly change their appearance according to their instantaneous pose, and require correct handling of self-occlusions. Consider, in this case, a cube. As the camera moves around it, the projected image of the target changes, showing different faces according to the relative position of camera and object. Moreover, as the motion evolves, new parts of the target continually appear on one side and disappear on the other. In this scenario, more complex solutions relying on perspective geometry are required. Iterative methods are usually applied in order to estimate the correct pose of the target. The general idea is to find the pose matrix that minimizes the so-called re-projection error (i.e., the difference between the projection of the 3D model of the target in a given pose and the image captured by the camera).
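The iterative minimization of the re-projection error can be illustrated with a deliberately simplified sketch: a Gauss-Newton loop that refines only the translation of a cube model, with the rotation held fixed at identity (real trackers optimise the full 6-DOF pose; the intrinsics and model points below are made-up example values).

```python
import numpy as np

# Hypothetical camera intrinsics
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# Toy 3D model: the eight corners of a unit cube centred at the origin
X = np.array([[x, y, z] for x in (-0.5, 0.5)
                        for y in (-0.5, 0.5)
                        for z in (-0.5, 0.5)])

def project(K, t, X):
    """Project model points X (N x 3) with identity rotation, translation t."""
    x = (X + t) @ K.T
    return x[:, :2] / x[:, 2:3]

def refine_translation(K, X, observed, t0, iters=20, eps=1e-6):
    """Gauss-Newton steps minimising the re-projection error over t."""
    t = np.array(t0, dtype=float)
    for _ in range(iters):
        r = (project(K, t, X) - observed).ravel()  # stacked 2D residuals
        J = np.zeros((r.size, 3))                  # numerical Jacobian
        for j in range(3):
            dt = np.zeros(3); dt[j] = eps
            J[:, j] = ((project(K, t + dt, X) - observed).ravel() - r) / eps
        t -= np.linalg.solve(J.T @ J, J.T @ r)     # normal-equations update
    return t

t_true = np.array([0.1, -0.2, 3.0])
observed = project(K, t_true, X)                   # "captured" image points
t_est = refine_translation(K, X, observed, t0=[0.0, 0.0, 2.5])
```

Each iteration re-projects the model with the current pose guess, measures the pixel residuals against the observed points, and solves a small linear system for the update, which is the essence of the iterative process sketched in Fig. 2.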

Other techniques within the family of 3D object tracking aim at recovering the 3D structure of the surroundings (i.e., structure from motion), possibly while also tracking the 3D camera motion (i.e., SLAM). The idea in this case is to exploit a set of images of the same scene captured from different points of view in order to rebuild the 3D structure of the scene. The underlying technology tries to emulate the human capability of understanding the 3D world, where no a-priori model is required. In fact, the 3D structure model is built on the fly as the camera moves, and it is delivered as a point cloud. Several techniques are available, all of them involving significant theoretical and computational effort.
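The core geometric step behind building such a point cloud is triangulation: given the same feature observed from two known camera poses, its 3D position is the intersection of the two viewing rays. A minimal linear (DLT) sketch, assuming the two projection matrices are already known (in a real SfM/SLAM pipeline they are themselves estimated; all values here are illustrative):

```python
import numpy as np

# Hypothetical intrinsics shared by both views
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# Two camera matrices: the second camera is shifted 0.5 units along x
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two 2D projections."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)   # null-space of A gives the homogeneous point
    Xh = Vt[-1]
    return Xh[:3] / Xh[3]

def px(P, X):
    """Helper: project a 3D point with camera matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

X_true = np.array([0.2, 0.1, 4.0])
X_est = triangulate(P1, P2, px(P1, X_true), px(P2, X_true))
```

Repeating this for every matched feature across many views, while jointly refining the camera poses, is what yields the point-cloud reconstruction described above.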

Also in this case the general idea is quite intuitive but, as often happens, the simpler the concept, the harder the practice.

Do not forget to check out our AR Browser and Image Matching SDKs.
