The general goal of augmented reality (AR) is to integrate and supplement the real world with additional contextual information. To this end, the ability to understand the surroundings automatically is a fundamental requirement. Two main tasks can be identified in the processing pipeline that gives a machine this ability: (I) object recognition, and (II) tracking object motion over successive time instants. In particular, given the rapid evolution of AR requirements, efficient and reliable tracking techniques are becoming essential.
Fig. 1 - 3D information estimation from 2D projections
Since an image is the 2D projection of a 3D scene, tracking in image sequences is, strictly speaking, always 2-dimensional. However, injecting prior information about the 3D geometry of the surroundings makes it possible to estimate 3D information from the observed 2D motion: this kind of processing is usually referred to as 3D tracking. Building on this concept, several interesting tasks can be performed, such as estimating 3D object trajectories and object pose, inferring the 3D motion of the camera, or deriving the 3D structure of the scene. The theoretical complexity and the computational demand of these tasks are far from trivial, and sophisticated methods are needed to ensure a good trade-off between accuracy and timing performance. In fact, in addition to "regular" 2D object/feature tracking, a further processing phase is required to derive or fit the 3D information using the prior knowledge. This matters most on devices with medium or low processing and memory capabilities, such as mobile phones.
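To illustrate why a prior is needed, the following minimal sketch uses an ideal pinhole camera. The focal length `f`, the principal point `(cx, cy)`, and both function names are assumptions chosen for illustration only. The sketch shows that two different 3D points on the same viewing ray project to the same pixel, and that assuming a known depth (one possible form of prior knowledge) makes the 3D point recoverable.

```python
def project(point, f=800.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-frame 3D point (X, Y, Z) to a pixel (u, v).

    The intrinsics f, cx, cy are hypothetical values, not taken from the text.
    """
    X, Y, Z = point
    return (f * X / Z + cx, f * Y / Z + cy)


def back_project_on_plane(pixel, depth, f=800.0, cx=320.0, cy=240.0):
    """Recover a 3D point from a pixel, given the prior that it lies on Z = depth."""
    u, v = pixel
    return ((u - cx) * depth / f, (v - cy) * depth / f, depth)


# Two distinct 3D points on the same viewing ray (the second is twice as far).
near = (0.1, 0.05, 1.0)
far = (0.2, 0.10, 2.0)

pixel_near = project(near)
pixel_far = project(far)
# Both points land on (approximately) the same pixel: depth is lost in projection.

# With the prior "the point lies at depth 2.0", the 3D position is recoverable.
recovered = back_project_on_plane(pixel_near, depth=2.0)
print(pixel_near, pixel_far, recovered)
```

Here the prior is a known depth; in practice it may equally be a known object model or a known planar target, but the principle is the same: the 2D observation alone does not determine the 3D point.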
Exploiting 3D information is essential in AR scenarios for tasks involving both planar surfaces (e.g., printed pictures) and full 3D solid objects (e.g., boxes, cylinders). The former can be considered a simplified case of the latter: when tracking features over a planar target, object self-occlusions can be ignored with little impact on overall performance. As an example, consider tracking a book cover. The target is planar, so the relationship between points on the target and their projections in the image is a plane-to-plane mapping that is fully modelled by a single 3×3 homography matrix. Then, according to the requirements, and provided the camera intrinsic parameters are known, one can extract 3D information from the homography and thus estimate the 3D pose of the target in space.
Fig. 2 - Iterative 3D pose estimation process
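As a rough illustration of the book-cover example, the sketch below recovers a rotation and translation from a homography using the standard relation H ~ K [r1 r2 t] for a target lying on the plane Z = 0. It is a sketch under stated assumptions, not the method used in the text: the intrinsics `f`, `cx`, `cy` (square pixels, no skew) and all function names are hypothetical, and the homography is assumed noise-free.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def homography_from_pose(R, t, f, cx, cy, scale=1.0):
    """Build H = scale * K [r1 r2 t] for a planar target on Z = 0 (test helper)."""
    cols = [[R[i][0] for i in range(3)], [R[i][1] for i in range(3)], list(t)]
    H = [[0.0] * 3 for _ in range(3)]
    for j, c in enumerate(cols):
        H[0][j] = scale * (f * c[0] + cx * c[2])
        H[1][j] = scale * (f * c[1] + cy * c[2])
        H[2][j] = scale * c[2]
    return H


def pose_from_homography(H, f, cx, cy):
    """Recover (R, t) from H ~ K [r1 r2 t], assuming a noise-free homography."""
    # Apply K^-1 analytically for K = [[f, 0, cx], [0, f, cy], [0, 0, 1]].
    M = [[(H[0][j] - cx * H[2][j]) / f for j in range(3)],
         [(H[1][j] - cy * H[2][j]) / f for j in range(3)],
         [H[2][j] for j in range(3)]]
    cols = [[M[i][j] for i in range(3)] for j in range(3)]
    # H is defined up to scale: normalise so that the first column has unit length.
    scale = 1.0 / math.sqrt(sum(c * c for c in cols[0]))
    # Resolve the sign ambiguity by placing the target in front of the camera.
    if cols[2][2] < 0:
        scale = -scale
    r1 = [scale * c for c in cols[0]]
    r2 = [scale * c for c in cols[1]]
    t = [scale * c for c in cols[2]]
    r3 = cross(r1, r2)  # third rotation axis completes the right-handed frame
    R = [[r1[i], r2[i], r3[i]] for i in range(3)]
    return R, t


# Hypothetical ground truth: 30 degree rotation about the Y axis, target 2 units away.
a = math.radians(30.0)
R_true = [[math.cos(a), 0.0, math.sin(a)],
          [0.0, 1.0, 0.0],
          [-math.sin(a), 0.0, math.cos(a)]]
t_true = [0.1, -0.05, 2.0]
f, cx, cy = 800.0, 320.0, 240.0

# An arbitrary (even negative) scale factor must not affect the recovered pose.
H = homography_from_pose(R_true, t_true, f, cx, cy, scale=-3.7)
R_est, t_est = pose_from_homography(H, f, cx, cy)
```

With a homography estimated from real, noisy feature matches, the first two recovered columns are generally not exactly orthonormal, so an additional step that projects the result onto a valid rotation (and typically an iterative refinement) would be needed.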