mirella melo

Stereo camera: recovering the scene's depth

Updated: May 23, 2023

Acquiring a two-dimensional image loses one dimension: depth. How can we recover this information, so fundamental to both living beings and machines?


Beings with eyes infer the distance of objects from several cues. Research points to textures, shapes, and lighting variation feeding a matching mechanism that associates the same "point" of light seen by each eye. In addition, cues such as the relative size of objects, their motion relative to the rest of the scene, linear perspective, and the individual's own experience help in this estimation. However, exactly how our brain gathers and interprets this information is still not completely understood [1].


In terms of machines, art imitated life. Inspired by the visual system of humans (and other beings) came the so-called stereo cameras: an arrangement of two or more image sensors at fixed positions relative to each other. From there, depth estimation becomes possible through the method known as triangulation. For now, let's consider a two-camera stereoscopic system. Recovering depth with such a system depends on two things:

  • the definition of system parameters

  • the correct association of corresponding points between two images

The system parameters can be obtained through the calibration process, most commonly performed by capturing images of a chessboard pattern. You can read more about system parameters here. Regarding point association, several techniques are available, but it remains a widely studied problem.
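As an illustration, below is a minimal sketch of how the intrinsic parameters of a single camera could be estimated from chessboard captures. The use of OpenCV, the 9×6 board size, and the file names are assumptions made for this example, not something prescribed in this post; calibrating the full stereo pair (including the baseline) additionally involves cv2.stereoCalibrate.

```python
import cv2
import numpy as np

pattern = (9, 6)  # inner corners per row/column (assumed board size)

# 3D coordinates of the board corners in the board's own frame (z = 0 plane).
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in ["board_01.png", "board_02.png"]:  # hypothetical capture files
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)     # known board geometry
        img_points.append(corners)  # where it was detected in the image

# Intrinsic matrix (focal length, principal point) and distortion coefficients.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```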


Figure 1 illustrates the scene depth recovery process. Given rectified images, stereo matching algorithms find matching pixels and generate a disparity map. Then, the depth of these points is determined using the triangulation method - described in Stereo Camera: Triangulation Explained.

Figure 1: Depth estimation process from an image pair. Source: author.
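As a concrete illustration of the pipeline in Figure 1, the sketch below computes a disparity map from an already rectified pair using OpenCV's Semi-Global Block Matching. The library, the matcher parameters, and the file names are assumptions; any stereo matching algorithm could take this place.

```python
import cv2

# Rectified stereo pair (hypothetical file names).
left = cv2.imread("left_rect.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_rect.png", cv2.IMREAD_GRAYSCALE)

# Semi-Global Block Matching, one of many stereo matching algorithms.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=5)

# compute() returns fixed-point disparities scaled by 16; divide to get pixels.
disparity = matcher.compute(left, right).astype("float32") / 16.0
```

With the disparity map in hand, the triangulation step described in the Depth section converts each disparity value into a depth value.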


Stereo matching

The most complex and error-prone task in Figure 1 is finding the corresponding points between an image pair. This challenge is known as stereo matching and is the subject of a broad line of research. When the image pair is horizontally rectified, the search for corresponding points can be restricted to a single horizontal line - the horizontal epipolar constraint - which greatly reduces the search space. Figure 2 illustrates this restriction: the tip of the heart, taken from the reference image, is found in the left image at pixel 4 of the highlighted horizontal line.

Figure 2: the horizontal epipolar constraint ensures that corresponding points are located on corresponding lines - in an ideal world, of course. Source: author.

This epipolar constraint is conceptualized within epipolar geometry, and you can find out more about this topic in Epipolar Geometry [coming soon].
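To make the constraint concrete, here is a toy sketch of block matching with the sum of absolute differences (SAD): for a reference pixel in the left image, candidate matches in the right image are examined only along the same row. This is purely illustrative, not the method used in this post; real matchers add cost aggregation, regularization, and sub-pixel refinement.

```python
import numpy as np

def match_along_scanline(left, right, i, j, max_shift=64, half=2):
    """Find the column in `right` that best matches pixel (i, j) of `left`,
    searching only along row i (the horizontal epipolar constraint)."""
    ref = left[i - half:i + half + 1, j - half:j + half + 1].astype(np.float32)
    best_col, best_cost = j, np.inf
    for d in range(max_shift):          # candidate shifts along the same row
        col = j - d
        if col - half < 0:              # stop at the image border
            break
        cand = right[i - half:i + half + 1,
                     col - half:col + half + 1].astype(np.float32)
        cost = np.abs(ref - cand).sum()  # SAD matching cost
        if cost < best_cost:
            best_cost, best_col = cost, col
    return best_col
```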


Disparity map

Stereo matching algorithms, which seek to identify equivalent pixels, produce an image known as a disparity map. But what is disparity? Considering an image represented by a 2D matrix, the variables 𝑢𝐿 and 𝑢𝑅 in Figure 3 represent the column positions of corresponding points (assuming the horizontal epipolar constraint). The difference between the 𝑢𝐿 and 𝑢𝑅 values is called the disparity:


𝑑 = 𝑢𝐿 - 𝑢𝑅


Figure 3: Disparity illustration: the object captured by both images is horizontally displaced by 𝑑 pixel units between them.


We arrive at the disparity map by finding the disparity for all (or some) pixels in the reference image. Usually, the left image is defined as the reference image. In this case, the disparity map stores the number of pixels one must shift to find the corresponding pixel in the right image. In other words: given a pair of undistorted and rectified images, a point 𝑝(𝑖, 𝑗) in the reference image has its match in the other image at 𝑝(𝑖, 𝑗−𝑑), where 𝑑 is the disparity value.
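In code, that relation is a one-liner; the sketch below assumes a dense disparity map aligned with the left (reference) image, and the function name is illustrative.

```python
def corresponding_pixel(disparity_map, i, j):
    # The match of reference pixel (i, j) lies on the same row,
    # shifted d columns to the left.
    d = int(round(float(disparity_map[i, j])))
    return (i, j - d)
```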


Depending on your application, corresponding points can be computed for every pixel in the image or only for particular pixels. For example, suppose you want to reconstruct a 3D object from pictures. In that case, it is important to have a depth estimate for the entire scene. On the other hand, to estimate the camera's position in an environment, just a few reference pixels are enough. The map can therefore be sparse or dense, as illustrated in Figure 4. In the first case, only the pixels highlighted in green carry disparity information; in a dense map, every image pixel has a corresponding disparity value.

Figure 4: Disparity map types. A sparse map contains the disparity only at specific points (e.g., the points highlighted in green); a dense map contains a disparity value for every pixel in the image. Source: author [2].


Depth

Finally, with the disparity map and the camera parameters known, it is possible to calculate the depth of each pixel. The required equation is shown below, where 𝑓 is the focal length of the cameras, 𝑏 is the baseline (the distance between the cameras' centers of projection), 𝑢𝐿 and 𝑢𝑅 are the positions of the corresponding points, and 𝑑 is the disparity. If you want to understand how to arrive at this equation, its derivation can be found in Stereo Camera: Triangulation Explained.


𝑍 = 𝑓 · 𝑏 / 𝑑 = 𝑓 · 𝑏 / (𝑢𝐿 − 𝑢𝑅)
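A minimal sketch of this last step, assuming 𝑓 is given in pixels, 𝑏 in meters, and the disparity map comes from the matching step (the names are illustrative):

```python
import numpy as np

def depth_from_disparity(disparity, f, b):
    """Apply Z = f * b / d to every pixel with a valid (positive) disparity."""
    depth = np.full(disparity.shape, np.inf, dtype=np.float32)
    valid = disparity > 0                      # d = 0 means no match / point at infinity
    depth[valid] = (f * b) / disparity[valid]  # triangulation result
    return depth
```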


[1] CYGANEK, Boguslaw; SIEBERT, J. Paul. An introduction to 3D computer vision techniques and algorithms. John Wiley & Sons, 2011.

[2] MELO, Mirella Santos Pessoa de. Mapeamento de região navegável a partir de um sistema SLAM e segmentação de imagem (Mapping of navigable regions from a SLAM system and image segmentation). 2021. Master's dissertation. Universidade Federal de Pernambuco.
