Multi-View Stereo: A Tutorial
Yasutaka Furukawa
Washington University in St. Louis
furukawa@wustl.edu
Carlos Hernández
Google Inc.
carloshernandez@google.com
Contents
1 Introduction
1.1 Imagery collection
1.2 Camera projection models
1.3 Structure from Motion
1.4 Bundle Adjustment
1.5 Multi-View Stereo
2 Multi-view Photo-consistency
2.1 Photo-consistency measures
2.2 Visibility estimation in state-of-the-art algorithms
3 Algorithms: From Photo-Consistency to 3D Reconstruction
3.1 Depthmap Reconstruction
3.2 Point-cloud Reconstruction
3.3 Volumetric data fusion
3.4 MVS Mesh Refinement
4 Multi-view Stereo and Structure Priors
4.1 Departure from Depthmap to Planemap
4.2 Departure from Planes to Geometric Primitives
4.3 Image Classification for Structure Priors
5 Software, Best Practices, and Successful Applications
5.1 Software
5.2 Best practices for Image Acquisition
5.3 Successful Applications
6 Limitations and Future Directions
6.1 Limitations
6.2 Open Problems
6.3 Conclusions
Acknowledgements
References
Abstract
This tutorial presents a hands-on view of the field of multi-view stereo
with a focus on practical algorithms. Multi-view stereo algorithms are
able to construct highly detailed 3D models from images alone. They
take a possibly very large set of images and construct a plausible 3D
geometry that explains the images under some reasonable assumptions,
the most important being scene rigidity. The tutorial frames the
multi-view stereo problem as an image/geometry consistency optimization
problem. It describes in detail its two main ingredients: robust
implementations of photometric consistency measures, and efficient
optimization algorithms. It then presents how these ingredients are
used by some of the most successful algorithms, applied in real
applications, and deployed as products in industry. Finally, it describes
more advanced approaches that exploit domain-specific knowledge such
as structural priors, and gives an overview of the remaining challenges
and future research directions.
1 Introduction
Reconstructing 3D geometry from photographs is a classic Computer
Vision problem that has occupied researchers for more than 30 years. Its
applications range from 3D mapping and navigation to online shopping,
3D printing, computational photography, video games, and cultural
heritage archival. Only recently, however, have these techniques
matured enough to leave the controlled laboratory environment for the
wild and provide industrial-scale robustness, accuracy, and scalability.
Modeling the 3D geometry of real objects or scenes is a challenging
task that has seen a variety of tools and approaches applied, such as
Computer Aided Design (CAD) tools [3], arm-mounted probes, active
methods [110, 131, 11, 10], and passive image-based methods
[162, 165, 176]. Among these, passive image-based methods, the subject
of this tutorial, provide a fast way of capturing accurate 3D content
at a fraction of the cost of other approaches. The steady increase in
image resolution and quality has turned digital cameras into cheap and
reliable high-resolution sensors that can generate outstanding-quality
3D content.
Figure 1.1: Image-based 3D reconstruction. Given a set of photographs (left), the
goal of image-based 3D reconstruction algorithms is to estimate the most likely 3D
shape that explains those photographs (right).
The goal of an image-based 3D reconstruction algorithm can be
described as “given a set of photographs of an object or a scene, estimate
the most likely 3D shape that explains those photographs, under the
assumptions of known materials, viewpoints, and lighting conditions”
(see Figure 1.1). This definition exposes the main difficulty of the
task: it assumes that materials, viewpoints, and lighting are known.
If they are not known, the problem is generally ill-posed, since
multiple combinations of geometry, materials, viewpoints, and lighting
can produce exactly the same photographs. As a result, without further
assumptions, no single algorithm can correctly reconstruct the 3D
geometry from photographs alone. However, under a set of reasonable
extra assumptions, e.g. rigid Lambertian textured surfaces,
state-of-the-art techniques can produce highly detailed reconstructions
even from millions of photographs.
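The ill-posedness mentioned above can be illustrated with a single pixel. The following is a minimal sketch, assuming a Lambertian surface lit by one distant light; the function name `lambertian_intensity` and all numeric values are illustrative choices, not taken from the tutorial. It shows three different combinations of material, lighting, and surface orientation that produce the same observed intensity.

```python
import numpy as np


def lambertian_intensity(albedo, light_intensity, normal, light_dir):
    """Observed intensity of a Lambertian surface point under one distant light.

    Assumed model (illustrative): I = albedo * light_intensity * max(0, n . l),
    with n and l normalized to unit length.
    """
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    return albedo * light_intensity * max(0.0, float(n @ l))


light_dir = np.array([0.0, 1.0, 1.0])
facing_up = np.array([0.0, 0.0, 1.0])  # surface at 45 degrees to the light

# Dark material under a bright light.
a = lambertian_intensity(0.2, 5.0, facing_up, light_dir)

# Bright material under a dim light: same product albedo * light_intensity.
b = lambertian_intensity(0.8, 1.25, facing_up, light_dir)

# Different geometry: surface tilted to face the light, with a dimmer light.
c = lambertian_intensity(0.2, 5.0 * np.cos(np.pi / 4), light_dir, light_dir)

print(np.isclose(a, b), np.isclose(a, c))
```

From one photograph alone, these three scenes cannot be told apart, which is why the extra assumptions and multiple viewpoints discussed in the text are needed.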
Many cues can be used to extract geometry from photographs: texture,
defocus, shading, contours, and stereo correspondence. The last three
have been very successful, with stereo correspondence being the most
successful in terms of robustness and number of applications.
Multi-view stereo (MVS) is the general term for a group of techniques
that use stereo correspondence as their main cue and use more than two
images [165, 176].
All the MVS algorithms described in the following chapters assume
the same input: a set of images and their corresponding camera
parameters. This chapter gives an overview of an MVS pipeline starting
from photographs alone. An important take-home message of this chapter
is simple: an MVS algorithm is only as good as the quality of its input
images and camera parameters. Moreover, a large part of the recent
success of MVS is due to the success of the underlying Structure from
Motion (SfM) algorithms that compute the camera parameters.
Figure 1.2: Example of a multi-view stereo pipeline. Clockwise: input imagery,
posed imagery, reconstructed 3D geometry, textured 3D geometry.
Figure 1.2 provides a sketch of a generic MVS pipeline. Different
applications may use different implementations of each of the main
blocks, but the overall approach is always similar:
• Collect images,
• Compute camera parameters for each image,
• Reconstruct the 3D geometry of the scene from the set of images
and corresponding camera parameters,
• Optionally, reconstruct the materials of the scene.
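The data flow between these blocks can be sketched as follows. This is only an illustrative skeleton: the names `PosedImage` and `run_mvs_pipeline` and the callable-per-stage design are assumptions made for this example, and each stage is a placeholder to be supplied by the application rather than an actual SfM or MVS implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence, Tuple


@dataclass
class PosedImage:
    """An input image paired with its estimated camera parameters."""
    image: Any
    camera: Any


def run_mvs_pipeline(
    collect_images: Callable[[], Sequence[Any]],
    estimate_cameras: Callable[[Sequence[Any]], Sequence[Any]],
    reconstruct_geometry: Callable[[List[PosedImage]], Any],
    reconstruct_materials: Optional[Callable[[List[PosedImage], Any], Any]] = None,
) -> Tuple[Any, Optional[Any]]:
    """Chain the pipeline stages; every stage is a caller-supplied placeholder."""
    # Stage 1: collect images.
    images = collect_images()

    # Stage 2: compute camera parameters for each image (e.g. via SfM).
    # Simplifying assumption of this sketch: exactly one camera per image.
    cameras = estimate_cameras(images)
    if len(cameras) != len(images):
        raise ValueError("expected exactly one set of camera parameters per image")
    posed = [PosedImage(img, cam) for img, cam in zip(images, cameras)]

    # Stage 3: reconstruct 3D geometry from images and camera parameters.
    geometry = reconstruct_geometry(posed)

    # Stage 4 (optional): reconstruct the materials of the scene.
    materials = None
    if reconstruct_materials is not None:
        materials = reconstruct_materials(posed, geometry)
    return geometry, materials


# Toy usage with stand-in stages that only track how many views were used.
geometry, materials = run_mvs_pipeline(
    collect_images=lambda: ["img0", "img1", "img2"],
    estimate_cameras=lambda imgs: [f"cam_for_{name}" for name in imgs],
    reconstruct_geometry=lambda posed: {"num_views": len(posed)},
)
```

Passing each stage as a callable mirrors the remark that different applications may swap in different implementations of each block while the overall structure stays the same.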
Figure 1.3: Different MVS capture setups. From left to right: a controlled MVS
capture using diffuse lights and a turn table, outdoor capture of small-scale scenes,
and crowd-sourcing from online photo-sharing websites.
In this chapter we give more insight into the first three main
stages of MVS: imagery collection, camera parameter estimation, and
3D geometry reconstruction. Chapter 2 develops the notion of
photo-consistency as the main signal being optimized by MVS algorithms.
Chapter 3 presents and compares some of the most successful MVS
algorithms. Chapter 4 discusses the use of domain knowledge, in
particular structural priors, to improve reconstruction quality.
Chapter 5 gives an overview of successful applications, available
software, and best practices. Finally, Chapter 6 describes some of the
current limitations of MVS as well as research directions to address them.
1.1 Imagery collection
MVS capture setups can be roughly classified into three categories (see
Figure 1.3):
• Laboratory setting,
• Outdoor small-scale scene capture,
• Large-scale scene capture using fleets or crowd-sourcing, e.g.,
cars, planes, drones, and the Internet.
MVS algorithms first started in a laboratory setting [184, 147, 58],
where the light conditions could be easily controlled and the camera