Simultaneous Localization And Mapping:
Present, Future, and the Robust-Perception Age
Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif,
Davide Scaramuzza, José Neira, Ian D. Reid, John J. Leonard
arXiv:1606.05830v1 [cs.RO] 19 Jun 2016
Abstract—Simultaneous Localization and Mapping (SLAM)
consists in the concurrent construction of a representation of
the environment (the map), and the estimation of the state of
the robot moving within it. The SLAM community has made
astonishing progress over the last 30 years, enabling large-scale
real-world applications, and witnessing a steady transition of this
technology to industry. We survey the current state of SLAM. We
start by presenting what is now the de-facto standard formulation
for SLAM. We then review related work, covering a broad set of
topics including robustness and scalability in long-term mapping,
metric and semantic representations for mapping, theoretical
performance guarantees, active SLAM and exploration, and
other new frontiers. The paper serves as a tutorial for the
non-expert reader. It is also a position paper: by looking at
the published research with a critical eye, we delineate open
challenges and new research issues that still deserve careful
scientific investigation. The paper also contains the authors’ take
on two questions that often animate discussions during robotics
conferences: do robots need SLAM? Is SLAM solved?
I. INTRODUCTION
SLAM consists in the simultaneous estimation of the state
of the robot and the map of the environment. In simple
instances, the robot state is described by its pose (position
and orientation), although other quantities may be included in
the state, such as robot velocity, sensor biases, and calibration
parameters. The map, on the other hand, is a representation
of aspects of interest (e.g., position of landmarks, obstacles)
describing the environment in which the robot operates.
The need to build a map of the environment is twofold.
First, the map is often required to support other tasks; for
instance, a map can inform path planning or provide an
intuitive visualization for a human operator. Second, the map
allows limiting the error committed in estimating the state of
the robot. In the absence of a map, dead-reckoning would
quickly drift over time; on the other hand, given a map, the
robot can “reset” its localization error by re-visiting known
C. Cadena is with the Autonomous Systems Lab, ETH Zurich, Switzerland.
E-mail: cesarc@eth.ch
L. Carlone is with the Laboratory for Information and Decision Systems,
Massachusetts Institute of Technology, USA. E-mail: lcarlone@mit.edu
H. Carrillo is with the Escuela de Ciencias Exactas e Ingeniería, Universidad
Sergio Arboleda, Colombia. E-mail: henry.carrillo@usa.edu.co
Y. Latif and I.D. Reid are with the Australian Center for Robotic Vi-
sion, University of Adelaide, Australia. E-mail: yasir.latif@adelaide.edu.au,
ian.reid@adelaide.edu.au
J. Neira is with the Dept. Informática e Ingeniería de Sistemas, Universidad
de Zaragoza, Spain. E-mail: jneira@unizar.es
D. Scaramuzza is with the Robotics and Perception Group, University of
Zurich, Switzerland. E-mail: sdavide@ifi.uzh.ch
J.J. Leonard is with Marine Robotics Group, the Massachusetts Institute of
Technology, USA. E-mail: jleonard@mit.edu
areas (the so called loop closure). Therefore, SLAM finds
applications in all scenarios in which a prior map is not
available and needs to be built.
In some robotics applications a map of the environment is
known a priori. For instance, a robot operating on a factory
floor can be provided with a manually built map of artificial
beacons placed in the environment. Another example is the
case in which the robot has access to GPS measurements (the
GPS satellites can be considered as moving beacons at known
locations). The popularity of the SLAM problem is connected
with the emergence of indoor applications of mobile robotics.
Indoor operation rules out
the use of GPS to bound the
localization error; furthermore, SLAM provides an appealing
alternative to user-built maps, showing that robot operation is
possible in absence of an ad-hoc localization infrastructure.
A thorough historical review of the first 20 years of the
SLAM problem is given by Durrant-Whyte and Bailey
in [13, 86]. These surveys mainly cover what we
call the classical age (1986-2004); the classical age saw the
introduction of the main probabilistic formulations for SLAM,
including approaches based on Extended Kalman Filters, Rao-
Blackwellised Particle Filters, and maximum likelihood esti-
mation; moreover, it delineated the first challenges connected
to efficiency and robust data association. The subsequent
period is what we call the algorithmic-analysis age (2004-
2015), and is partially covered by Dissanayake et al. in [84].
The algorithmic analysis period saw the study of fundamental
properties of SLAM, including observability, convergence, and
consistency. In this period, the key role of sparsity towards
efficient SLAM solvers was also understood, and the main
open-source SLAM libraries were developed.
We review the main SLAM surveys to date in Table I,
observing that most recent surveys only cover specific aspects
or sub-fields of SLAM. The popularity of SLAM in the last 30
years is not surprising if one thinks about the manifold aspects
that SLAM involves. At the lower level (called the front-end
in Section II) SLAM naturally intersects other research fields
such as computer vision and signal processing; at the higher
level (that we later call the back-end), SLAM is an appealing
mix of geometry, graph theory, optimization, and probabilistic
estimation. Finally, a SLAM expert has to deal with practical
aspects ranging from sensor modeling to system integration.
The present paper gives a broad overview of the current
state of SLAM, and offers the perspective of part of the
community on the open problems and future directions for
the SLAM research. The paper summarizes the outcome of
the workshop “The Problem of Mobile Sensors: Setting future
TABLE I: Surveying the surveys

Year  Topic                                  Reference
2006  Probabilistic approaches and           Durrant-Whyte and Bailey [13, 86]
      data association
2008  Filtering approaches                   Aulinas et al. [12]
2008  Visual SLAM                            Neira et al. (special issue) [200]
2011  SLAM back-end                          Grisetti et al. [117]
2011  Observability, Consistency,            Dissanayake et al. [84]
      Convergence
2012  Visual Odometry                        Scaramuzza and Fraundorfer [102, 248]
2016  Multi robot SLAM                       Saeedi et al. [244]
2016  Visual Place recognition               Lowry et al. [177]
goals and indicators of progress for SLAM” [35], held during
the Robotics: Science and System (RSS) conference (Rome,
July 2015). While we often mention topological mapping, our
main focus is on metric and semantic SLAM, and we refer
the reader to the recent survey of Lowry et al. [177] for
a more comprehensive coverage of vision-based topological
SLAM and place recognition. Before delving into the paper,
we provide our take on two questions that often animate
discussions during robotics conferences.
Do autonomous robots really need SLAM? Answer-
ing this question requires understanding what makes SLAM
unique. SLAM aims at building a globally consistent rep-
resentation of the environment, leveraging both ego-motion
measurements and loop closures. The keyword here is “loop
closure”: if we sacrifice loop closures, SLAM reduces to
odometry. In early applications, odometry was obtained by
integrating wheel encoders. The pose estimate obtained from
wheel odometry quickly drifts, making the estimate unusable
after a few meters; this was one of the main thrusts behind the
development of SLAM: the observation of external landmarks
is useful to reduce the trajectory drift and possibly correct it.
However, more recent odometry algorithms are based on visual
and inertial information, and have very small drift (< 0.5% of
the trajectory length). Hence the question becomes legitimate:
do we really need SLAM? Our answer is three-fold.
First of all, we observe that the SLAM research done
over the last decade is what produced the visual-inertial
odometry algorithms that currently represent the state of the
art; in this sense visual-inertial navigation (VIN) is SLAM:
VIN can be considered a reduced SLAM system, in which the
loop closure (or place recognition) module is disabled. More
generally, SLAM provided an excuse to study sensor fusion
under more challenging setups (i.e., no GPS, low-quality
sensors) than previously considered in other literatures (e.g.,
inertial navigation in aerospace engineering).
The second answer regards loop closures. A robot perform-
ing odometry and neglecting loop closures interprets the world
as an “infinite corridor” (Fig. 1a) in which the robot keeps
exploring new areas indefinitely. A loop closure event informs
the robot that this “corridor” keeps intersecting itself (Fig. 1b).
The advantage of loop closure now becomes clear: by finding
Fig. 1: Left: map built from odometry. The map is homotopic to a long corridor
that goes from the starting position A to the final position B. Points that are
close in reality (e.g., B and C) may be arbitrarily far in the odometric map.
Right: map built from SLAM. By leveraging loop closures, SLAM estimates
the actual topology of the environment, and “discovers” shortcuts in the map.
loop closures, the robot understands the real topology of the
environment, and is able to find shortcuts between locations
(e.g., point B and C in the map). Therefore, if getting the right
topology of the environment is one of the merits of SLAM,
why not simply drop the metric information and just do place
recognition? The answer is simple: the metric information
makes place recognition much simpler and more robust; the
metric reconstruction informs the robot about loop closure
opportunities and allows discarding spurious loop closures.
Therefore, while SLAM might be redundant in principle (an
oracle place recognition module would suffice for topological
mapping), SLAM offers a natural defense against wrong data
association and perceptual aliasing, where similarly looking
scenes, corresponding to distinct locations in the environment,
would deceive place recognition. In this sense, the SLAM map
provides a way to predict and validate future measurements:
we believe that this mechanism is key to robust operation.
The third answer is that SLAM is needed since many appli-
cations implicitly or explicitly do require a globally consistent
map. For instance, in many military and civil applications the
goal of the robot is to explore an environment and report
a map to the human operator. Another example is the case
in which the robot has to perform structural inspection (of a
building, bridge, etc.); also in this case a globally consistent
3D reconstruction is a requirement for successful operation.
One can identify tasks for which different flavors of SLAM
formulations are more suitable than others. Therefore, when a
roboticist has to design a SLAM system, he/she is faced with
multiple design choices. For instance, a topological map can
be used to analyze reachability of a given place, but it is not
suitable for motion planning and low-level control; a locally-
consistent metric map is well-suited for obstacle avoidance and
local interactions with the environment, but it may sacrifice
accuracy; a globally-consistent metric map allows the robot to
perform global path planning, but it may be computationally
demanding to compute and maintain. A more general way
to choose the most appropriate SLAM system is to think
about SLAM as a mechanism to compute a sufficient statistic
that summarizes all past observations of the robot, and in
this sense, which information to retain in this compressed
representation is deeply task-dependent.
Is SLAM solved? This is another question that is often
asked within the robotics community, cf. [106]. The difficulty
of answering this question lies in the question itself: SLAM
is such a broad topic that the question is well posed only
for a given robot/environment/performance combination. In
particular, one can evaluate the maturity of the SLAM problem
once the following aspects are specified:
• robot: type of motion (e.g., dynamics, maximum speed),
available sensors (e.g., resolution, sampling rate), avail-
able computational resources;
• environment: planar or three-dimensional, presence of
natural or artificial landmarks, amount of dynamic ele-
ments, amount of symmetry and risk of perceptual alias-
ing. Note that many of these aspects actually depend on
the sensor-environment pair: for instance, two rooms may
look identical for a 2D laser scanner (perceptual aliasing),
while a camera may discern them from appearance cues;
• performance requirements: desired accuracy in the esti-
mation of the state of the robot, accuracy and type of
representation of the environment (e.g., landmark-based
or dense), success rate (percentage of tests in which the
accuracy bounds are met), estimation latency, maximum
operation time, maximum size of the mapped area.
For instance, mapping a 2D indoor environment with a robot
equipped with wheel encoders and a laser scanner, with
sufficient accuracy (< 10 cm) and sufficient robustness (say,
low failure rate), can be considered largely solved. Similarly,
vision-based SLAM with slowly-moving robots (e.g., Mars
rovers, domestic robots), and visual-inertial odometry can be
considered mature research fields. On the other hand, other
robot/environment/performance combinations still deserve a
large amount of fundamental research. Current SLAM algo-
rithms can be easily induced to fail when either the motion
of the robot or the environment are too challenging (e.g.,
fast robot dynamics, highly dynamic environments); similarly,
SLAM algorithms are often unable to face strict performance
requirements, e.g., high rate estimation for fast closed-loop
control. This survey will provide a comprehensive overview
of these open problems, among others.
We conclude this section with some broader considerations
about the future of SLAM. We argue that we are entering
a third era for SLAM, the robust-perception age, which is
characterized by the following key requirements:
1) robust performance: the SLAM system operates with low
failure rate for an extended period of time in arbitrary
environments; the system includes fail-safe mechanisms
and has self-tuning capabilities1 in that it can adapt the
selection of the system parameters to the scenario.
2) high-level understanding: the SLAM system goes be-
yond basic geometry reconstruction to obtain a high-
level understanding of the environment (e.g., semantic,
affordances, high-level geometry, physics);
3) resource awareness: the SLAM system is tailored to
the available sensing and computational resources, and
provides means to adjust the computation load depending
on the available resources;
4) task-driven inference: the SLAM system produces adap-
tive map representations, whose complexity can change
depending on the task that the robot has to perform.
1The SLAM community has been largely affected by the “curse of manual
tuning”, in that satisfactory operation is enabled by expert tuning of the system
parameters (e.g., stopping conditions, thresholds for outlier rejection).
Fig. 2: Front-end and back-end in a typical SLAM system.
Paper organization. The paper starts by presenting a stan-
dard formulation and “architecture” for SLAM (Section II).
Section III tackles robustness in life-long SLAM. Section IV
deals with scalability. Section V discusses how to represent
the geometry of the environment. Section VI extends the
question of the environment representation to the modeling
of semantic information. Section VII provides an overview
of the current accomplishments on the theoretical aspects of
SLAM. Section VIII broadens the discussion and reviews the
active SLAM problem in which decision making is used to
improve the quality of the SLAM results. Section IX provides
an overview of recent trends in SLAM, including the use
of unconventional sensors and new formulations. Section X
provides final remarks. Throughout
the paper, we provide
many pointers to related work outside the robotics commu-
nity. Despite its unique traits, SLAM is related to problems
addressed in computer vision, computer graphics, and control
theory, and cross-fertilization among these fields is a necessary
condition to enable fast progress.
For the non-expert reader, we recommend reading Durrant-
Whyte and Bailey’s SLAM tutorials [13, 86] before delving
into this position paper. The more experienced researcher can
jump directly to the section of interest, where he/she will
find a self-contained overview of the state of the art and open
problems.
II. ANATOMY OF A MODERN SLAM SYSTEM
A SLAM system can be conceptually divided into two parts:
the front-end and the back-end. The front-end abstracts sensor
data into models that are amenable for estimation, while the
back-end performs inference on the abstracted data produced
by the front-end. This architecture is summarized in Fig. 2. We
review both components, starting from the back-end.
Maximum-a-posteriori (MAP) estimation and the SLAM
back-end. The current de-facto standard formulation of SLAM
has its origins in the seminal paper of Lu and Milios [178],
followed by the work of Gutmann and Konolige [120]. Since
then, numerous approaches have improved the efficiency and
robustness of the optimization underlying the problem [80,
97, 118, 146, 147, 216, 271]. All these approaches formulate
SLAM as a maximum-a-posteriori estimation problem, and
often use the formalism of factor graphs [163] to reason about
the interdependence among variables.
Assume that we want to estimate an unknown variable X ; as
mentioned before, in SLAM the variable X typically includes
the trajectory of the robot (as a discrete set of poses) and
the position of landmarks in the environment. We are given
a set of measurements Z = {zk : k = 1, . . . , m} such that
each measurement can be expressed as a function of X, i.e.,
zk = hk(Xk) + εk, where Xk ⊆ X is a subset of the variables,
hk(·) is a known function (the measurement or observation
model), and εk is random measurement noise.
In MAP estimation, we estimate X by computing the
assignment of variables X that attains the maximum of the
posterior P(X|Z) (the belief over X given the measurements):
X⋆ = argmax_X P(X|Z) = argmax_X P(Z|X) P(X)    (1)
where the equality follows from the Bayes theorem. In (1),
P(Z|X ) is the likelihood of the measurements Z given the
assignment X , and P(X ) is a prior probability over X . The
prior probability includes any prior knowledge about X ; in
case no prior knowledge is available, P(X ) becomes a constant
(uniform distribution) which is inconsequential and can be
dropped from the optimization. In that case MAP estimation
reduces to maximum likelihood estimation.
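To make this concrete, here is a toy sketch (our own illustration, not from the paper) of scalar MAP estimation with a Gaussian prior and Gaussian measurements: the MAP estimate is the information-weighted average of prior and data, and dropping the prior recovers the ML estimate.

```python
import numpy as np

def map_estimate_scalar(z, sigmas, prior_mean=None, prior_sigma=None):
    """MAP estimate of a scalar x from measurements z_k = x + noise.

    With Gaussian noise, maximizing P(x|Z) = P(Z|x)P(x) is equivalent to
    minimizing a sum of squared, information-weighted residuals, whose
    minimizer is the information-weighted average of prior and measurements.
    """
    z = np.asarray(z, dtype=float)
    info = 1.0 / np.asarray(sigmas, dtype=float) ** 2   # information = 1/sigma^2
    num = np.sum(info * z)
    den = np.sum(info)
    if prior_mean is not None:                          # no prior -> ML estimate
        w0 = 1.0 / prior_sigma ** 2
        num += w0 * prior_mean
        den += w0
    return num / den

# Two measurements of equal precision, no prior: ML estimate is their mean.
ml = map_estimate_scalar([1.0, 3.0], [0.5, 0.5])        # -> 2.0
# A confident prior at 0 pulls the MAP estimate toward the prior mean.
map_ = map_estimate_scalar([1.0, 3.0], [0.5, 0.5], prior_mean=0.0, prior_sigma=0.5)
```

With a flat (uninformative) prior the two estimates coincide, mirroring the remark that MAP reduces to maximum likelihood when P(X) is constant.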
Assuming that the measurements Z = {zk : k = 1, . . . , m}
are independent (i.e., the noise terms affecting the measure-
ments are not correlated), problem (1) factorizes into:

X⋆ = argmax_X P(X) ∏_{k=1..m} P(zk|X) = argmax_X P(X) ∏_{k=1..m} P(zk|Xk)    (2)

where, on the right-hand-side, we noticed that zk only depends
on the subset of variables in Xk.

Fig. 3: SLAM as a factor graph: blue circles denote robot poses at
consecutive time steps (x1, x2, . . .), green circles denote landmark positions
(l1, l2, . . .), the red circle denotes the variable associated with the intrinsic
calibration parameters (K). Factors are shown as black dots: the label “u”
marks factors corresponding to odometry constraints, “v” marks factors
corresponding to camera observations, “c” denotes loop closures, and “p”
denotes prior factors.

Problem (2) can be interpreted in terms of inference over
a factor graph [163]. The variables correspond to nodes in
the factor graph. The terms P(zk|Xk) and the prior P(X)
are called factors, and they encode probabilistic constraints
over a subset of nodes. A factor graph is a graphical model
that encodes the dependence between the k-th factor (and
its measurement zk) and the corresponding variables Xk.
A first advantage of the factor graph interpretation is that
it enables an insightful visualization of the problem. Fig. 3
shows an example of the factor graph underlying a simple SLAM
problem (more details in Example 1 below). The figure
shows the variables, namely, the robot poses, the landmark
positions, and the camera calibration parameters, and the
factors imposing constraints among these variables. A second
advantage is generality: a factor graph can model complex
inference problems with heterogeneous variables and factors,
and arbitrary interconnections; the connectivity of the factor
graph in turn influences the sparsity of the resulting SLAM
problem, as discussed below.

In order to write (2) in a more explicit form, assume that
the measurement noise εk is a zero-mean Gaussian noise
with information matrix Ωk. Then, the measurement likelihood
in (2) becomes:

P(zk|Xk) ∝ exp(−(1/2) ||hk(Xk) − zk||²_Ωk)    (3)

where we use the notation ||e||²_Ω = eᵀΩe. Similarly, assume
that the prior is Normally distributed:

P(X) ∝ exp(−(1/2) ||h0(X) − z0||²_Ω0)    (4)

for some given function h0(·), prior mean z0, and information
matrix Ω0. Since maximizing the posterior is the same as
minimizing the negative log-posterior, the MAP estimate
in (2) becomes:

X⋆ = argmin_X −log( P(X) ∏_{k=1..m} P(zk|Xk) ) = argmin_X ∑_{k=0..m} (1/2) ||hk(Xk) − zk||²_Ωk    (5)
which is a nonlinear least squares problem since, as in most
problems of interest in robotics, hk(·) is a nonlinear function.
Note that the formulation (5) follows from the assumption of
Normally distributed noise. Other assumptions for the noise
distribution lead to different cost functions; for instance, if the
noise follows a Laplace distribution, the squared 2-norm in (5)
is replaced by the 1-norm. To increase resilience to outliers, it
is also common to substitute the squared 2-norm in (5) with
robust loss functions (e.g., Huber or Tukey loss) [133].
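A small sketch of this point (our own illustration; the function names are ours): replacing the squared loss with the Huber loss bounds the influence of a large residual, e.g., one produced by a spurious loop closure.

```python
import numpy as np

def squared_loss(r):
    return 0.5 * r ** 2

def huber_loss(r, delta=1.0):
    """Quadratic near zero, linear in the tails: outliers get bounded influence."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

residuals = np.array([0.1, -0.2, 0.15, 10.0])   # last entry: a spurious loop closure
sq = squared_loss(residuals).sum()
hu = huber_loss(residuals).sum()
# The outlier contributes 50.0 to the squared cost but only 9.5 to the Huber cost,
# so it no longer dominates the optimization.
```

The inliers are treated identically by the two losses; only the tail behavior changes, which is exactly the resilience-to-outliers property mentioned above.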
The computer vision expert may notice a resemblance
between problem (5) and bundle adjustment (BA) in Structure
from Motion [275]; both (5) and BA indeed stem from a
maximum-a-posteriori formulation. However, two key features
make SLAM unique. First, the factors in (5) are not con-
strained to model projective geometry as in BA, but include a
broad variety of sensor models, e.g., inertial sensors, wheel
encoders, GPS, to mention a few. Second, problem (5) is
designed to be solved incrementally: new measurements are
made available at each time step as the robot moves.
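Returning to the factor graph view of (2), the abstraction can be sketched as a tiny data structure (our own minimal illustration, far simpler than libraries such as GTSAM or g2o): variables are nodes, and each factor stores the subset of variables it constrains together with its residual function.

```python
class FactorGraph:
    """Minimal factor graph: variables plus factors over subsets of them."""

    def __init__(self):
        self.values = {}      # variable name -> current estimate
        self.factors = []     # list of (variable names, residual function)

    def add_variable(self, name, initial):
        self.values[name] = float(initial)

    def add_factor(self, names, residual):
        # `residual` maps the named variables' values to an error term.
        self.factors.append((names, residual))

    def total_error(self):
        # Sum of squared residuals: the cost minimized in (5) (unit weights).
        return sum(residual(*(self.values[n] for n in names)) ** 2
                   for names, residual in self.factors)

g = FactorGraph()
g.add_variable("x1", 0.0)
g.add_variable("x2", 0.9)
g.add_factor(("x1",), lambda x1: x1 - 0.0)                   # prior factor "p"
g.add_factor(("x1", "x2"), lambda x1, x2: (x2 - x1) - 1.0)   # odometry factor "u"
err = g.total_error()
```

A solver would now adjust `g.values` to minimize `total_error()`; the point of the sketch is that each factor only touches the variables it names, which is the source of the sparsity exploited below.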
The minimization problem (5) is commonly solved via
successive linearizations, e.g.,
the Gauss-Newton (GN) or
the Levenberg-Marquardt methods (more modern approaches,
based on convex relaxations and Lagrangian duality are re-
viewed in Section VII). The Gauss-Newton method proceeds
iteratively, starting from a given initial guess ˆX . At each
iteration, GN approximates the minimization (5) as
δX = argmin_δX ∑_{k=0..m} (1/2) ||Ak δX − bk||²_Ωk = argmin_δX (1/2) ||A δX − b||²_Ω    (6)

where δX is a small “correction” w.r.t. the linearization point
ˆX, Ak = ∂hk(X)/∂X is the Jacobian of the measurement function
hk(·) with respect to X, bk = zk − hk(ˆX) is the residual error
at ˆX; on the right-hand-side of (6), A (resp. b) is obtained by
stacking Ak (resp. bk); Ω is a block diagonal matrix including
the measurement information matrices Ωk as diagonal blocks.
The optimal correction δX which minimizes (6) can be
computed in closed form as:

δX = (AᵀΩA)⁻¹ AᵀΩb    (7)
and, at each iteration, the linearization point is updated via
ˆX ← ˆX + δX . The matrix (ATΩA) is usually referred to as
the Hessian. So far we assumed that X belongs to a vector
space (for this reason the sum ˆX + δX is well defined). When
X includes variables belonging to a smooth manifold (e.g.,
rotations), the structure of the GN method remains unaltered,
but the usual sum (e.g., ˆX + δX ) is substituted with a suitable
mapping, called retraction [1]. In the robotics literature, the
retraction is often denoted with the operator ⊕ and maps
the “small correction” δX defined in the tangent space of
the manifold at ˆX , to an element of the manifold, i.e., the
linearization point is updated as ˆX ← ˆX ⊕ δX .
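A minimal sketch of a retraction (our own illustration on 2D poses; the paper’s Example 1 works with SE(3)): the correction δX = (δx, δy, δθ) lives in the tangent (vector) space, and ⊕ maps it back onto the manifold, here by composing the translation in the local frame and wrapping the angle.

```python
import numpy as np

def retract_se2(pose, delta):
    """Retraction x_hat <- x_hat (+) delta for a 2D pose (x, y, theta).

    A simple retraction: apply the translation in the current local frame
    and wrap the angle, so the update always lands back on the manifold
    (the angle component stays in (-pi, pi]).
    """
    x, y, th = pose
    dx, dy, dth = delta
    c, s = np.cos(th), np.sin(th)
    x_new = x + c * dx - s * dy          # translation expressed in the local frame
    y_new = y + s * dx + c * dy
    th_new = np.arctan2(np.sin(th + dth), np.cos(th + dth))  # wrap to (-pi, pi]
    return np.array([x_new, y_new, th_new])

pose = np.array([1.0, 2.0, np.pi / 2])
# Move 0.5 "forward" in the robot frame and rotate by pi.
updated = retract_se2(pose, np.array([0.5, 0.0, np.pi]))
```

A plain vector sum ˆX + δX would let the angle grow without bound; the retraction keeps the iterate a valid pose at every Gauss-Newton step.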
The key insight behind modern SLAM solvers, is that the
Jacobian matrix A appearing in (7) is sparse and its sparsity
structure is dictated by the topology of the underlying factor
graph. This enables the use of fast linear solvers to compute
δX [146, 147, 164, 229]. Moreover, it allows designing in-
cremental (or online) solvers, which update the estimate of
X as new observations are acquired [146, 147, 229]. Current
SLAM libraries (e.g., GTSAM [83], g2o [164], Ceres [3],
iSAM [146], and SLAM++ [229]) are able to solve problems
with tens of thousands of variables in a few seconds.
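The sparsity claim can be visualized with a toy chain of odometry-like factors over scalar poses (our own sketch, not from the paper): each factor involves only two consecutive variables, so the “Hessian” AᵀΩA is tridiagonal, mirroring the factor graph topology.

```python
import numpy as np

n = 6                                  # number of (scalar) poses in a chain
m = n - 1                              # one odometry-like factor per consecutive pair
A = np.zeros((m + 1, n))
A[0, 0] = 1.0                          # prior factor anchors the first pose
for k in range(m):                     # factor k involves only poses k and k+1
    A[k + 1, k] = -1.0
    A[k + 1, k + 1] = 1.0
Omega = np.eye(m + 1)                  # unit information matrices for simplicity
H = A.T @ Omega @ A                    # the "Hessian" appearing in (7)
# H is tridiagonal: entry (i, j) is zero whenever |i - j| > 1, because no
# factor connects non-adjacent poses; a banded solver factors it in O(n).
```

Loop closures add off-band entries exactly where the graph gains extra edges, which is why the factor graph connectivity dictates the fill-in of the linear solver.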
The SLAM formulation described so far is commonly
referred to as maximum-a-posteriori estimation, factor graph
optimization, graph-SLAM, full smoothing, or smoothing and
mapping (SAM). A popular particularization of this framework
is pose graph optimization,
in which the variables to be
estimated are poses sampled along the trajectory of the robot,
and each factor imposes a constraint on a pair of poses.
MAP estimation has been proven to be more accurate
and efficient than the original approaches to SLAM based on
nonlinear filtering. We refer the reader to the surveys [13, 86]
for an overview on filtering approaches, and to [264] for
a comparison between filtering and smoothing approaches.
However, we remark that some SLAM systems based on
EKF have also been demonstrated to attain state-of-the-art
performance. Excellent examples of EKF-based SLAM sys-
tems include the Multi-State Constraint Kalman Filter of
Mourikis and Roumeliotis [193], and the VIN systems of
Kottas et al. [160] and Hesch et al. [126]. Not surprisingly, the
performance mismatch between filtering and MAP estimation
gets smaller when the linearization point for EKF is accurate
(as it happens in visual-inertial navigation problems), when
using sliding-window filters, and when potential sources of
inconsistency in EKF are taken care of [126, 130, 160].
As discussed in the next section, MAP estimation is usually
performed on a pre-processed version of the sensor data. In
this regard, it is often referred to as the SLAM back-end.
Sensor-dependent SLAM front-end. In practical robotics
applications, it might be hard to directly write the sensor
measurements as an analytic function of the state, as required
in MAP estimation. For instance, if the raw sensor data is an
image, it might be hard to express the intensity of each pixel
as a function of the SLAM state; the same difficulty arises
with simpler sensors (e.g., a laser with a single beam). In both
cases the issue is connected with the fact that we are not able to
design a sufficiently general, yet tractable representation of the
environment; even in the presence of such a general representation,
it would be hard to write an analytic function that connects
the measurements to the parameters of such a representation.
For this reason, before the SLAM back-end, it is common
to have a module, the front-end, that extracts relevant features
from the sensor data. For instance, in vision-based SLAM,
the front-end extracts the pixel locations of a few distinguishable
points in the environment; pixel observations of these points
are now easy to model within the back-end (see Example 1).
The front-end is also in charge of associating each measure-
ment to a specific landmark (say, 3D point) in the environment:
this is the so called data association. More abstractly, the data
association module associates each measurement zk with a
subset of unknown variables Xk such that zk = hk(Xk) + k.
Finally, the front-end might also provide an initial guess for
the variables in the nonlinear optimization (5).
A pictorial representation of a typical SLAM system is given
in Fig. 2. The data association module in the front-end includes
a short-term data association block and a long-term one. Short-
term data association is responsible for associating correspond-
ing features in consecutive sensor measurements; for instance,
short-term data association would track the fact that 2 pixel
measurements in consecutive frames are picturing the same
3D point. On the other hand, long-term data association (or
loop closure) is in charge of associating new measurements to
older landmarks. We remark that the back-end usually feeds
feedback information to the front-end, e.g., to support loop
closure detection and validation.
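As a rough sketch of short-term data association (our own illustration; real front-ends also match descriptors and verify geometry), one can associate each feature in the current frame to the nearest feature in the previous frame, rejecting matches beyond a gating distance.

```python
import numpy as np

def associate_nearest(prev_pts, curr_pts, gate=5.0):
    """Greedy nearest-neighbor association between two sets of 2D features.

    Returns (i, j) pairs: feature j in the current frame is matched to
    feature i in the previous frame if it lies within `gate` pixels.
    A real front-end would also compare descriptors and check geometry.
    """
    matches = []
    used = set()
    for j, p in enumerate(curr_pts):
        d = np.linalg.norm(prev_pts - p, axis=1)   # distance to every old feature
        i = int(np.argmin(d))
        if d[i] <= gate and i not in used:         # gating rejects implausible jumps
            matches.append((i, j))
            used.add(i)                            # each old feature matched at most once
    return matches

prev_pts = np.array([[10.0, 10.0], [50.0, 20.0]])
curr_pts = np.array([[11.0, 10.5], [90.0, 90.0]])  # second point has no valid match
pairs = associate_nearest(prev_pts, curr_pts)      # -> [(0, 0)]
```

Long-term data association (loop closure) cannot rely on such proximity gating, since drift makes the predicted positions unreliable; this is where the feedback from the back-end mentioned above becomes important.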
The pre-processing that happens in the front-end is sensor
dependent, since the notion of feature changes depending on
the input data stream we consider. In the colored box below
we provide an example of SLAM for a robot equipped with
a monocular camera and a 3D laser scanner.
III. LONG-TERM AUTONOMY I: ROBUSTNESS
A SLAM system might be fragile in many aspects: failure
can be algorithmic2 or hardware-related. The former class
includes failure modes induced by limitation of the exist-
ing SLAM algorithms (e.g., difficulty to handle extremely
dynamic environments). The latter includes failures due to
sensor or actuator degradation. Explicitly addressing these
failure modes is crucial for long-term operation, where one
can no longer take simplified assumptions on the structure
of the environment (e.g., mostly static) or fully rely on on-
board sensors. In this section we review the main challenges
2We omit the (large) class of software-related failures, while the non-expert
reader must be aware that integration and testing are key aspects of SLAM
and robotics in general.
Example 1: SLAM with camera and laser. We con-
sider the SLAM problem of Fig. 3, describing a robot
equipped with an uncalibrated monocular camera and a
3D laser scanner. The SLAM state at time t is X =
{x1, . . . , xt, l1, . . . , ln, K}, where xi is the 3D pose of the
robot at time i, lj is the 3D position of the j-th landmark,
and K is the unknown camera calibration.
The laser front-end feeds laser scans to the iterative-closest-
point (ICP) algorithm to compute the relative motion (odom-
etry) between two consecutive laser scans. Using the odom-
etry measurements, the back-end defines odometry factors,
denoted with “u” in Fig. 3. Each odometry factor connects
consecutive poses, and the corresponding measurement zu_i is
described by the measurement model:

zu_i = hu_i(xi, xi+1) ⊕ εu_i    (8)

where hu_i(·) is a function computing the relative pose be-
tween xi and xi+1, and εu_i is a zero-mean Gaussian noise
with information matrix Ωu_i. In (8), the standard sum is
replaced with the retraction ⊕ which maps a vector to an
element of SE(3), the manifold of 3D poses. Loop closing
factors, denoted with “c” in Fig. 3, generalize the odometry
factors, in that they measure the relative pose between non-
sequential robot poses:
zc_ij = hc_ij(xi, xj) ⊕ εc_ij    (9)

where hc_ij(·) computes the relative pose between xi and xj,
and the noise εc_ij has information matrix Ωc_ij.
The visual front-end extracts visual features from each cam-
era image, acquired at time i; then, the data association
module matches each feature observation to a specific 3D
landmark lj. From this information, the back-end defines
vision factors, denoted with “v” in Fig. 3. Each vision factor
includes a measurement zv_ij, and its measurement model is:

zv_ij = hv_ij(xi, lj, K) + εv_ij    (10)

where zv_ij is the pixel measurement of the projection of
landmark lj onto the image plane at time i. The function
hv_ij(·) is a standard perspective projection.
Finally, the prior factor “p” encodes our prior knowledge of
the initial pose of the robot:
zp = x1 ⊕ εp    (11)

where zp is the prior mean and εp has information matrix Ωp.
Since the on-board sensors in our example cannot measure
any external reference frame, we conventionally set the prior
mean zp to be the identity transformation.a
Given all these factors, the SLAM back-end estimates $X$ by minimizing the following nonlinear least-squares problem:

$$ X^{\star} = \arg\min_{X} \; \| z^{p} \ominus x_1 \|^{2}_{\Omega^{p}} + \sum_{i \in O} \| z^{u}_{i} \ominus h^{u}_{i}(x_i, x_{i+1}) \|^{2}_{\Omega^{u}_{i}} + \sum_{ij \in L} \| z^{c}_{ij} \ominus h^{c}_{ij}(x_i, x_j) \|^{2}_{\Omega^{c}_{ij}} + \sum_{ij \in P} \| z^{v}_{ij} - h^{v}_{ij}(x_i, l_j, K) \|^{2}_{\Omega^{v}_{ij}} $$

where $\ominus$ can be roughly understood as a "difference" on the manifold SE(3), and $O$, $L$ and $P$ identify the set of odometry, loop closure, and vision factors, respectively. Note how the front-end performs an "abstraction" of the sensor data into relative pose measurements in the factor graph.
^a The prior is useful to make the MAP estimate unique: in the absence of a global reference frame, the MAP estimator would admit an infinite number of solutions, which are equal up to a roto-translation.
to algorithmic robustness. We then discuss open problems,
including robustness against hardware-related failures.
One of the main sources of algorithmic failures is data association. As mentioned in Section II, data association matches each measurement to the portion of the state the measurement refers to. For instance, in feature-based visual SLAM, it associates each visual feature to a specific landmark. Perceptual aliasing, the phenomenon in which different sensory inputs lead to the same sensor signature, makes this problem particularly hard. In such a case, data association establishes wrong measurement-state matches (false positives), which in turn result in wrong estimates from the back-end.
The situation is made worse by the presence of unmodeled
dynamics in the environment including both short-term and
seasonal changes, which might deceive short-term and long-
term data association. A fairly common assumption in current
SLAM approaches is that the world remains unchanged as
the robot moves through it (in other words, landmarks are
static). This static world assumption holds true in a single mapping run in small-scale scenarios, as long as there are no short-term dynamics (e.g., people and objects moving around). When mapping over longer time scales and in large environments, change is inevitable. Variations between day and night, seasonal changes such as foliage, or changes in the structure of the environment, such as old buildings giving way to new constructions, all affect the performance of a SLAM system. For instance, systems relying on the repeatability of visual features fail under drastic illumination changes between day and night, while methods leveraging the geometry of the environment might fail when old structures disappear.
Another aspect of robustness is that of doing SLAM in
harsh environments such as underwater [19, 89, 92, 151]. The
challenge in this case is the limited visibility, the constantly
changing conditions, and the impossibility of using conven-
tional sensors (e.g., laser range finder).
Brief Survey. Robustness issues connected to wrong data
association can be addressed in the front-end and/or in the
back-end of a SLAM system. Traditionally, the front-end has been entrusted with establishing correct data association.
Short-term data association is the easiest one to tackle: if the
sampling rate of the sensor is relatively fast, compared to the
dynamics of the robot, tracking features that correspond to
the same 3D landmark is easy. For instance, if we want to
track a 3D point across consecutive images and assuming that
the framerate is sufficiently high, standard approaches based
on descriptor matching and optical flow [248] ensure reliable
tracking. Intuitively, at high framerate, the viewpoint of the
sensor (camera, laser) does not change significantly, hence the
features at time t + 1 remain close to the ones observed at
time t.3 Long-term data association in the front-end is more
challenging and involves loop closure detection and validation.
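As a toy illustration of this short-term, high-framerate association (our sketch, not a method from the literature), nearest-descriptor matching inside a small spatial search window suffices when inter-frame motion is only a few pixels; all values and names here are made up:

```python
import numpy as np

rng = np.random.default_rng(3)

# Frame t: feature pixel locations and (toy) descriptors.
pts_t = rng.uniform(0, 640, (15, 2))
desc_t = rng.standard_normal((15, 16))

# Frame t+1: same features, slightly moved (high framerate => small
# motion), with slightly perturbed descriptors.
pts_t1 = pts_t + rng.normal(0, 2.0, (15, 2))
desc_t1 = desc_t + 0.05 * rng.standard_normal((15, 16))

def track(pts_a, desc_a, pts_b, desc_b, radius=10.0):
    """Short-term association: for each feature in frame t, pick the
    nearest descriptor in frame t+1 among candidates within a small
    spatial search window (-1 if no candidate falls in the window)."""
    matches = []
    for i in range(len(pts_a)):
        near = np.linalg.norm(pts_b - pts_a[i], axis=1) < radius
        if not near.any():
            matches.append(-1)
            continue
        cand = np.flatnonzero(near)
        d = np.linalg.norm(desc_b[cand] - desc_a[i], axis=1)
        matches.append(int(cand[np.argmin(d)]))
    return matches

print(track(pts_t, desc_t, pts_t1, desc_t1))
```

The small search radius encodes exactly the intuition from the text: at high framerate the viewpoint barely changes, so each feature at time t+1 stays close to where it was at time t.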
For loop closure detection at the front-end, a brute-force ap-
proach which detects features in the current measurement (e.g.,
image) and tries to match them against all previously detected
features quickly becomes impractical.

^3 In hindsight, the fact that short-term data association is much easier and more reliable than the long-term one is what makes (visual, inertial) odometry simpler than SLAM.

Bag-of-words [256] avoids this intractability by quantizing the feature space and allowing a more efficient search. Bag-of-words models can be arranged
into hierarchical vocabulary trees [208] that enable efficient
lookup in large-scale datasets. Bag-of-words-based techniques
such as [71, 72, 108] have shown reliable performance on the
task of single session loop closure detection. However, these
approaches are not capable of handling severe illumination
variations, as visual words can no longer be matched. This has led to the development of new methods that explicitly account for such
variations by matching sequences [192], gathering different
visual appearances into a unified representation [64], or using
spatial as well as appearance information [127]. A detailed
survey on visual place recognition can be found in Lowry et
al. [177]. Feature-based methods have also been used to detect
loop closures in laser-based SLAM front-ends; for instance,
Tipaldi et al. [273] propose FLIRT features for 2D laser scans.
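The core of the bag-of-words idea can be sketched in a few lines. In this toy example (ours; a random "vocabulary" stands in for one trained on real descriptors), local descriptors are quantized into visual words and the most similar keyframe is retrieved by histogram similarity:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = rng.standard_normal((8, 2))   # toy "visual word" centroids

def to_bow(descriptors):
    """Quantize local descriptors to nearest visual words (histogram)."""
    words = np.argmin(((descriptors[:, None] - vocab) ** 2).sum(-1), axis=1)
    return np.bincount(words, minlength=len(vocab)).astype(float)

# A tiny "map": BoW histograms of previously seen keyframes.
keyframes = [rng.standard_normal((30, 2)) for _ in range(5)]
db = np.array([to_bow(k) for k in keyframes])
db /= np.linalg.norm(db, axis=1, keepdims=True)

def query(descriptors):
    """Return the index of the keyframe with the most similar BoW
    histogram (cosine similarity against the whole database)."""
    q = to_bow(descriptors)
    q /= np.linalg.norm(q)
    return int(np.argmax(db @ q))

# A revisit of keyframe 2 (same features, slight noise) should match it.
revisit = keyframes[2] + 0.01 * rng.standard_normal((30, 2))
print(query(revisit))
```

Real systems replace the brute-force centroid search with a hierarchical vocabulary tree and an inverted index, which is what makes lookup efficient at large scale.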
Loop closure validation, instead, consists in additional ge-
ometric verification steps to ascertain the quality of the loop
closure. In vision-based applications, RANSAC is commonly
used for geometric verification and outlier rejection, see [248]
and the references therein. In laser-based approaches, one can validate a loop closure by checking how well the current laser scan matches the existing map (i.e., how small the residual error resulting from scan matching is).
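As a toy sketch of such geometric verification (ours; real systems estimate a rigid or similarity transform from two or three matches rather than a pure translation from one), RANSAC hypothesizes a transform from a minimal sample of putative matches and accepts the loop closure only if enough matches agree:

```python
import numpy as np

rng = np.random.default_rng(0)

# Putative feature matches between the current scan and the map: mostly
# explained by a pure translation t_true, plus wrong associations.
t_true = np.array([2.0, -1.0])
src = rng.uniform(-5, 5, (20, 2))
dst = src + t_true
dst[:5] = rng.uniform(-5, 5, (5, 2))      # 5 corrupted matches (outliers)

def ransac_translation(src, dst, iters=100, tol=0.1):
    """1-point RANSAC for a translation-only model (a toy stand-in for
    the rigid/similarity models used in real geometric verification)."""
    best_t, best_inliers = None, 0
    for _ in range(iters):
        i = rng.integers(len(src))
        t = dst[i] - src[i]               # hypothesis from one match
        inliers = np.linalg.norm(dst - src - t, axis=1) < tol
        if inliers.sum() > best_inliers:
            best_t, best_inliers = t, int(inliers.sum())
    return best_t, best_inliers

t, n = ransac_translation(src, dst)
accept = n >= 10      # validate the loop closure by its inlier count
```

The inlier-count threshold plays the role of the residual check described above: a perceptually aliased loop closure typically yields few geometrically consistent matches and is rejected.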
Despite the progress made to robustify loop closure detection at the front-end, in the presence of perceptual aliasing it is unavoidable that wrong loop closures are fed to the back-end. Wrong loop closures can severely corrupt the quality
of the MAP estimate [268]. In order to deal with this issue,
a recent line of research [2, 49, 167, 214, 267] proposes
techniques to make the SLAM back-end resilient against
spurious measurements. These methods reason on the validity
of loop closure constraints by looking at the residual error
induced by the constraints during optimization. Other methods,
instead, attempt to detect outliers a priori, that is, before any
optimization takes place, by identifying incorrect loop closures
that are not supported by the odometry [215, 243].
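A minimal illustration of this residual-based reasoning (our toy example, not any of the cited methods): on a 1D "trajectory" with one spurious loop closure, a quadratic loss lets the outlier drag the estimate, while a robust (Cauchy) loss largely suppresses it:

```python
import numpy as np
from scipy.optimize import least_squares

steps = np.ones(4)        # odometry: each step moves +1 along a line
bad_loop = (0, 4, 0.0)    # spurious loop closure: claims x4 - x0 = 0

def residuals(x):
    i, j, z = bad_loop
    r = [x[0]]                                          # prior: x0 = 0
    r += [x[k+1] - x[k] - steps[k] for k in range(4)]   # odometry factors
    r += [x[j] - x[i] - z]                              # wrong loop closure
    return np.array(r)

# Initialize from odometry dead-reckoning, as a SLAM front-end would.
x_init = np.concatenate([[0.0], np.cumsum(steps)])
naive = least_squares(residuals, x_init).x                       # quadratic loss
robust = least_squares(residuals, x_init, loss='cauchy',
                       f_scale=0.1).x                            # robust loss

print(naive[-1], robust[-1])   # the outlier drags the quadratic estimate
```

Note that the robust fit only succeeds from a reasonable initial guess: started far from the truth, the nonconvex robust cost can settle in a local minimum that satisfies the outlier instead, which is precisely the fragility discussed under "Failsafe SLAM" below.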
In dynamic environments, the challenge is twofold. First,
the SLAM system has to detect, discard, or track changes.
While mainstream approaches attempt to discard the dynamic
portion of the scene [199], some works include dynamic
elements as part of the model [241, 282, 283]. The second challenge is that the SLAM system has to model permanent or semi-permanent changes, and understand
how and when to update the map accordingly. Current SLAM
systems that deal with dynamics either maintain multiple
(time-dependent) maps of the same location [79, 159], or have
a single representation parameterized by some time-varying
parameter [161, 247].
Open Problems. In this section we review open problems
and novel research questions arising in life-long SLAM.
Failsafe SLAM and recovery: Despite the progress made on
the SLAM back-end, current SLAM solvers are still vulnerable in the presence of outliers. This is mainly due to the fact that
virtually all robust SLAM techniques are based on iterative
optimization of nonconvex costs. This has two consequences:
first, the outlier rejection outcome depends on the quality of
the initial guess fed to the optimization; second, the system is
inherently fragile: the inclusion of a single outlier degrades
the quality of the estimate, which in turn degrades the
capability of discerning outliers later on. An ideal SLAM
solution should be fail-safe and failure-aware, i.e., the system
needs to be aware of imminent failure (e.g., due to outliers
or degeneracies) and provide recovery mechanisms that can
re-establish proper operation. None of the existing SLAM
approaches provides these capabilities.
Robustness to HW failure: While addressing hardware fail-
ures might appear outside the scope of SLAM, these failures
impact the SLAM system, and the latter can play a key role
in detecting and mitigating sensor and locomotion failures.
If the accuracy of a sensor degrades due to malfunctioning
or aging, the quality of the sensor measurements (e.g., noise,
bias) does not match the noise model used in the back-end
(c.f. eq. (3)), leading to poor estimates. This naturally poses
different research questions: how can we detect degraded sensor operation? How can we adjust the sensor noise statistics (covariances, biases) accordingly?
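As a toy sketch of the first question (ours, not an established method), degraded sensor operation can be flagged by monitoring the empirical standard deviation of measurement residuals against the noise model assumed in the back-end:

```python
import numpy as np

rng = np.random.default_rng(2)

# A sensor whose noise grows over time (e.g., aging); the back-end's
# noise model assumed a fixed sigma = 0.1.
true_sigma = np.concatenate([np.full(200, 0.1), np.full(200, 0.5)])
residuals = rng.standard_normal(400) * true_sigma  # measurement - prediction

def online_sigma(res, window=50):
    """Sliding-window estimate of the measurement noise std from the
    stream of residuals."""
    return np.array([res[max(0, i - window):i + 1].std()
                     for i in range(len(res))])

sigma_hat = online_sigma(residuals)
degraded = sigma_hat > 3 * 0.1   # flag operation outside the nominal model
print(degraded[150], degraded[399])
```

Beyond flagging the failure, the same windowed estimate could be fed back into the factor information matrices, adapting the back-end to the sensor's actual noise level.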
Metric Relocalization: While appearance-based, as opposed
to feature-based, methods are able to close loops between a
day and night sequences [64, 192], or between different seasons [202], the resulting loop closure is topological in
nature. For metric relocalization (i.e., estimating the relative
pose with respect to the previously built map), feature-based
approaches are still the norm [108] and cannot be extended
to such circumstances. While vision has become the sensor of
choice for many applications and loop closing has become a
sensor signature matching problem [71, 72, 108, 273], other
sensors as well as information inherent to the SLAM problem
might be exploited. For instance, Brubaker et al. [32] use
trajectory matching to address the shortcomings of cameras.
Additionally, mapping with one sensor modality and localizing
in the same map with a different sensor modality can be a
useful addition. The work of Wolcott et al. [297] and Forster et
al. [99] is a step in this direction. Another problem is localizing
from significantly different viewpoints with heterogeneous
sensing. Forster et al. [99] study vision-based localization in
a lidar-based map. Majdik et al. [182] study how to localize
a drone in a cadastral 3D model textured from Google Street
View images, Behzadian et al. [20] show how to localize in
hand-drawn maps using a laser scanner, and Winterhalter et
al. [296] carry out RGB-D localization given 2D floor plans.
Time-varying and deformable maps: Mainstream SLAM methods have been developed with the rigid and static world assumption in mind; however, the real world is non-rigid, both because of dynamics and because of the inherent deformability of objects. An ideal SLAM solution should be able to reason about dynamics in the environment, including non-rigidity, work over long time periods generating "all terrain" maps, and be able to do so in real time. In the computer vision community, there have been several attempts since the 80s to recover shape from non-rigid objects, but with restricted applicability: for example, Pentland et al. [222, 223] require prior knowledge of some mechanical properties of the objects, while Bregler et al. in [31] and [274] require a restricted deformation of the object and show examples on human faces. Recent results in non-rigid SfM such as [5, 6, 114] are less restrictive
but still work in small scenarios. In the SLAM community, Newcombe et al. [204] have addressed the non-rigid case for small-scale reconstruction. However, the problem of non-rigid maps at a large scale is still largely unexplored.

Automatic parameter tuning: SLAM systems (in particular, the data association modules) require extensive parameter tuning in order to work correctly for a given scenario. These parameters include thresholds that control feature matching, RANSAC parameters, and criteria to decide when to add new factors to the graph or when to trigger a loop closing algorithm to search for matches. If SLAM has to work "out of the box" in arbitrary scenarios, methods for automatic tuning of the involved parameters need to be considered.

IV. LONG-TERM AUTONOMY II: SCALABILITY

While modern SLAM algorithms have been successfully demonstrated in indoor building-scale environments (Fig. 4, top), in many application endeavors robots must operate for an extended period of time over larger areas (Fig. 4, bottom). These applications include ocean exploration for environmental monitoring, 24/7 cleaning robots in our cities, or large-scale precision agriculture. For such applications the size of the factor graph underlying SLAM can grow unbounded, due to the continuous exploration of new places and the increasing time of operation. In practice, the computational time and memory footprint are bounded by the resources of the robot. Therefore, it is important to design SLAM methods whose computational and memory complexity remains bounded.

Fig. 4: Mapping large-scale real-world environments poses formidable challenges to the existing SLAM algorithms. Perceptual aliasing and dynamics of the environment (moving objects, appearance and geometry changes across seasons) call for algorithms that are inherently robust to these phenomena, as well as for failure detection and recovery techniques. The extended scale of the environment calls for estimation algorithms that can be easily parallelized, as well as better map representations that store information in a compact and actionable way.

In the worst case, successive linearization methods based on direct linear solvers imply a memory consumption which grows quadratically in the number of variables in (6). When using iterative linear solvers (e.g., the conjugate gradient [82]) the memory consumption grows linearly in the number of variables. The situation is further complicated by the fact that, when re-visiting a place multiple times, factor graph optimization becomes less efficient as nodes and edges are continuously added to the same spatial region, compromising the sparsity structure of the graph.

In this section we review some of the current approaches to control, or at least reduce, the growth of the size of the problem and discuss open challenges.

Brief Survey. We focus on two ways to reduce the complexity of factor graph optimization: (i) sparsification methods, which trade off information loss for memory and computational efficiency, and (ii) out-of-core and multi-robot methods, which split the computation among many robots/processors.

Node and edge sparsification. This family of methods addresses scalability by reducing the number of nodes added to the graph, or by pruning less "informative" nodes and factors. Ila et al. [135] use an information-theoretic approach to add only non-redundant nodes and highly-informative measurements to the graph. Johannsson et al. [141], when possible, avoid adding new nodes to the graph by inducing new constraints between existing nodes, such that the number of variables grows only with the size of the explored space and not with the mapping duration. Kretzschmar et al. [162] propose an information-based criterion for determining which nodes to marginalize in pose graph optimization. Carlevaris-Bianco and Eustice [37], and Mazuran et al. [186] introduce the Generic Linear Constraint (GLC) factors and the Nonlinear Graph Sparsification (NGS), respectively. These methods operate on the Markov blanket of a marginalized node and compute a sparse approximation of the blanket. Huang et al. [129] sparsify the Hessian matrix by solving an ℓ1-regularized minimization problem.

Out-of-core (parallel) SLAM: Parallel out-of-core algorithms for SLAM split the computation (and memory) load of factor graph optimization among multiple processors. The key idea is to divide the factor graph into different subgraphs and optimize the overall graph by alternating local optimization of each subgraph with a global refinement. The corresponding approaches are often referred to as submapping algorithms. Ni et al. [205] present an exact submapping approach for factor graph optimization, and propose to cache the factorization of the submaps to speed up computation. Ni and Dellaert [206] organize submaps in a binary tree structure and use nested dissection to minimize the dependence between two subtrees. Zhao et al. [304] present an approximation strategy for large-scale SLAM by solving a sequence of submaps and merging them in a divide-and-conquer manner using linear least squares. Frese et al. [104] propose a multi-level relaxation approach. Grisetti et al. [116] propose a hierarchy of submaps: whenever an observation is acquired, the highest level of the hierarchy