I Introduction
II Anatomy of a Modern SLAM System
III Long-term autonomy I: Robustness
IV Long-term autonomy II: Scalability
V Representation I: Metric Reasoning
VI Representation II: Semantic Reasoning
VII New theoretical tools for SLAM
VIII Active SLAM
IX New Sensors and Other Frontiers
IX-A New and Unconventional Sensors for SLAM
IX-B New Frontiers: Deep Learning
X Conclusion
Past, Present, and Future of Simultaneous Localization And Mapping: Towards the Robust-Perception Age

Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, José Neira, Ian D. Reid, John J. Leonard

arXiv:1606.05830v2 [cs.RO] 20 Jul 2016

C. Cadena is with the Autonomous Systems Lab, ETH Zurich, Switzerland. e-mail: cesarc@ethz.ch
L. Carlone is with the Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, USA. e-mail: lcarlone@mit.edu
H. Carrillo is with the Escuela de Ciencias Exactas e Ingeniería, Universidad Sergio Arboleda, Colombia. e-mail: henry.carrillo@usa.edu.co
Y. Latif and I.D. Reid are with the Australian Center for Robotic Vision, University of Adelaide, Australia. e-mail: yasir.latif@adelaide.edu.au, ian.reid@adelaide.edu.au
J. Neira is with the Departamento de Informática e Ingeniería de Sistemas, Universidad de Zaragoza, Spain. e-mail: jneira@unizar.es
D. Scaramuzza is with the Robotics and Perception Group, University of Zurich, Switzerland. e-mail: sdavide@ifi.uzh.ch
J.J. Leonard is with the Marine Robotics Group, Massachusetts Institute of Technology, USA. e-mail: jleonard@mit.edu

Abstract—Simultaneous Localization and Mapping (SLAM) consists in the concurrent construction of a model of the environment (the map), and the estimation of the state of the robot moving within it. The SLAM community has made astonishing progress over the last 30 years, enabling large-scale real-world applications, and witnessing a steady transition of this technology to industry. We survey the current state of SLAM. We start by presenting what is now the de-facto standard formulation for SLAM. We then review related work, covering a broad set of topics including robustness and scalability in long-term mapping, metric and semantic representations for mapping, theoretical performance guarantees, active SLAM and exploration, and other new frontiers. This paper simultaneously serves as a position paper and tutorial to those who are users of SLAM. By looking at the published research with a critical eye, we delineate open challenges and new research issues that still deserve careful scientific investigation. The paper also contains the authors' take on two questions that often animate discussions during robotics conferences: Do robots need SLAM? and Is SLAM solved?

Index Terms—Robots, SLAM, localization, mapping.

I. INTRODUCTION

SLAM consists in the simultaneous estimation of the state of a robot equipped with on-board sensors, and the construction of a model (the map) of the environment that the sensors are perceiving. In simple instances, the robot state is described by its pose (position and orientation), although other quantities may be included in the state, such as robot velocity, sensor biases, and calibration parameters. The map, on the other hand, is a representation of aspects of interest (e.g., position of landmarks, obstacles) describing the environment in which the robot operates.

The need to build a map of the environment is twofold. First, the map is often required to support other tasks; for instance, a map can inform path planning or provide an intuitive visualization for a human operator. Second, the map allows limiting the error committed in estimating the state of the robot. In the absence of a map, dead-reckoning would quickly drift over time; on the other hand, given a map, the robot can "reset" its localization error by re-visiting known areas (the so-called loop closure).
Therefore, SLAM finds applications in all scenarios in which a prior map is not available and needs to be built. In some robotics applications a map of the environment is known a priori. For instance, a robot operating on a factory floor can be provided with a manually-built map of artificial beacons in the environment. Another example is the case in which the robot has access to GPS data (the GPS satellites can be considered as moving beacons at known locations). The popularity of the SLAM problem is connected with the emergence of indoor applications of mobile robotics. Indoor operation rules out the use of GPS to bound the localization error; furthermore, SLAM provides an appealing alternative to user-built maps, showing that robot operation is possible in the absence of an ad-hoc localization infrastructure.

A thorough historical review of the first 20 years of the SLAM problem is given by Durrant-Whyte and Bailey in two surveys [14, 94]. These mainly cover what we call the classical age (1986-2004); the classical age saw the introduction of the main probabilistic formulations for SLAM, including approaches based on Extended Kalman Filters, Rao-Blackwellised Particle Filters, and maximum likelihood estimation; moreover, it delineated the basic challenges connected to efficiency and robust data association. Two other excellent references describing the three main SLAM formulations of the classical age are the chapter of Thrun and Leonard [299, chapter 37] and the book of Thrun, Burgard, and Fox [298]. The subsequent period is what we call the algorithmic-analysis age (2004-2015), and is partially covered by Dissanayake et al. in [89]. The algorithmic-analysis period saw the study of fundamental properties of SLAM, including observability, convergence, and consistency. In this period, the key role of sparsity towards efficient SLAM solvers was also understood, and the main open-source SLAM libraries were developed. We review the main SLAM surveys to date in Table I, observing that most recent surveys only cover specific aspects or sub-fields of SLAM.

The popularity of SLAM in the last 30 years is not surprising if one thinks about the manifold aspects that SLAM involves. At the lower level (called the front-end in Section II) SLAM naturally intersects other research fields
such as computer vision and signal processing; at the higher level (that we later call the back-end), SLAM is an appealing mix of geometry, graph theory, optimization, and probabilistic estimation. Finally, a SLAM expert has to deal with practical aspects ranging from sensor modeling to system integration.

TABLE I: Surveying the surveys

Year | Topic                                         | Reference
2006 | Probabilistic and data association approaches | Durrant-Whyte and Bailey [14, 94]
2008 | Filtering approaches                          | Aulinas et al. [12]
2008 | Visual SLAM                                   | Neira et al. (special issue) [220]
2011 | SLAM back-end                                 | Grisetti et al. [129]
2011 | Observability, consistency and convergence    | Dissanayake et al. [89]
2012 | Visual odometry                               | Scaramuzza and Fraundorfer [115, 274]
2016 | Multi robot SLAM                              | Saeedi et al. [271]
2016 | Visual place recognition                      | Lowry et al. [198]

The present paper gives a broad overview of the current state of SLAM, and offers the perspective of part of the community on the open problems and future directions for SLAM research. The paper summarizes the outcome of the workshop "The Problem of Mobile Sensors: Setting future goals and indicators of progress for SLAM" [40], held during the Robotics: Science and Systems (RSS) conference (Rome, July 2015). Our main focus is on metric and semantic SLAM, and we refer the reader to the recent survey of Lowry et al. [198] for a more comprehensive coverage of vision-based topological SLAM and place recognition.

Before delving into the paper, we provide our take on two questions that often animate discussions during robotics conferences.

Do autonomous robots really need SLAM? Answering this question requires understanding what makes SLAM unique. SLAM aims at building a globally consistent representation of the environment, leveraging both ego-motion measurements and loop closures. The keyword here is "loop closure": if we sacrifice loop closures, SLAM reduces to odometry. In early applications, odometry was obtained by integrating wheel encoders. The pose estimate obtained from wheel odometry quickly drifts, making the estimate unusable after a few meters [162, Ch. 6]; this was one of the main thrusts behind the development of SLAM: the observation of external landmarks is useful to reduce the trajectory drift and possibly correct it [227]. However, more recent odometry algorithms are based on visual and inertial information, and have very small drift (< 0.5% of the trajectory length [109]). Hence the question becomes legitimate: do we really need SLAM? Our answer is three-fold.

First of all, we observe that the SLAM research done over the last decade was the one producing the visual-inertial odometry algorithms that currently represent the state of the art [109, 201]; in this sense visual-inertial navigation (VIN) is SLAM: VIN can be considered a reduced SLAM system, in which the loop closure (or place recognition) module is disabled. More generally, SLAM led to the study of sensor fusion under more challenging setups (i.e., no GPS, low quality sensors) than previously considered in other literature (e.g., inertial navigation in aerospace engineering).

The second answer regards loop closures. A robot performing odometry and neglecting loop closures interprets the world as an "infinite corridor" (Fig. 1a) in which the robot keeps exploring new areas indefinitely. A loop closure event informs the robot that this "corridor" keeps intersecting itself (Fig. 1b). The advantage of loop closure now becomes clear: by finding loop closures, the robot understands the real topology of the environment, and is able to find shortcuts between locations (e.g., points B and C in the map). Therefore, if getting the right topology of the environment is one of the merits of SLAM, why not simply drop the metric information and just do place recognition? The answer is simple: the metric
information makes place recognition much simpler and more robust; the metric reconstruction informs the robot about loop closure opportunities and allows discarding spurious loop closures [187, 295]. Therefore, while SLAM might be redundant in principle (an oracle place recognition module would suffice for topological mapping), SLAM offers a natural defense against wrong data association and perceptual aliasing, where similarly looking scenes, corresponding to distinct locations in the environment, would deceive place recognition. In this sense, the SLAM map provides a way to predict and validate future measurements: we believe that this mechanism is key to robust operation.

The third answer is that SLAM is needed since many applications implicitly or explicitly do require a globally consistent map. For instance, in many military and civil applications the goal of the robot is to explore an environment and report a map to the human operator. Another example is the case in which the robot has to perform structural inspection (of a building, bridge, etc.); also in this case a globally consistent 3D reconstruction is a requirement for successful operation.

One can identify tasks for which different flavors of SLAM formulations are more suitable than others. Therefore, when a roboticist has to design a SLAM system, he/she is faced with multiple design choices. For instance, a topological map can be used to analyze reachability of a given place, but it is not suitable for motion planning and low-level control; a locally-consistent metric map is well-suited for obstacle avoidance and local interactions with the environment, but it may sacrifice accuracy; a globally-consistent metric map allows the robot to perform global path planning, but it may be computationally demanding to compute and maintain. A more general way to choose the most appropriate SLAM system is to think about SLAM as a mechanism to compute a sufficient statistic that summarizes all past observations of the robot, and in this sense, which information to retain in this compressed representation is deeply task-dependent.

Is SLAM solved? This is another question that is often asked within the robotics community, cf. [118]. The difficulty of answering this question lies in the question itself: SLAM is such a broad topic that the question is well posed only for a given robot/environment/performance combination. In
particular, one can evaluate the maturity of the SLAM problem once the following aspects are specified:

• robot: type of motion (e.g., dynamics, maximum speed), available sensors (e.g., resolution, sampling rate), available computational resources;

• environment: planar or three-dimensional, presence of natural or artificial landmarks, amount of dynamic elements, amount of symmetry and risk of perceptual aliasing. Note that many of these aspects actually depend on the sensor-environment pair: for instance, two rooms may look identical for a 2D laser scanner (perceptual aliasing), while a camera may discern them from appearance cues;

• performance requirements: desired accuracy in the estimation of the state of the robot, accuracy and type of representation of the environment (e.g., landmark-based or dense), success rate (percentage of tests in which the accuracy bounds are met), estimation latency, maximum operation time, maximum size of the mapped area.

Fig. 1: Left: map built from odometry. The map is homotopic to a long corridor that goes from the starting position A to the final position B. Points that are close in reality (e.g., B and C) may be arbitrarily far in the odometric map. Right: map built from SLAM. By leveraging loop closures, SLAM estimates the actual topology of the environment, and "discovers" shortcuts in the map.

For instance, mapping a 2D indoor environment with a robot equipped with wheel encoders and a laser scanner, with sufficient accuracy (< 10 cm) and sufficient robustness (say, low failure rate), can be considered largely solved (an example of an industrial system performing SLAM is the Kuka Navigation Solution [182]). Similarly, vision-based SLAM with slowly-moving robots (e.g., Mars rovers [203], domestic robots [148]), and visual-inertial odometry [126] can be considered mature research fields. On the other hand, other robot/environment/performance combinations still deserve a large amount of fundamental research. Current SLAM algorithms can be easily induced to fail when either the motion of the robot or the environment are too challenging (e.g., fast robot dynamics, highly dynamic environments); similarly, SLAM algorithms are often unable to face strict performance requirements, e.g., high rate estimation for fast closed-loop control. This survey will provide a comprehensive overview of these open problems, among others.

We conclude this section with some broader considerations about the future of SLAM. We argue that we are entering a third era for SLAM, the robust-perception age, which is characterized by the following key requirements:

1) robust performance: the SLAM system operates with low failure rate for an extended period of time in a broad set of environments; the system includes fail-safe mechanisms and has self-tuning capabilities¹ in that it can adapt the selection of the system parameters to the scenario;

2) high-level understanding: the SLAM system goes beyond basic geometry reconstruction to obtain a high-level understanding of the environment (e.g., semantics, affordances, high-level geometry, physics);

3) resource awareness: the SLAM system is tailored to the available sensing and computational resources, and provides means to adjust the computation load depending on the available resources;

4) task-driven inference: the SLAM system produces adaptive map representations, whose complexity can change depending on the task that the robot has to perform.

¹The SLAM community has been largely affected by the "curse of manual tuning", in that satisfactory operation is enabled by expert tuning of the system parameters (e.g., stopping conditions, thresholds for outlier rejection).

Paper organization. The paper starts by presenting a standard formulation and architecture for SLAM (Section II). Section III tackles robustness in life-long SLAM. Section IV deals with scalability. Section V discusses how to represent the geometry of the environment. Section VI extends the question of the environment representation to the modeling of semantic information. Section VII provides an overview of the current accomplishments on the theoretical aspects of SLAM. Section VIII broadens the discussion and reviews the active SLAM problem in which decision making is used to improve the quality of the SLAM results. Section IX provides an overview of recent trends in SLAM, including the use of unconventional sensors. Section X provides final remarks.

Throughout the paper, we provide many pointers to related work outside the robotics community. Despite its unique traits, SLAM is related to problems addressed in computer vision, computer graphics, and control theory, and cross-fertilization among these fields is a necessary condition to enable fast progress.

For the non-expert reader, we recommend reading Durrant-Whyte and Bailey's SLAM tutorials [14, 94] before delving into this position paper. The more experienced researchers can jump directly to the section of interest, where they will find a self-contained overview of the state of the art and open problems.

II. ANATOMY OF A MODERN SLAM SYSTEM

Fig. 2: Front-end and back-end in a typical SLAM system.

The architecture of a SLAM system includes two main components: the front-end and the back-end. The front-end abstracts sensor data into models that are amenable for
estimation, while the back-end performs inference on the abstracted data produced by the front-end. This architecture is summarized in Fig. 2. We review both components, starting from the back-end.

Maximum-a-posteriori (MAP) estimation and the SLAM back-end. The current de-facto standard formulation of SLAM has its origins in the seminal paper of Lu and Milios [199], followed by the work of Gutmann and Konolige [133]. Since then, numerous approaches have improved the efficiency and robustness of the optimization underlying the problem [88, 108, 132, 159, 160, 235, 301]. All these approaches formulate SLAM as a maximum-a-posteriori estimation problem, and often use the formalism of factor graphs [180] to reason about the interdependence among variables.

Assume that we want to estimate an unknown variable $\mathcal{X}$; as mentioned before, in SLAM the variable $\mathcal{X}$ typically includes the trajectory of the robot (as a discrete set of poses) and the position of landmarks in the environment. We are given a set of measurements $\mathcal{Z} = \{z_k : k = 1, \ldots, m\}$ such that each measurement can be expressed as a function of $\mathcal{X}$, i.e., $z_k = h_k(\mathcal{X}_k) + \epsilon_k$, where $\mathcal{X}_k \subseteq \mathcal{X}$ is a subset of the variables, $h_k(\cdot)$ is a known function (the measurement or observation model), and $\epsilon_k$ is random measurement noise.

In MAP estimation, we estimate $\mathcal{X}$ by computing the assignment of variables $\mathcal{X}^\star$ that attains the maximum of the posterior $p(\mathcal{X}|\mathcal{Z})$ (the belief over $\mathcal{X}$ given the measurements):

$$\mathcal{X}^\star \doteq \arg\max_{\mathcal{X}} \; p(\mathcal{X}|\mathcal{Z}) = \arg\max_{\mathcal{X}} \; p(\mathcal{Z}|\mathcal{X})\, p(\mathcal{X}) \quad (1)$$

where the equality follows from the Bayes theorem. In (1), $p(\mathcal{Z}|\mathcal{X})$ is the likelihood of the measurements $\mathcal{Z}$ given the assignment $\mathcal{X}$, and $p(\mathcal{X})$ is a prior probability over $\mathcal{X}$. The prior probability includes any prior knowledge about $\mathcal{X}$; in case no prior knowledge is available, $p(\mathcal{X})$ becomes a constant (uniform distribution) which is inconsequential and can be dropped from the optimization. In that case MAP estimation reduces to maximum likelihood estimation.

Fig. 3: SLAM as a factor graph: blue circles denote robot poses at consecutive time steps ($x_1, x_2, \ldots$), green circles denote landmark positions ($l_1, l_2, \ldots$), the red circle denotes the variable associated with the intrinsic calibration parameters ($K$). Factors are shown as black dots: the label "u" marks factors corresponding to odometry constraints, "v" marks factors corresponding to camera observations, "c" denotes loop closures, and "p" denotes prior factors.

Assuming that the measurements $\mathcal{Z}$ are independent (i.e., the noise terms affecting the measurements are not correlated), problem (1) factorizes into:

$$\mathcal{X}^\star = \arg\max_{\mathcal{X}} \; p(\mathcal{X}) \prod_{k=1}^{m} p(z_k|\mathcal{X}) = \arg\max_{\mathcal{X}} \; p(\mathcal{X}) \prod_{k=1}^{m} p(z_k|\mathcal{X}_k) \quad (2)$$

where, on the right-hand side, we noticed that $z_k$ only depends on the subset of variables in $\mathcal{X}_k$.

Problem (2) can be interpreted in terms of inference over a factor graph [180]. The variables correspond to nodes in the factor graph. The terms $p(z_k|\mathcal{X}_k)$ and the prior $p(\mathcal{X})$ are called factors, and they encode probabilistic constraints over a subset of nodes.
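To make the bookkeeping behind the factorization (2) concrete, here is a minimal sketch, in plain Python, of a factor as a container pairing a measurement model with the subset of variables it constrains. The class names and the toy 1D odometry model are our own illustrative choices, not part of any SLAM library.

```python
import numpy as np

# A minimal factor container: each factor stores the indices of the subset
# X_k of variables it constrains, a measurement z_k, a measurement model
# h_k, and an information matrix Omega_k. Names are illustrative only.
class Factor:
    def __init__(self, var_ids, z, h, omega):
        self.var_ids = var_ids   # indices of the subset X_k of X
        self.z = z               # measurement z_k
        self.h = h               # measurement model h_k(X_k)
        self.omega = omega       # information matrix Omega_k (inverse covariance)

    def residual(self, variables):
        x_k = [variables[i] for i in self.var_ids]
        return self.z - self.h(x_k)

def neg_log_posterior(factors, variables):
    """Sum of weighted squared residuals, i.e., the cost that appears in (5)."""
    cost = 0.0
    for f in factors:
        r = f.residual(variables)
        cost += 0.5 * r @ f.omega @ r
    return cost

# Toy 1D example: two scalar poses and one odometry factor measuring their
# difference; the factor touches only the subset {x_0, x_1} of the state.
variables = [np.array([0.0]), np.array([0.9])]
odom = Factor([0, 1], np.array([1.0]), lambda xs: xs[1] - xs[0], np.eye(1))
print(neg_log_posterior([odom], variables))  # 0.5 * (1.0 - 0.9)^2 = 0.005
```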
A factor graph is a graphical model that encodes the dependence between the $k$-th factor (and its measurement $z_k$) and the corresponding variables $\mathcal{X}_k$. A first advantage of the factor graph interpretation is that it enables an insightful visualization of the problem. Fig. 3 shows an example of a factor graph underlying a simple SLAM problem (more details in Example 1 below). The figure shows the variables, namely, the robot poses, the landmark positions, and the camera calibration parameters, and the factors imposing constraints among these variables. A second advantage is generality: a factor graph can model complex inference problems with heterogeneous variables and factors, and arbitrary interconnections; the connectivity of the factor graph in turn influences the sparsity of the resulting SLAM problem, as discussed below.

In order to write (2) in a more explicit form, assume that the measurement noise $\epsilon_k$ is a zero-mean Gaussian noise with information matrix $\Omega_k$ (the inverse of the covariance matrix). Then, the measurement likelihood in (2) becomes:

$$p(z_k|\mathcal{X}_k) \propto \exp\left( -\tfrac{1}{2} \| h_k(\mathcal{X}_k) - z_k \|^2_{\Omega_k} \right) \quad (3)$$

where we use the notation $\|e\|^2_{\Omega} = e^{\mathsf{T}} \Omega\, e$. Similarly, assume that the prior can be written as:

$$p(\mathcal{X}) \propto \exp\left( -\tfrac{1}{2} \| h_0(\mathcal{X}) - z_0 \|^2_{\Omega_0} \right) \quad (4)$$

for some given function $h_0(\cdot)$, prior mean $z_0$, and information matrix $\Omega_0$. Since maximizing the posterior is the same as minimizing the negative log-posterior, the MAP estimate in (2) becomes:

$$\mathcal{X}^\star = \arg\min_{\mathcal{X}} \; -\log\left( p(\mathcal{X}) \prod_{k=1}^{m} p(z_k|\mathcal{X}_k) \right) = \arg\min_{\mathcal{X}} \; \sum_{k=0}^{m} \| h_k(\mathcal{X}_k) - z_k \|^2_{\Omega_k} \quad (5)$$

which is a nonlinear least squares problem since, in most problems of interest in robotics, $h_k(\cdot)$ is a nonlinear function. Note that the formulation (5) follows from the assumption of Normally distributed noise. Other assumptions for the noise distribution lead to different cost functions; for instance, if the noise follows a Laplace distribution, the squared 2-norm in (5) is replaced by the 1-norm. To increase resilience to outliers, it is also common to substitute the squared 2-norm in (5) with robust loss functions (e.g., the Huber or Tukey loss) [146].
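As a minimal numeric illustration of such robust costs, the sketch below evaluates the Huber loss on the whitened residual norms appearing in (5); the transition threshold delta is an assumed, illustrative value (in practice it is tuned or derived from the noise statistics).

```python
import numpy as np

def huber_rho(m, delta=1.0):
    """Huber loss on a whitened residual norm m = ||h_k(X_k) - z_k||_Omega:
    quadratic for small residuals, linear for large ones, so a single gross
    outlier cannot dominate the cost. delta is an illustrative value."""
    if m <= delta:
        return 0.5 * m**2
    return delta * (m - 0.5 * delta)

def robust_cost(residuals, omegas, delta=1.0):
    """Sum of Huber-robustified terms replacing the squared norms in (5)."""
    total = 0.0
    for r, omega in zip(residuals, omegas):
        m = np.sqrt(r @ omega @ r)  # whitened residual norm
        total += huber_rho(m, delta)
    return total

# One inlier and one gross outlier: with the squared 2-norm the outlier
# would contribute 12.5; under the Huber loss it contributes only 4.5.
residuals = [np.array([0.1]), np.array([5.0])]
omegas = [np.eye(1), np.eye(1)]
print(robust_cost(residuals, omegas))
```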
The computer vision expert may notice a resemblance between problem (5) and bundle adjustment (BA) in Structure from Motion [305]; both (5) and BA indeed stem from a maximum-a-posteriori formulation. However, two key features make SLAM unique. First, the factors in (5) are not constrained to model projective geometry as in BA, but include a broad variety of sensor models, e.g., inertial sensors, wheel encoders, GPS, to mention a few. For instance, in laser-based mapping, the factors usually constrain relative poses corresponding to different viewpoints, while in direct methods for visual SLAM, the factors penalize differences in pixel intensities across different views of the same portion of the scene. The second difference with respect to BA is that problem (5) needs to be solved incrementally: new measurements are made available at each time step as the robot moves.

The minimization problem (5) is commonly solved via successive linearizations, e.g., the Gauss-Newton (GN) or the Levenberg-Marquardt methods (more modern approaches, based on convex relaxations and Lagrangian duality, are reviewed in Section VII). The Gauss-Newton method proceeds iteratively, starting from a given initial guess $\hat{\mathcal{X}}$. At each iteration, GN approximates the minimization (5) as

$$\delta_{\mathcal{X}}^\star = \arg\min_{\delta_{\mathcal{X}}} \; \sum_{k=0}^{m} \| A_k\, \delta_{\mathcal{X}} - b_k \|^2_{\Omega_k} = \arg\min_{\delta_{\mathcal{X}}} \; \| A\, \delta_{\mathcal{X}} - b \|^2_{\Omega} \quad (6)$$

where $\delta_{\mathcal{X}}$ is a small "correction" with respect to the linearization point $\hat{\mathcal{X}}$, $A_k \doteq \frac{\partial h_k(\mathcal{X})}{\partial \mathcal{X}}$ is the Jacobian of the measurement function $h_k(\cdot)$ with respect to $\mathcal{X}$, evaluated at $\hat{\mathcal{X}}$, and $b_k \doteq z_k - h_k(\hat{\mathcal{X}}_k)$ is the residual error at $\hat{\mathcal{X}}$; on the right-hand side of (6), $A$ (resp. $b$) is obtained by stacking the $A_k$ (resp. $b_k$), and $\Omega$ is a block diagonal matrix including the measurement information matrices $\Omega_k$ as diagonal blocks.

The optimal correction $\delta_{\mathcal{X}}^\star$ which minimizes (6) can be computed in closed form as:

$$\delta_{\mathcal{X}}^\star = (A^{\mathsf{T}} \Omega A)^{-1} A^{\mathsf{T}} \Omega\, b \quad (7)$$

and, at each iteration, the linearization point is updated via $\hat{\mathcal{X}} \leftarrow \hat{\mathcal{X}} + \delta_{\mathcal{X}}^\star$. The matrix $A^{\mathsf{T}} \Omega A$ is usually referred to as the (approximate) Hessian. Note that this matrix is indeed only an approximation of the Hessian of the cost function (5); moreover, its invertibility is connected to the observability properties of the underlying estimation problem.

So far we assumed that $\mathcal{X}$ belongs to a vector space (for this reason the sum $\hat{\mathcal{X}} + \delta_{\mathcal{X}}$ is well defined). When $\mathcal{X}$ includes variables belonging to a smooth manifold (e.g., rotations), the structure of the GN method remains unaltered, but the usual sum (e.g., $\hat{\mathcal{X}} + \delta_{\mathcal{X}}$) is substituted with a suitable mapping, called a retraction [1]. In the robotics literature, the retraction is often denoted with the operator $\oplus$ and maps the "small correction" $\delta_{\mathcal{X}}$, defined in the tangent space of the manifold at $\hat{\mathcal{X}}$, to an element of the manifold, i.e., the linearization point is updated as $\hat{\mathcal{X}} \leftarrow \hat{\mathcal{X}} \oplus \delta_{\mathcal{X}}$. We also note that the derivation presented so far can be extended to the case in which the noise is not additive, as long as the noise can be written as an explicit function of the state and the measurements.

The key insight behind modern SLAM solvers is that the Jacobian matrix $A$ appearing in (7) is sparse and its sparsity structure is dictated by the topology of the underlying factor graph. This enables the use of fast linear solvers to compute $\delta_{\mathcal{X}}^\star$ [159, 160, 183, 252]. Moreover, it allows designing incremental (or online) solvers, which update the estimate of $\mathcal{X}$ as new observations are acquired [159, 160, 252]. Current SLAM libraries (e.g., GTSAM [86], g2o [183], Ceres [269], iSAM [160], and SLAM++ [252]) are able to solve problems with tens of thousands of variables in a few seconds.
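The sketch below works through a single Gauss-Newton step (6)-(7) on a toy 1D pose chain with one loop closure, using SciPy's sparse machinery to mimic how solvers exploit factor-graph sparsity. The chain, the prior used to pin the gauge, and all values are illustrative; the libraries cited above handle the same structure at far larger scale.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# One Gauss-Newton step (7) for a toy 1D pose chain with two odometry
# factors and one "loop closure" factor. Values are illustrative only.
x_hat = np.array([0.0, 0.8, 2.1])        # linearization point (3 scalar poses)
factors = [                               # (i, j, z): z measures x_j - x_i
    (0, 1, 1.0),                          # odometry x_0 -> x_1
    (1, 2, 1.0),                          # odometry x_1 -> x_2
    (0, 2, 2.0),                          # loop closure x_0 -> x_2
]

rows, cols, vals, b = [], [], [], []
for k, (i, j, z) in enumerate(factors):
    # h(x_i, x_j) = x_j - x_i is linear, so each Jacobian row is constant
    # and touches only the two variables the factor constrains (sparsity).
    rows += [k, k]; cols += [i, j]; vals += [-1.0, 1.0]
    b.append(z - (x_hat[j] - x_hat[i]))   # residual b_k = z_k - h_k(x_hat)

# A prior factor on x_0 pins the global frame (otherwise A^T A is singular).
rows.append(len(factors)); cols.append(0); vals.append(1.0)
b.append(0.0 - x_hat[0])

A = sp.csr_matrix((vals, (rows, cols)), shape=(len(b), 3))
b = np.array(b)
# Normal equations (A^T Omega A) delta = A^T Omega b, with Omega = I here.
delta = spla.spsolve((A.T @ A).tocsr(), A.T @ b)
x_hat += delta                            # update the linearization point
print(x_hat)                              # [0. 1. 2.]: exact, model is linear
```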
The hands-on tutorials [86, 129] provide excellent introductions to two of the most popular SLAM libraries; each library also includes a set of examples showcasing real SLAM problems.

The SLAM formulation described so far is commonly referred to as maximum-a-posteriori estimation, factor graph optimization, graph-SLAM, full smoothing, or smoothing and mapping (SAM). A popular variation of this framework is pose graph optimization, in which the variables to be estimated are poses sampled along the trajectory of the robot, and each factor imposes a constraint on a pair of poses.

MAP estimation has been proven to be more accurate and efficient than original approaches for SLAM based on nonlinear filtering. We refer the reader to the surveys [14, 94] for an overview on filtering approaches, and to [291] for a comparison between filtering and smoothing. However, we remark that some SLAM systems based on EKF have also been demonstrated to attain state-of-the-art performance. Excellent examples of EKF-based SLAM systems include the Multi-State Constraint Kalman Filter of Mourikis and Roumeliotis [214], and the VIN systems of Kottas et al. [176] and Hesch et al. [139]. Not surprisingly, the performance mismatch between filtering and MAP estimation gets smaller when the linearization point for the EKF is accurate (as it happens in visual-inertial navigation problems), when using sliding-window filters, and when potential sources of inconsistency in the EKF are taken care of [139, 143, 176].

As discussed in the next section, MAP estimation is usually performed on a pre-processed version of the sensor data. In this regard, it is often referred to as the SLAM back-end.

Sensor-dependent SLAM front-end. In practical robotics applications, it might be hard to write the sensor measurements directly as an analytic function of the state, as required in MAP estimation. For instance, if the raw sensor data is an image, it might be hard to express the intensity of each pixel as a function of the SLAM state; the same difficulty arises with simpler sensors (e.g., a laser with a single beam). In both cases the issue is connected with the fact that we are not able to design a sufficiently general, yet tractable representation of the environment; even in the presence of such a general representation, it would be hard to write an analytic function that connects the measurements to the parameters of such a representation.

For this reason, before the SLAM back-end, it is common to have a module, the front-end, that extracts relevant features from the sensor data. For instance, in vision-based SLAM, the front-end extracts the pixel location of a few distinguishable points in the environment; pixel observations of these points are now easy to model within the back-end (see Example 1). The front-end is also in charge of associating each measurement to a specific landmark (say, a 3D point) in the environment: this is the so-called data association. More abstractly, the data association module associates each measurement $z_k$ with a subset of unknown variables $\mathcal{X}_k$ such that $z_k = h_k(\mathcal{X}_k) + \epsilon_k$.
Example 1: SLAM with camera and laser. We consider the SLAM problem of Fig. 3, describing a robot equipped with an uncalibrated monocular camera and a 3D laser scanner. The SLAM state at time $t$ is $\mathcal{X} = \{x_1, \ldots, x_t, l_1, \ldots, l_n, K\}$, where $x_i$ is the 3D pose of the robot at time $i$, $l_j$ is the 3D position of the $j$-th landmark, and $K$ is the unknown camera calibration.

The laser front-end feeds laser scans to the iterative-closest-point (ICP) algorithm to compute the relative motion (odometry) between two consecutive laser scans. Using the odometry measurements, the back-end defines odometry factors, denoted with "u" in Fig. 3. Each odometry factor connects consecutive poses, and the corresponding measurement $z^u_i$ is described by the measurement model:

$$z^u_i = h^u_i(x_i, x_{i+1}) \oplus \epsilon^u_i \quad (8)$$

where $h^u_i(\cdot)$ is a function computing the relative pose between $x_i$ and $x_{i+1}$, and $\epsilon^u_i$ is a zero-mean Gaussian noise with information matrix $\Omega^u_i$. In (8), the standard sum is replaced with the retraction $\oplus$, which maps a vector to an element of SE(3), the manifold of 3D poses. Loop closing factors, denoted with "c" in Fig. 3, generalize the odometry factors, in that they measure the relative pose between non-sequential robot poses:

$$z^c_{ij} = h^c_{ij}(x_i, x_j) \oplus \epsilon^c_{ij} \quad (9)$$

where $h^c_{ij}(\cdot)$ computes the relative pose between $x_i$ and $x_j$, and the noise $\epsilon^c_{ij}$ has information matrix $\Omega^c_{ij}$. Note how the laser front-end performs an "abstraction" of the laser data into relative pose measurements in the factor graph.

The visual front-end extracts visual features from each camera image, acquired at time $i$; then, the data association module matches each feature observation to a specific 3D landmark $l_j$. From this information, the back-end defines vision factors, denoted with "v" in Fig. 3. Each vision factor includes a measurement $z^v_{ij}$ and its measurement model is

$$z^v_{ij} = h^v_{ij}(x_i, l_j, K) + \epsilon^v_{ij} \quad (10)$$

where $z^v_{ij}$ is the pixel measurement of the projection of landmark $l_j$ onto the image plane at time $i$. The function $h^v_{ij}(\cdot)$ is a standard perspective projection.

Finally, the prior factor "p" encodes our prior knowledge of the initial pose of the robot:

$$z^p = x_1 \oplus \epsilon^p \quad (11)$$

where $z^p$ is the prior mean and $\epsilon^p$ has information matrix $\Omega^p$. Since the on-board sensors in our example cannot measure any external reference frame, we conventionally set the prior mean $z^p$ to the identity transformation.(a)

Given all these factors, the SLAM back-end estimates $\mathcal{X}$ by minimizing the following nonlinear least-squares problem:

$$\mathcal{X}^\star = \arg\min_{\mathcal{X}} \; \| z^p \ominus x_1 \|^2_{\Omega^p} + \sum_{i \in O} \| z^u_i \ominus h^u_i(x_i, x_{i+1}) \|^2_{\Omega^u_i} + \sum_{(i,j) \in L} \| z^c_{ij} \ominus h^c_{ij}(x_i, x_j) \|^2_{\Omega^c_{ij}} + \sum_{(i,j) \in P} \| z^v_{ij} - h^v_{ij}(x_i, l_j, K) \|^2_{\Omega^v_{ij}}$$

where $\ominus$ can be roughly understood as a "difference" on the manifold SE(3) [1], and $O$, $L$ and $P$ identify the set of odometry, loop closure, and vision factors, respectively.

(a) The prior makes the MAP estimate unique: in the absence of a global frame, the MAP estimator would admit an infinite number of solutions, which are equal up to a rigid-body transformation.
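As an illustration of how a library encodes such factors, the following sketch assembles a 2D analogue of the prior ("p"), odometry ("u"), and loop-closure ("c") factors of Example 1 using the Python bindings of GTSAM [86]; the vision factors and the calibration variable are omitted for brevity. The Pose2 setting, the keys, and the noise values are our simplifications, and the snippet assumes a recent GTSAM release with Python support.

```python
import numpy as np
import gtsam

# 2D analogue of the prior/odometry/loop-closure factors of Example 1,
# built with GTSAM [86]. Poses live on the manifold SE(2); noise sigmas
# are illustrative, not tuned.
graph = gtsam.NonlinearFactorGraph()
prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.1, 0.1, 0.05]))
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.2, 0.2, 0.10]))

# Prior factor "p": pins pose 1 to the origin (fixes the global frame).
graph.add(gtsam.PriorFactorPose2(1, gtsam.Pose2(0, 0, 0), prior_noise))
# Odometry factors "u": relative pose between consecutive poses.
graph.add(gtsam.BetweenFactorPose2(1, 2, gtsam.Pose2(1, 0, 0), odom_noise))
graph.add(gtsam.BetweenFactorPose2(2, 3, gtsam.Pose2(1, 0, np.pi / 2), odom_noise))
# Loop-closure factor "c": relative pose between non-consecutive poses.
graph.add(gtsam.BetweenFactorPose2(3, 1, gtsam.Pose2(0, 2, -np.pi / 2), odom_noise))

# Initial guess (in practice supplied by the front-end), deliberately perturbed.
initial = gtsam.Values()
initial.insert(1, gtsam.Pose2(0.1, -0.1, 0.05))
initial.insert(2, gtsam.Pose2(1.2, 0.1, -0.1))
initial.insert(3, gtsam.Pose2(2.1, 0.2, 1.4))

# On-manifold Gauss-Newton, i.e., iterating (6)-(7) with retractions.
result = gtsam.GaussNewtonOptimizer(graph, initial).optimize()
print(result)
```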
Finally, the front-end might also provide an initial guess for the variables in the nonlinear optimization (5). For instance, in feature-based monocular SLAM the front-end usually takes care of the landmark initialization, by triangulating the position of the landmark from multiple views.

A pictorial representation of a typical SLAM system is given in Fig. 2. The data association module in the front-end includes a short-term data association block and a long-term one. Short-term data association is responsible for associating corresponding features in consecutive sensor measurements; for instance, short-term data association would track the fact that two pixel measurements in consecutive frames are picturing the same 3D point. On the other hand, long-term data association (or loop closure) is in charge of associating new measurements to older landmarks. We remark that the back-end usually feeds back information to the front-end, e.g., to support loop closure detection and validation.

The pre-processing that happens in the front-end is sensor dependent, since the notion of feature changes depending on the input data stream we consider. In Example 1, we provide an example of SLAM for a robot equipped with a monocular camera and a 3D laser scanner.

III. LONG-TERM AUTONOMY I: ROBUSTNESS

A SLAM system might be fragile in many aspects: failure can be algorithmic² or hardware-related. The former class includes failure modes induced by limitations of the existing SLAM algorithms (e.g., difficulty to handle extremely dynamic or harsh environments). The latter includes failures due to sensor or actuator degradation. Explicitly addressing these failure modes is crucial for long-term operation, where one can no longer take simplified assumptions on the structure of the environment (e.g., mostly static) or fully rely on on-board sensors. In this section we review the main challenges to algorithmic robustness. We then discuss open problems, including robustness against hardware-related failures.

²We omit the (large) class of software-related failures. The non-expert reader must be aware that integration and testing are key aspects of SLAM and robotics in general.

One of the main sources of algorithmic failures is data association. As mentioned in Section II, data association matches each measurement to the portion of the state the measurement refers to. For instance, in feature-based visual SLAM, it associates each visual feature to a specific landmark. Perceptual aliasing, the phenomenon for which different sensory inputs lead to the same sensor signature, makes this problem particularly hard. In the presence of perceptual aliasing, data association establishes wrong measurement-state matches (outliers, or false positives), which in turn result in wrong estimates from the back-end. On the other hand, when data association incorrectly rejects a sensor measurement as spurious (false negatives), fewer measurements are used for estimation, at the expense of estimation accuracy.

The situation is made worse by the presence of unmodeled dynamics in the environment, including both short-term and seasonal changes, which might deceive short-term and long-term data association. A fairly common assumption in current
SLAM approaches is that the world remains unchanged as the robot moves through it (in other words, landmarks are static). This static world assumption holds true in a single mapping run in small scale scenarios, as long as there are no short term dynamics (e.g., people and objects moving around). When mapping over longer time scales and in large environments, change is inevitable. The variations over day and night, seasonal changes such as foliage, or changes in the structure of the environment such as old buildings giving way to new constructions, all affect the performance of a SLAM system. For instance, systems relying on the repeatability of visual features fail under drastic illumination changes between day and night, while methods leveraging the geometry of the environment might fail when old structures disappear.

Another aspect of robustness is that of doing SLAM in harsh environments such as underwater [20, 100, 102, 166]. The challenges in this case are the limited visibility, the constantly changing conditions, and the impossibility of using conventional sensors (e.g., laser range finder).

Brief Survey. Robustness issues connected to wrong data association can be addressed in the front-end and/or in the back-end of a SLAM system. Traditionally, the front-end has been entrusted with establishing correct data association. Short-term data association is the easiest one to tackle: if the sampling rate of the sensor is relatively fast, compared to the dynamics of the robot, tracking features that correspond to the same 3D landmark is easy. For instance, if we want to track a 3D point across consecutive images and assuming that the framerate is sufficiently high, standard approaches based on descriptor matching or optical flow [274] ensure reliable tracking. Intuitively, at high framerate, the viewpoint of the sensor (camera, laser) does not change significantly, hence the features at time t+1 remain close to the ones observed at time t.³

³In hindsight, the fact that short-term data association is much easier and more reliable than the long-term one is what makes (visual, inertial) odometry simpler than SLAM.

Long-term data association in the front-end is more challenging and involves loop closure detection and validation. For loop closure detection at the front-end, a brute-force approach which detects features in the current measurement (e.g., image) and tries to match them against all previously detected features quickly becomes impractical. Bag-of-words [282] avoids this intractability by quantizing the feature space and allowing more efficient search. Bag-of-words can be arranged into hierarchical vocabulary trees [231] that enable efficient lookup in large-scale datasets. Bag-of-words-based techniques such as [79, 80, 121] have shown reliable performance on the task of single session loop closure detection. However, these approaches are not capable of handling severe illumination variations as visual words can no longer be matched. This has led to the development of new methods that explicitly account for such variations by matching sequences [213], gathering different visual appearances into a unified representation [69], or using spatial as well as appearance information [140]. A detailed survey on visual place recognition can be found in Lowry et al. [198]. Feature-based methods have also been used to detect loop closures in laser-based SLAM front-ends; for instance, Tipaldi et al. [302] propose FLIRT features for 2D laser scans.
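For concreteness, the sketch below strings together the two stages discussed here and in the next paragraph: appearance-based candidate detection followed by geometric verification. It uses OpenCV's ORB features and RANSAC-based fundamental-matrix estimation; the linear scan over keyframes stands in for a bag-of-words index [282, 231], and all thresholds are assumed, illustrative values.

```python
import cv2
import numpy as np

# Schematic loop-closure detection plus geometric verification with OpenCV.
# A real system would replace the linear scan with a bag-of-words index.
orb = cv2.ORB_create(nfeatures=1000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def detect_loop_closure(query_img, keyframes, min_matches=50, min_inliers=25):
    """keyframes: list of (keypoints, descriptors) from past images.
    Returns the index of a geometrically verified keyframe, or None."""
    kp_q, des_q = orb.detectAndCompute(query_img, None)
    if des_q is None:
        return None
    for idx, (kp_k, des_k) in enumerate(keyframes):
        if des_k is None:
            continue
        matches = matcher.match(des_q, des_k)
        if len(matches) < min_matches:       # detection: appearance similarity
            continue
        pts_q = np.float32([kp_q[m.queryIdx].pt for m in matches])
        pts_k = np.float32([kp_k[m.trainIdx].pt for m in matches])
        # Validation: epipolar geometry must explain the matches (RANSAC).
        F, mask = cv2.findFundamentalMat(pts_q, pts_k, cv2.FM_RANSAC, 3.0, 0.99)
        if F is not None and mask is not None and mask.sum() >= min_inliers:
            return idx                       # geometrically verified candidate
    return None
```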
Loop closure validation, instead, consists of additional geometric verification steps to ascertain the quality of the loop closure. In vision-based applications, RANSAC is commonly used for geometric verification and outlier rejection, see [274] and the references therein. In laser-based approaches, one can validate a loop closure by checking how well the current laser scan matches the existing map (i.e., how small the residual error resulting from scan matching is).

Despite the progress made to robustify loop closure detection at the front-end, in the presence of perceptual aliasing, it is unavoidable that wrong loop closures are fed to the back-end. Wrong loop closures can severely corrupt the quality of the MAP estimate [295]. In order to deal with this issue, a recent line of research [2, 50, 187, 234, 295] proposes techniques to make the SLAM back-end resilient against spurious measurements. These methods reason on the validity of loop closure constraints by looking at the residual error induced by the constraints during optimization. Other methods, instead, attempt to detect outliers a priori, that is, before any optimization takes place, by identifying incorrect loop closures that are not supported by the odometry [236, 270].

In dynamic environments, the challenge is twofold. First, the SLAM system has to detect, discard, or track changes. While mainstream approaches attempt to discard the dynamic portion of the scene [221], some works include dynamic elements as part of the model [266, 312, 313]. The second challenge regards the fact that the SLAM system has to model permanent or semi-permanent changes, and understand how and when to update the map accordingly. Current SLAM systems that deal with dynamics either maintain multiple (time-dependent) maps of the same location [85, 175], or have a single representation parameterized by some time-varying parameter [177, 273].

Open Problems. In this section we review open problems and novel research questions arising in life-long SLAM.

Failsafe SLAM and recovery: Despite the progress made on the SLAM back-end, current SLAM solvers are still vulnerable in the presence of outliers. This is mainly due to the fact that virtually all robust SLAM techniques are based on iterative optimization of nonconvex costs. This has two consequences: first, the outlier rejection outcome depends on the quality of the initial guess fed to the optimization; second, the system is inherently fragile: the inclusion of a single outlier degrades the quality of the estimate, which in turn degrades the capability of discerning outliers later on. An ideal SLAM solution should be fail-safe and failure-aware, i.e., the system needs to be aware of imminent failure (e.g., due to outliers or degeneracies) and provide recovery mechanisms that can re-establish proper operation. None of the existing SLAM approaches provides these capabilities.

Robustness to HW failure: While addressing hardware failures might appear outside the scope of SLAM, these failures impact the SLAM system, and the latter can play a key role in detecting and mitigating sensor and locomotion failures. If the accuracy of a sensor degrades due to malfunctioning or aging, the quality of the sensor measurements (e.g., noise, bias) does not match the noise model used in the back-end (cf. eq. (3)), leading to poor estimates. This naturally poses different research questions: how can we detect degraded
sensor operation? How can we adjust sensor noise statistics (covariances, biases) accordingly?

Metric Relocalization: While appearance-based, as opposed to feature-based, methods are able to close loops between day and night sequences [69, 213], or between different seasons [223], the resulting loop closure is topological in nature. For metric relocalization (i.e., estimating the relative pose with respect to the previously built map), feature-based approaches are still the norm [121], but have not been extended to such circumstances. While vision has become the sensor of choice for many applications and loop closing has become a sensor signature matching problem [79, 80, 121, 302], other sensors as well as information inherent to the SLAM problem might be exploited. For instance, Brubaker et al. [37] use trajectory matching to address the shortcomings of cameras. Additionally, mapping with one sensor modality and localizing in the same map with a different sensor modality can be a useful addition. The work of Wolcott et al. [328] and Forster et al. [111] is a step in this direction. Another problem is localizing from significantly different viewpoints with heterogeneous sensing. Forster et al. [111] study vision-based localization in a lidar-based map. Majdik et al. [204] study how to localize a drone in a cadastral 3D model textured from Google Street View images, Behzadin et al. [21] show how to localize in hand-drawn maps using a laser scanner, and Winterhalter et al. [327] carry out RGB-D localization given 2D floor plans.

Time varying and deformable maps: Mainstream SLAM methods have been developed with the rigid and static world assumption in mind; however, the real world is non-rigid, both due to dynamics and to the inherent deformability of objects. An ideal SLAM solution should be able to reason about dynamics in the environment, including non-rigidity, work over long time periods generating "all terrain" maps, and be able to do so in real time. In the computer vision community, there have been several attempts since the 80s to recover shape from non-rigid objects, but with restrictive applicability: for example, Pentland et al. [245, 246] require prior knowledge of some mechanical properties of the objects, while Bregler et al. [36] and [304] require a restricted deformation of the object and show examples on human faces. Recent results in non-rigid SfM such as [4, 5, 128] are less restrictive but still work in small scenarios. In the SLAM community, Newcombe et al. [225] have addressed the non-rigid case for small-scale reconstruction. However, addressing the problem of non-rigid maps at a large scale is still largely unexplored.

Automatic parameter tuning: SLAM systems (in particular, the data association modules) require extensive parameter tuning in order to work correctly for a given scenario. These parameters include thresholds that control feature matching, RANSAC parameters, and criteria to decide when to add new factors to the graph or when to trigger a loop closing algorithm to search for matches. If SLAM has to work "out of the box" in arbitrary scenarios, methods for automatic tuning of the involved parameters need to be considered.

IV. LONG-TERM AUTONOMY II: SCALABILITY

Fig. 4: The current state of the art in SLAM can deal with scenarios of moderate size and complexity (e.g., the indoor scenario in the top subfigure), but large-scale environments (e.g., the dynamic city environment at the bottom) still pose formidable challenges to the existing algorithms. Perceptual aliasing and dynamics of the environment (moving objects, appearance and geometry changes across seasons) call for algorithms that are inherently robust to these phenomena, as well as fail detection and recovery techniques. The extended scale of the environment calls for estimation algorithms that can be easily parallelized, as well as better map representations that store information in a compact and actionable way.

While modern SLAM algorithms have been successfully demonstrated in indoor building-scale environments (Fig. 4,
top), in many application endeavors, robots must operate for an extended period of time over larger areas (Fig. 4, bottom). These applications include ocean exploration for environmental monitoring, non-stop cleaning robots in our cities, or large-scale precision agriculture. For such applications the size of the factor graph underlying SLAM can grow unbounded, due to the continuous exploration of new places and the increasing time of operation. In practice, the computational time and memory footprint are bounded by the resources of the robot. Therefore, it is important to design SLAM methods whose computational and memory complexity remains bounded.

In the worst case, successive linearization methods based on direct linear solvers imply a memory consumption which grows quadratically in the number of variables in (7). When using iterative linear solvers (e.g., the conjugate gradient [87]), the memory consumption grows linearly in the number of variables. The situation is further complicated by the fact that, when re-visiting a place multiple times, factor graph optimization becomes less efficient as nodes and edges are continuously added to the same spatial region, compromising the sparsity structure of the graph.

In this section we review some of the current approaches to control, or at least reduce, the growth of the size of the problem and discuss open challenges.
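To make the direct-versus-iterative trade-off tangible, the snippet below solves the same normal equations H*delta = g with a sparse LU factorization and with the conjugate gradient method [87]; the tridiagonal H mimics the approximate Hessian of a simple odometry chain and is purely illustrative.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Illustrative comparison of direct vs. iterative linear solvers on normal
# equations H*delta = g, with a tridiagonal H mimicking an odometry chain.
n = 10_000
H = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")
H = H + sp.identity(n, format="csc")  # prior-like term keeps H positive definite
g = np.ones(n)

# Direct solver: factorizes H once; fill-in during factorization is what can
# push memory toward quadratic growth for unfavorable sparsity patterns.
delta_direct = spla.splu(H).solve(g)

# Iterative solver: conjugate gradient only needs matrix-vector products
# H @ v, so its memory footprint stays linear in the number of variables.
delta_cg, info = spla.cg(H, g)

print(info, np.max(np.abs(delta_direct - delta_cg)))  # info == 0 on convergence
```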