I Introduction
II Related Work
II-A Stereo SLAM
II-B RGB-D SLAM
III ORB-SLAM2
III-A Monocular, Close Stereo and Far Stereo Keypoints
III-B System Bootstrapping
III-C Bundle Adjustment with Monocular and Stereo Constraints
III-D Loop Closing and Full BA
III-E Keyframe Insertion
III-F Localization Mode
IV Evaluation
IV-A KITTI Dataset
IV-B EuRoC Dataset
IV-C TUM RGB-D Dataset
IV-D Timing Results
V Conclusion
References
This paper has been accepted for publication in IEEE Transactions on Robotics. DOI: 10.1109/TRO.2017.2705103. IEEE Xplore: http://ieeexplore.ieee.org/document/7946260/. © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ORB-SLAM2: an Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras

Raúl Mur-Artal and Juan D. Tardós

This work was supported by the Spanish government under Project DPI2015-67275, the Aragón regional government under Project DGA T04-FSE and the Ministerio de Educación Scholarship FPU13/04175. R. Mur-Artal was with the Instituto de Investigación en Ingeniería de Aragón (I3A), Universidad de Zaragoza, 50018 Zaragoza, Spain, until January 2017. He is currently with Oculus Research, Redmond, WA 98052 USA (e-mail: raul.murartal@oculus.com). J. D. Tardós is with the Instituto de Investigación en Ingeniería de Aragón (I3A), Universidad de Zaragoza, 50018 Zaragoza, Spain (e-mail: tardos@unizar.es).

Abstract—We present ORB-SLAM2, a complete SLAM system for monocular, stereo and RGB-D cameras, including map reuse, loop closing and relocalization capabilities. The system works in real-time on standard CPUs in a wide variety of environments, from small hand-held indoor sequences, to drones flying in industrial environments and cars driving around a city. Our back-end, based on bundle adjustment with monocular and stereo observations, allows for accurate trajectory estimation with metric scale. Our system includes a lightweight localization mode that leverages visual odometry tracks for unmapped regions and matches to map points that allow for zero-drift localization. The evaluation on 29 popular public sequences shows that our method achieves state-of-the-art accuracy, being in most cases the most accurate SLAM solution. We publish the source code, not only for the benefit of the SLAM community, but with the aim of being an out-of-the-box SLAM solution for researchers in other fields.

Fig. 1. ORB-SLAM2 processes stereo and RGB-D inputs to estimate the camera trajectory and build a map of the environment. The system is able to close loops, relocalize, and reuse its map in real-time on standard CPUs with high accuracy and robustness. (a) Stereo input: trajectory and sparse reconstruction of an urban environment with multiple loop closures. (b) RGB-D input: keyframes and dense pointcloud of a room scene with one loop closure. The pointcloud is rendered by backprojecting the sensor depth maps from estimated keyframe poses. No fusion is performed.

I. INTRODUCTION

Simultaneous Localization and Mapping (SLAM) has been a hot research topic in the last two decades in the Computer Vision and Robotics communities, and has recently attracted the attention of high-technology companies. SLAM techniques build a map of an unknown environment and localize the sensor in the map, with a strong focus on real-time operation. Among the different sensor modalities, cameras are cheap and provide rich information about the environment, which allows for robust and accurate place recognition. Therefore Visual SLAM solutions, where the main sensor is a camera, are of major interest nowadays. Place recognition is a key module of a SLAM system to close loops (i.e. detect when the sensor returns to a mapped area and correct the accumulated error in exploration) and to relocalize the camera after a tracking failure, due to occlusion or aggressive motion, or at system re-initialization.

Visual SLAM can be performed by using just a monocular camera, which is the cheapest and smallest sensor setup. However, as depth is not observable from just one camera, the scale of the map and of the estimated trajectory is unknown. In addition, system bootstrapping requires multi-view or filtering techniques to produce an initial map, as it cannot be triangulated from the very first frame. Last but not least, monocular SLAM suffers from scale drift and may fail if performing pure rotations in exploration. By using a stereo or an RGB-D camera all these issues are solved, which allows for the most reliable Visual SLAM solutions.

In this paper we build on our monocular ORB-SLAM [1] and propose ORB-SLAM2 with the following contributions:
• The first open-source¹ SLAM system for monocular, stereo and RGB-D cameras, including loop closing, relocalization and map reuse.
• Our RGB-D results show that by using Bundle Adjustment (BA) we achieve more accuracy than state-of-the-art methods based on ICP or photometric and depth error minimization.
• By using close and far stereo points and monocular observations, our stereo results are more accurate than the state-of-the-art direct stereo SLAM.
• A lightweight localization mode that can effectively reuse the map with mapping disabled.

¹https://github.com/raulmur/ORB_SLAM2

Fig. 1 shows examples of ORB-SLAM2 output from stereo and RGB-D inputs. The stereo case shows the final trajectory and sparse reconstruction of sequence 00 from the KITTI dataset [2]. This is an urban sequence with multiple loop closures that ORB-SLAM2 was able to successfully detect. The RGB-D case shows the keyframe poses estimated in sequence fr1 room from the TUM RGB-D Dataset [3], and a dense pointcloud, rendered by backprojecting sensor depth maps from the estimated keyframe poses. Note that our SLAM does not perform any fusion like KinectFusion [4] or similar, but the good definition indicates the accuracy of the keyframe poses. More examples are shown in the attached video.

In the rest of the paper, we discuss related work in Section II, describe our system in Section III, then present the evaluation results in Section IV and end with conclusions in Section V.

II. RELATED WORK

In this section we discuss related work on stereo and RGB-D SLAM. Our discussion, as well as the evaluation in Section IV, is focused only on SLAM approaches.

A. Stereo SLAM

A remarkable early stereo SLAM system was the work of Paz et al. [5]. Based on Conditionally Independent Divide and Conquer EKF-SLAM, it was able to operate in larger environments than other approaches at that time. Most importantly, it was the first stereo SLAM exploiting both close and far points (i.e. points whose depth cannot be reliably estimated due to little disparity in the stereo camera), using an inverse depth parametrization [6] for the latter. They empirically showed that points can be reliably triangulated if their depth is less than ∼40 times the stereo baseline. In this work we follow this strategy of treating close and far points differently, as explained in Section III-A.

Most modern stereo SLAM systems are keyframe-based [7] and perform BA optimization in a local area to achieve scalability. The work of Strasdat et al. [8] performs a joint optimization of BA (point-pose constraints) in an inner window of keyframes and pose-graph (pose-pose constraints) in an outer window. By limiting the size of these windows the method achieves constant time complexity, at the expense of not guaranteeing global consistency. The RSLAM of Mei et al. [9] uses a relative representation of landmarks and poses and performs relative BA in an active area, which can be constrained for constant time. RSLAM is able to close loops, which allows expanding the active areas at both sides of a loop, but global consistency is not enforced. The recent S-PTAM by Pire et al. [10] performs local BA; however it lacks large loop closing. Similar to these approaches, we perform BA in a local set of keyframes, so that the complexity is independent of the map size and we can operate in large environments. However, our goal is to build a globally consistent map.
When closing a loop, our system first aligns both sides, similar to RSLAM, so that the tracking is able to continue localizing using the old map, and then performs a pose-graph optimization that minimizes the drift accumulated in the loop, followed by full BA.

The recent Stereo LSD-SLAM of Engel et al. [11] is a semi-dense direct approach that minimizes photometric error in image regions with high gradient. Not relying on features, the method is expected to be more robust to motion blur or poorly-textured environments. However, as a direct method, its performance can be severely degraded by unmodeled effects like rolling shutter or non-lambertian reflectance.

B. RGB-D SLAM

One of the earliest and most famed RGB-D SLAM systems was the KinectFusion of Newcombe et al. [4]. This method fused all depth data from the sensor into a volumetric dense model that is used to track the camera pose using ICP. This system was limited to small workspaces due to its volumetric representation and the lack of loop closing. Kintinuous by Whelan et al. [12] was able to operate in large environments by using a rolling cyclical buffer and included loop closing using place recognition and pose-graph optimization.

Probably the first popular open-source system was the RGB-D SLAM of Endres et al. [13]. This is a feature-based system, whose front-end computes frame-to-frame motion by feature matching and ICP. The back-end performs pose-graph optimization with loop closure constraints from a heuristic search. Similarly, the back-end of DVO-SLAM by Kerl et al. [14] optimizes a pose-graph where keyframe-to-keyframe constraints are computed from a visual odometry that minimizes both photometric and depth error. DVO-SLAM also searches for loop candidates in a heuristic fashion over all previous frames, instead of relying on place recognition.

The recent ElasticFusion of Whelan et al. [15] builds a surfel-based map of the environment. This is a map-centric approach that forgets poses and performs loop closing by applying a non-rigid deformation to the map, instead of a standard pose-graph optimization. The detailed reconstruction and localization accuracy of this system is impressive, but the current implementation is limited to room-size maps, as the complexity scales with the number of surfels in the map.

As proposed by Strasdat et al. [8], our ORB-SLAM2 uses depth information to synthesize a stereo coordinate for extracted features on the image. This way our system is agnostic of the input being stereo or RGB-D. Differently from all the above methods, our back-end is based on bundle adjustment and builds a globally consistent sparse reconstruction. Therefore our method is lightweight and works with standard CPUs. Our goal is long-term and globally consistent localization instead of building the most detailed dense reconstruction. However, from the highly accurate keyframe poses one could fuse depth maps and get an accurate reconstruction on-the-fly in a local area, or post-process the depth maps from all keyframes after a full BA and get an accurate 3D model of the whole scene.
Fig. 2. ORB-SLAM2 is composed of three main parallel threads: tracking, local mapping and loop closing, which can create a fourth thread to perform full BA after a loop closure. The tracking thread pre-processes the stereo or RGB-D input so that the rest of the system operates independently of the input sensor. Although it is not shown in this figure, ORB-SLAM2 also works with a monocular input as in [1]. (a) System threads and modules. (b) Input pre-processing.

III. ORB-SLAM2

ORB-SLAM2 for stereo and RGB-D cameras is built on our monocular feature-based ORB-SLAM [1], whose main components are summarized here for the reader's convenience. A general overview of the system is shown in Fig. 2. The system has three main parallel threads: 1) the tracking, to localize the camera with every frame by finding feature matches to the local map and minimizing the reprojection error applying motion-only BA; 2) the local mapping, to manage the local map and optimize it, performing local BA; and 3) the loop closing, to detect large loops and correct the accumulated drift by performing a pose-graph optimization. This thread launches a fourth thread to perform full BA after the pose-graph optimization, to compute the optimal structure and motion solution.

The system embeds a Place Recognition module based on DBoW2 [16] for relocalization, in case of tracking failure (e.g. an occlusion) or for re-initialization in an already mapped scene, and for loop detection. The system maintains a covisibility graph [8] that links any two keyframes observing common points, and a minimum spanning tree connecting all keyframes. These graph structures allow retrieving local windows of keyframes, so that tracking and local mapping operate locally, allowing the system to work in large environments, and serve as the structure for the pose-graph optimization performed when closing a loop.
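The covisibility graph is a simple structure over keyframes. As an illustration only (the released system is implemented in C++, and the class name and edge threshold below are assumptions, not the authors' code), a minimal Python sketch of a covisibility graph keyed by shared map-point counts could look as follows:

```python
from collections import defaultdict

class CovisibilityGraph:
    """Sketch: keyframes are nodes; an edge links two keyframes that observe
    common map points, weighted by the number of shared points."""

    def __init__(self, min_shared_points=15):  # edge threshold is an assumption
        self.observations = {}            # keyframe_id -> set of map_point_ids
        self.weights = defaultdict(dict)  # keyframe_id -> {neighbour_id: weight}
        self.min_shared_points = min_shared_points

    def add_keyframe(self, kf_id, map_point_ids):
        pts = set(map_point_ids)
        self.observations[kf_id] = pts
        for other_id, other_pts in self.observations.items():
            if other_id == kf_id:
                continue
            shared = len(pts & other_pts)
            if shared >= self.min_shared_points:
                self.weights[kf_id][other_id] = shared
                self.weights[other_id][kf_id] = shared

    def local_window(self, kf_id):
        """Keyframes covisible with kf_id, best-connected first: this is the kind
        of local window used by tracking and local mapping."""
        return sorted(self.weights[kf_id], key=self.weights[kf_id].get, reverse=True)

# toy usage with a low threshold so the two keyframes become connected
g = CovisibilityGraph(min_shared_points=2)
g.add_keyframe(0, [1, 2, 3, 4, 5])
g.add_keyframe(1, [3, 4, 5, 6, 7])
print(g.local_window(0))  # [1]
```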
The system uses the same ORB features [17] for tracking, mapping and place recognition tasks. These features are robust to rotation and scale and present a good invariance to camera auto-gain and auto-exposure, and to illumination changes. Moreover they are fast to extract and match, allowing for real-time operation, and show good precision/recall performance in bag-of-words place recognition [18]. In the rest of this section we present how stereo/depth information is exploited and which elements of the system are affected. For a detailed description of each system block, we refer the reader to our monocular publication [1].

A. Monocular, Close Stereo and Far Stereo Keypoints

ORB-SLAM2, as a feature-based method, pre-processes the input to extract features at salient keypoint locations, as shown in Fig. 2b. The input images are then discarded and all system operations are based on these features, so that the system is independent of the sensor being stereo or RGB-D. Our system handles monocular and stereo keypoints, which are further classified as close or far.

Stereo keypoints are defined by three coordinates x_s = (u_L, v_L, u_R), where (u_L, v_L) are the coordinates on the left image and u_R is the horizontal coordinate in the right image. For stereo cameras, we extract ORB in both images and for every left ORB we search for a match in the right image. This can be done very efficiently assuming stereo-rectified images, so that epipolar lines are horizontal. We then generate the stereo keypoint with the coordinates of the left ORB and the horizontal coordinate of the right match, which is subpixel refined by patch correlation. For RGB-D cameras, we extract ORB features on the RGB image and, as proposed by Strasdat et al. [8], for each feature with coordinates (u_L, v_L) we transform its depth value d into a virtual right coordinate:

\[
u_R = u_L - \frac{f_x b}{d} \quad (1)
\]

where f_x is the horizontal focal length and b is the baseline between the structured light projector and the infrared camera, which we approximate to 8 cm for Kinect and Asus Xtion.
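To make Eq. (1) concrete, the following Python sketch builds a stereo keypoint (u_L, v_L, u_R) either from a rectified stereo match or from an RGB-D depth value. The function names and the example calibration values are illustrative assumptions, not part of the released code.

```python
def stereo_keypoint_from_match(u_left, v_left, u_right):
    """Rectified stereo: the right match lies on the same image row, so the
    stereo keypoint is simply (u_L, v_L, u_R)."""
    return (u_left, v_left, u_right)

def stereo_keypoint_from_depth(u_left, v_left, depth, fx, baseline):
    """RGB-D: synthesize a virtual right coordinate u_R = u_L - fx*b/d (Eq. 1).
    Returns None for an invalid depth, which yields a monocular keypoint instead."""
    if depth is None or depth <= 0.0:
        return None
    u_right = u_left - fx * baseline / depth
    return (u_left, v_left, u_right)

# illustrative calibration: fx in pixels, baseline in meters (~8 cm for Kinect/Xtion)
fx, baseline = 525.0, 0.08
print(stereo_keypoint_from_depth(320.0, 240.0, 2.0, fx, baseline))
# (320.0, 240.0, 299.0)
```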
The uncertainty of the depth sensor is represented by the uncertainty of the virtual right coordinate. In this way, features from stereo and RGB-D input are handled equally by the rest of the system.

A stereo keypoint is classified as close if its associated depth is less than 40 times the stereo/RGB-D baseline, as suggested in [5]; otherwise it is classified as far. Close keypoints can be safely triangulated from one frame, as their depth is accurately estimated, and they provide scale, translation and rotation information. On the other hand, far points provide accurate rotation information but weaker scale and translation information. We triangulate far points when they are supported by multiple views.

Monocular keypoints are defined by two coordinates x_m = (u_L, v_L) on the left image and correspond to all those ORB for which a stereo match could not be found or that have an invalid depth value in the RGB-D case. These points are only triangulated from multiple views and do not provide scale information, but contribute to the rotation and translation estimation.

B. System Bootstrapping

One of the main benefits of using stereo or RGB-D cameras is that, by having depth information from just one frame, we do not need a specific structure-from-motion initialization as in the monocular case. At system startup we create a keyframe with the first frame, set its pose to the origin, and create an initial map from all stereo keypoints.

C. Bundle Adjustment with Monocular and Stereo Constraints

Our system performs BA to optimize the camera pose in the tracking thread (motion-only BA), to optimize a local window of keyframes and points in the local mapping thread (local BA), and after a loop closure to optimize all keyframes and points (full BA). We use the Levenberg-Marquardt method implemented in g2o [19].

Motion-only BA optimizes the camera orientation R ∈ SO(3) and position t ∈ R^3, minimizing the reprojection error between matched 3D points X^i ∈ R^3 in world coordinates and keypoints x^i_(·), either monocular x^i_m ∈ R^2 or stereo x^i_s ∈ R^3, with i ∈ X the set of all matches:

\[
\{\mathbf{R}, \mathbf{t}\} = \operatorname*{argmin}_{\mathbf{R}, \mathbf{t}} \sum_{i \in \mathcal{X}} \rho\left( \left\| \mathbf{x}^{i}_{(\cdot)} - \pi_{(\cdot)}\left( \mathbf{R} \mathbf{X}^{i} + \mathbf{t} \right) \right\|^{2}_{\Sigma} \right) \quad (2)
\]

where ρ is the robust Huber cost function and Σ the covariance matrix associated to the scale of the keypoint. The projection functions π_(·), monocular π_m and rectified stereo π_s, are defined as follows:

\[
\pi_m\left(\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}\right) =
\begin{bmatrix} f_x \dfrac{X}{Z} + c_x \\ f_y \dfrac{Y}{Z} + c_y \end{bmatrix},
\qquad
\pi_s\left(\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}\right) =
\begin{bmatrix} f_x \dfrac{X}{Z} + c_x \\ f_y \dfrac{Y}{Z} + c_y \\ f_x \dfrac{X - b}{Z} + c_x \end{bmatrix} \quad (3)
\]

where (f_x, f_y) is the focal length, (c_x, c_y) is the principal point and b the baseline, all known from calibration.

Local BA optimizes a set of covisible keyframes K_L and all points seen in those keyframes P_L. All other keyframes K_F, not in K_L, observing points in P_L contribute to the cost function but remain fixed in the optimization. Defining X_k as the set of matches between points in P_L and keypoints in a keyframe k, the optimization problem is the following:

\[
\{\mathbf{X}^i, \mathbf{R}_l, \mathbf{t}_l \mid i \in \mathcal{P}_L, l \in \mathcal{K}_L\} =
\operatorname*{argmin}_{\mathbf{X}^i, \mathbf{R}_l, \mathbf{t}_l}
\sum_{k \in \mathcal{K}_L \cup \mathcal{K}_F} \sum_{j \in \mathcal{X}_k} \rho\left( E_{kj} \right),
\qquad
E_{kj} = \left\| \mathbf{x}^{j}_{(\cdot)} - \pi_{(\cdot)}\left( \mathbf{R}_k \mathbf{X}^{j} + \mathbf{t}_k \right) \right\|^{2}_{\Sigma} \quad (4)
\]

Full BA is the specific case of local BA where all keyframes and points in the map are optimized, except the origin keyframe, which is fixed to eliminate the gauge freedom.
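A minimal numerical sketch of the projection functions in Eq. (3) and of the close/far classification follows. It assumes points are already expressed in the camera frame and uses made-up calibration values; it only illustrates the residual terms, which the actual system robustifies, weights by the keypoint scale covariance, and minimizes with Levenberg-Marquardt in g2o.

```python
import numpy as np

def project_mono(X, fx, fy, cx, cy):
    """pi_m: 3D camera-frame point -> (u, v) pixel coordinates."""
    x, y, z = X
    return np.array([fx * x / z + cx, fy * y / z + cy])

def project_stereo(X, fx, fy, cx, cy, b):
    """pi_s: 3D camera-frame point -> (u_L, v_L, u_R) for a rectified stereo rig."""
    x, y, z = X
    u_l = fx * x / z + cx
    v_l = fy * y / z + cy
    u_r = fx * (x - b) / z + cx
    return np.array([u_l, v_l, u_r])

def is_close(depth, b, ratio=40.0):
    """Close/far test: a keypoint is 'close' if its depth is below ~40 baselines."""
    return depth < ratio * b

def reprojection_residual(x_obs, X_world, R, t, fx, fy, cx, cy, b=None):
    """Raw reprojection residual x - pi(R X + t); in the paper's BA this term is
    weighted by the keypoint scale covariance and robustified with a Huber cost."""
    X_cam = R @ X_world + t
    if b is None:
        return x_obs - project_mono(X_cam, fx, fy, cx, cy)
    return x_obs - project_stereo(X_cam, fx, fy, cx, cy, b)

# illustrative calibration (roughly KITTI-like focal length and baseline)
fx, fy, cx, cy, b = 718.0, 718.0, 607.0, 185.0, 0.54
X = np.array([1.0, 0.5, 10.0])
print(project_stereo(X, fx, fy, cx, cy, b), is_close(X[2], b))
```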
D. Loop Closing and Full BA

Loop closing is performed in two steps: firstly a loop has to be detected and validated, and secondly the loop is corrected by optimizing a pose-graph. In contrast to monocular ORB-SLAM, where scale drift may occur [20], the stereo/depth information makes scale observable, and the geometric validation and pose-graph optimization no longer require dealing with scale drift and are based on rigid body transformations instead of similarities.

In ORB-SLAM2 we have incorporated a full BA optimization after the pose-graph to achieve the optimal solution. This optimization might be very costly and therefore we perform it in a separate thread, allowing the system to continue creating the map and detecting loops. However this brings the challenge of merging the bundle adjustment output with the current state of the map. If a new loop is detected while the optimization is running, we abort the optimization and proceed to close the loop, which will launch the full BA optimization again. When the full BA finishes, we need to merge the updated subset of keyframes and points optimized by the full BA with the non-updated keyframes and points that were inserted while the optimization was running. This is done by propagating the correction of updated keyframes (i.e. the transformation from the non-optimized to the optimized pose) to non-updated keyframes through the spanning tree. Non-updated points are transformed according to the correction applied to their reference keyframe.

E. Keyframe Insertion

ORB-SLAM2 follows the policy introduced in monocular ORB-SLAM of inserting keyframes very often and culling redundant ones afterwards. The distinction between close and far stereo points allows us to introduce a new condition for keyframe insertion, which can be critical in challenging environments where a big part of the scene is far from the stereo sensor, as shown in Fig. 3. In such environments we need a sufficient amount of close points to accurately estimate translation; therefore, if the number of tracked close points drops below τt and the frame could create at least τc new close stereo points, the system will insert a new keyframe. We empirically found that τt = 100 and τc = 70 work well in all our experiments.
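As a rough sketch of the close-point condition just described (with the thresholds τt = 100 and τc = 70 reported above), the decision could be written as below. The function and variable names are hypothetical, and the real system combines this test with the other keyframe insertion criteria inherited from monocular ORB-SLAM.

```python
TAU_T = 100  # minimum number of tracked close points before insertion is considered
TAU_C = 70   # minimum number of new close points the frame could create

def need_new_keyframe(num_tracked_close, num_potential_new_close,
                      tau_t=TAU_T, tau_c=TAU_C):
    """Insert a keyframe when tracked close points drop below tau_t and the
    current frame could add at least tau_c new close stereo points."""
    return num_tracked_close < tau_t and num_potential_new_close >= tau_c

# e.g. on a highway sequence where most of the scene is far from the camera:
print(need_new_keyframe(num_tracked_close=60, num_potential_new_close=120))  # True
```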
F. Localization Mode

We incorporate a Localization Mode which can be useful for lightweight long-term localization in well-mapped areas, as long as there are no significant changes in the environment. In this mode the local mapping and loop closing threads are deactivated and the camera is continuously localized by the tracking, using relocalization if needed. In this mode the tracking leverages visual odometry matches and matches to map points. Visual odometry matches are matches between ORB in the current frame and 3D points created in the previous frame from the stereo/depth information. These matches make the localization robust to unmapped regions, but drift can be accumulated. Map point matches ensure drift-free localization to the existing map. This mode is demonstrated in the accompanying video.

IV. EVALUATION

We have evaluated ORB-SLAM2 in three popular datasets and compared to other state-of-the-art SLAM systems, always using the results published by the original authors and standard evaluation metrics in the literature. We have run ORB-SLAM2 on an Intel Core i7-4790 desktop computer with 16 GB RAM. In order to account for the non-deterministic nature of the multi-threading system, we run each sequence 5 times and show median results for the accuracy of the estimated trajectory. Our open-source implementation includes the calibration and instructions to run the system in all these datasets.

A. KITTI Dataset

The KITTI dataset [2] contains stereo sequences recorded from a car in urban and highway environments. The stereo sensor has a ∼54 cm baseline and works at 10 Hz with a resolution after rectification of 1240 × 376 pixels. Sequences 00, 02, 05, 06, 07 and 09 contain loops. Our ORB-SLAM2 detects all loops and is able to reuse its map afterwards, except for sequence 09, where the loop happens in very few frames at the end of the sequence. Table I shows results in the 11 training sequences, which have public ground-truth, compared to the state-of-the-art Stereo LSD-SLAM [11], to our knowledge the only stereo SLAM showing detailed results for all sequences. We use two different metrics: the absolute translation RMSE tabs proposed in [3], and the average relative translation trel and rotation rrel errors proposed in [2].

TABLE I
COMPARISON OF ACCURACY IN THE KITTI DATASET.

            ORB-SLAM2 (stereo)                    Stereo LSD-SLAM
Sequence   trel (%)  rrel (deg/100m)  tabs (m)    trel (%)  rrel (deg/100m)  tabs (m)
00         0.70      0.25             1.3         0.63      0.26             1.0
01         1.39      0.21             10.4        2.36      0.36             9.0
02         0.76      0.23             5.7         0.79      0.23             2.6
03         0.71      0.18             0.6         1.01      0.28             1.2
04         0.48      0.13             0.2         0.38      0.31             0.2
05         0.40      0.16             0.8         0.64      0.18             1.5
06         0.51      0.15             0.8         0.71      0.18             1.3
07         0.50      0.28             0.5         0.56      0.29             0.5
08         1.05      0.32             3.6         1.11      0.31             3.9
09         0.87      0.27             3.2         1.14      0.25             5.6
10         0.60      0.27             1.0         0.72      0.33             1.5

Our system outperforms Stereo LSD-SLAM in most sequences, and achieves in general a relative error lower than 1%. Sequence 01, see Fig. 3, is the only highway sequence in the training set and its translation error is slightly worse. Translation is harder to estimate in this sequence because very few close points can be tracked, due to high speed and low frame-rate. However, orientation can be accurately estimated, achieving an error of 0.21 degrees per 100 meters, as there are many far points that can be tracked for a long time.

Fig. 3. Tracked points in KITTI 01 [2]. Green points have a depth less than 40 times the stereo baseline, while blue points are further away. In this kind of sequence it is important to insert keyframes often enough so that the amount of close points allows for accurate translation estimation. Far points contribute to estimating orientation but provide weak information for translation and scale.

Fig. 4. Estimated trajectory (black) and ground-truth (red) in KITTI 00, 01, 05 and 07.

Fig. 4 shows some examples of estimated trajectories. Compared to the monocular results presented in [1], the proposed stereo version is able to process sequence 01, where the monocular system failed. In this highway sequence, see Fig. 3, close points are in view only for a few frames. The ability of the stereo version to create points from just one stereo keyframe, instead of the delayed initialization of the monocular system, which consists of finding matches between two keyframes, is critical in this sequence not to lose tracking. Moreover, the stereo system estimates the map and trajectory with metric scale and does not suffer from scale drift, as seen in Fig. 5.
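For reference, the absolute translation RMSE used throughout this evaluation can be approximated as follows: align the estimated trajectory to the ground-truth with a rigid-body (SE(3)) fit and take the RMSE of the remaining translational differences. The sketch below uses a standard Umeyama/Horn-style alignment and is only an approximation of the benchmark scripts of [2], [3], not the exact tooling used in the paper.

```python
import numpy as np

def align_rigid(est, gt):
    """Least-squares SE(3) alignment (Kabsch/Umeyama, no scale) of Nx3 trajectories."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ S @ U.T
    t = mu_g - R @ mu_e
    return R, t

def ate_rmse(est, gt):
    """Absolute translation RMSE after rigid alignment, in the spirit of the TUM benchmark."""
    R, t = align_rigid(est, gt)
    err = (est @ R.T + t) - gt
    return np.sqrt((err ** 2).sum(axis=1).mean())

# toy example: a trajectory offset by a constant translation aligns perfectly
gt = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]])
est = gt + np.array([0.5, -0.2, 0.1])
print(ate_rmse(est, gt))  # ~0.0
```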
Fig. 5. Estimated trajectory (black) and ground-truth (red) in KITTI 08. Left: monocular ORB-SLAM [1]; right: ORB-SLAM2 (stereo). Monocular ORB-SLAM suffers from severe scale drift in this sequence, especially at the turns. In contrast, the proposed stereo version is able to estimate the true scale of the trajectory and map without scale drift.

B. EuRoC Dataset

The recent EuRoC dataset [21] contains 11 stereo sequences recorded from a micro aerial vehicle (MAV) flying around two different rooms and a large industrial environment. The stereo sensor has a ∼11 cm baseline and provides WVGA images at 20 Hz. The sequences are classified as easy, medium and difficult depending on the MAV's speed, illumination and scene texture. In all sequences the MAV revisits the environment and ORB-SLAM2 is able to reuse its map, closing loops when necessary. Table II shows the absolute translation RMSE of ORB-SLAM2 for all sequences, compared to Stereo LSD-SLAM, using the results provided in [11]. ORB-SLAM2 achieves a localization precision of a few centimeters and is more accurate than Stereo LSD-SLAM. Our tracking gets lost in some parts of V2 03 difficult due to severe motion blur. As shown in [22], this sequence can be processed using IMU information. Fig. 6 shows examples of computed trajectories compared to the ground-truth.

TABLE II
EUROC DATASET. COMPARISON OF TRANSLATION RMSE (m).

Sequence          ORB-SLAM2 (stereo)   Stereo LSD-SLAM
V1 01 easy        0.035                0.066
V1 02 medium      0.020                0.074
V1 03 difficult   0.048                0.089
V2 01 easy        0.037                -
V2 02 medium      0.035                -
V2 03 difficult   X                    -
MH 01 easy        0.035                -
MH 02 easy        0.018                -
MH 03 medium      0.028                -
MH 04 difficult   0.119                -
MH 05 difficult   0.060                -

Fig. 6. Estimated trajectory (black) and ground-truth (red) in EuRoC V1 02 medium, V2 02 medium, MH 03 medium and MH 05 difficult.

C. TUM RGB-D Dataset

The TUM RGB-D dataset [3] contains indoor sequences from RGB-D sensors grouped in several categories to evaluate object reconstruction and SLAM/odometry methods under different texture, illumination and structure conditions. We show results in a subset of sequences where most RGB-D methods are usually evaluated. In Table III we compare our accuracy to the following state-of-the-art methods: ElasticFusion [15], Kintinuous [12], DVO-SLAM [14] and RGB-D SLAM [13]. Our method is the only one based on bundle adjustment and outperforms the other approaches in most sequences. As we already noticed for the RGB-D SLAM results in [1], depthmaps for freiburg2 sequences have a 4% scale bias, probably coming from miscalibration, that we have compensated in our runs and that could partly explain our significantly better results. Fig. 7 shows the point clouds that result from backprojecting the sensor depth maps from the computed keyframe poses in four sequences. The good definition and the straight contours of desks and posters prove the high-accuracy localization of our approach.

TABLE III
TUM RGB-D DATASET. COMPARISON OF TRANSLATION RMSE (m).

Sequence     ORB-SLAM2 (RGB-D)   ElasticFusion   Kintinuous   DVO-SLAM   RGBD-SLAM
fr1/desk     0.016               0.020           0.037        0.021      0.026
fr1/desk2    0.022               0.048           0.071        0.046      -
fr1/room     0.047               0.068           0.075        0.043      0.087
fr2/desk     0.009               0.071           0.034        0.017      0.057
fr2/xyz      0.004               0.011           0.029        0.018      -
fr3/office   0.010               0.017           0.030        0.035      -
fr3/nst      0.019               0.016           0.031        0.018      -

Fig. 7. Dense pointcloud reconstructions from estimated keyframe poses and sensor depth maps in TUM RGB-D fr3 office, fr1 room, fr2 desk and fr3 nst.

D. Timing Results

In order to complete the evaluation of the proposed system, we present in Table IV timing results in three sequences with different image resolutions and sensors. The mean and two-standard-deviation ranges are shown for each thread task. As these sequences contain one single loop, the full BA and some tasks of the loop closing thread are executed just once, and only a single time measurement is reported. The average tracking time per frame is below the inverse of the camera frame-rate for each sequence, meaning that our system is able to work in real-time. As ORB extraction in stereo images is parallelized, extracting 1000 ORB features in the stereo WVGA images of V2 02 takes a similar time to extracting the same amount of features in the single VGA image channel of fr3 office.

The number of keyframes in the loop is shown as a reference for the times related to loop closing. While the loop in KITTI 07 contains more keyframes, the covisibility graph built for the indoor fr3 office is denser, and therefore the loop fusion, pose-graph optimization and full BA tasks are more expensive. The higher density of the covisibility graph makes the local map contain more keyframes and points, and therefore local map tracking and local BA are also more expensive.

V. CONCLUSION

We have presented a full SLAM system for monocular, stereo and RGB-D sensors, able to perform relocalization, loop closing and map reuse in real-time on standard CPUs. We focus on building globally consistent maps for reliable and long-term localization in a wide range of environments, as demonstrated in the experiments. The proposed localization mode, with the relocalization capability of the system, yields a very robust, zero-drift and lightweight localization method for known environments. This mode can be useful for certain applications, such as tracking the user viewpoint in virtual reality in a well-mapped space.

The comparison to the state-of-the-art shows that ORB-SLAM2 achieves in most cases the highest accuracy. In the KITTI visual odometry benchmark ORB-SLAM2 is currently the best stereo SLAM solution. Crucially, compared with the stereo visual odometry methods that have flourished in recent years, ORB-SLAM2 achieves zero-drift localization in already mapped areas.

Surprisingly, our RGB-D results demonstrate that if the most accurate camera localization is desired, bundle adjustment performs better than direct methods or ICP, with the additional advantage of being less computationally expensive and not requiring GPU processing to operate in real-time.

We have released the source code of our system, with examples and instructions so that it can be easily used by other researchers. ORB-SLAM2 is, to the best of our knowledge, the first open-source visual SLAM system that can work with monocular, stereo and RGB-D inputs. Moreover, our source code contains an example of an augmented reality application² using a monocular camera to show the potential of our solution. Future extensions might include, to name some examples, non-overlapping multi-camera, fisheye or omnidirectional camera support, large-scale dense fusion, cooperative mapping or increased motion blur robustness.

²https://youtu.be/kPwy8yA4CKM