This paper has been accepted for publication in IEEE Transactions on Robotics.
DOI: 10.1109/TRO.2017.2705103
IEEE Xplore: http://ieeexplore.ieee.org/document/7946260/
© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any
current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new
collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other
works.
arXiv:1610.06475v2 [cs.RO] 19 Jun 2017
ORB-SLAM2: an Open-Source SLAM System for
Monocular, Stereo and RGB-D Cameras
Raúl Mur-Artal and Juan D. Tardós
Abstract—We present ORB-SLAM2, a complete SLAM system
for monocular, stereo and RGB-D cameras, including map reuse,
loop closing and relocalization capabilities. The system works in
real-time on standard CPUs in a wide variety of environments,
from small hand-held indoor sequences to drones flying in
industrial environments and cars driving around a city. Our
back-end, based on bundle adjustment with monocular and stereo
observations, allows for accurate trajectory estimation with metric
scale. Our system includes a lightweight localization mode that
leverages visual odometry tracks for unmapped regions and
matches to map points that allow for zero-drift localization. The
evaluation on 29 popular public sequences shows that our method
achieves state-of-the-art accuracy, being in most cases the most
accurate SLAM solution. We publish the source code, not only
for the benefit of the SLAM community, but with the aim of
being an out-of-the-box SLAM solution for researchers in other
fields.
I. INTRODUCTION
Simultaneous Localization and Mapping (SLAM) has been
a hot research topic in the last two decades in the Computer Vision
and Robotics communities, and has recently attracted the
attention of high-technology companies. SLAM techniques
build a map of an unknown environment and localize the
sensor in the map with a strong focus on real-time operation.
Among the different sensor modalities, cameras are cheap
and provide rich information about the environment that allows
for robust and accurate place recognition. Therefore, Visual
SLAM solutions, where the main sensor is a camera, are of
major interest nowadays. Place recognition is a key module
of a SLAM system to close loops (i.e. detect when the sensor
returns to a mapped area and correct the accumulated error
in exploration) and to relocalize the camera after a tracking
failure, due to occlusion or aggressive motion, or at system
re-initialization.
Visual SLAM can be performed by using just a monocular
camera, which is the cheapest and smallest sensor setup.
However, as depth is not observable from just one camera,
the scale of the map and estimated trajectory is unknown.
In addition, system bootstrapping requires multi-view or
filtering techniques to produce an initial map, as it cannot
be triangulated from the very first frame. Last but not least,
monocular SLAM suffers from scale drift and may fail when
performing pure rotations in exploration. Using a stereo
or an RGB-D camera solves all these issues and allows
for the most reliable Visual SLAM solutions.
This work was supported by the Spanish government under Project
DPI2015-67275, the Aragón regional government under Project DGA T04-FSE
and the Ministerio de Educación Scholarship FPU13/04175.
R. Mur-Artal was with the Instituto de Investigación en Ingeniería de
Aragón (I3A), Universidad de Zaragoza, 50018 Zaragoza, Spain, until January
2017. He is currently with Oculus Research, Redmond, WA 98052 USA
(e-mail: raul.murartal@oculus.com).
J. D. Tardós is with the Instituto de Investigación en Ingeniería de
Aragón (I3A), Universidad de Zaragoza, 50018 Zaragoza, Spain (e-mail:
tardos@unizar.es).
(a) Stereo input: trajectory and sparse reconstruction of an urban environment
with multiple loop closures.
(b) RGB-D input: keyframes and dense pointcloud of a room scene with one
loop closure. The pointcloud is rendered by backprojecting the sensor depth
maps from estimated keyframe poses. No fusion is performed.
Fig. 1. ORB-SLAM2 processes stereo and RGB-D inputs to estimate camera
trajectory and build a map of the environment. The system is able to close
loops, relocalize, and reuse its map in real-time on standard CPUs with high
accuracy and robustness.
In this paper we build on our monocular ORB-SLAM [1]
and propose ORB-SLAM2 with the following contributions:
• The first open-source¹ SLAM system for monocular,
stereo and RGB-D cameras, including loop closing,
relocalization and map reuse.
• Our RGB-D results show that by using Bundle Adjustment
(BA) we achieve higher accuracy than state-of-the-art
methods based on ICP or photometric and depth error
minimization.
• By using close and far stereo points and monocular
observations, our stereo results are more accurate than the
state-of-the-art direct stereo SLAM.
• A lightweight localization mode that can effectively reuse
the map with mapping disabled.
Fig. 1 shows examples of ORB-SLAM2 output from stereo
and RGB-D inputs. The stereo case shows the final trajectory
and sparse reconstruction of the sequence 00 from the KITTI
dataset [2]. This is an urban sequence with multiple loop
closures that ORB-SLAM2 was able to successfully detect.
The RGB-D case shows the keyframe poses estimated in
sequence fr1/room from the TUM RGB-D Dataset [3], and
a dense pointcloud, rendered by backprojecting sensor depth
maps from the estimated keyframe poses. Note that our SLAM
does not perform any fusion like KinectFusion [4] or similar,
but the good definition indicates the accuracy of the keyframe
poses. More examples are shown in the accompanying video.
In the rest of the paper, we discuss related work in Section
II, we describe our system in Section III, then present the
evaluation results in Section IV and end with conclusions in
Section V.
II. RELATED WORK
In this section we discuss related work on stereo and RGB-D
SLAM. Our discussion, as well as the evaluation in Section
IV, is focused only on SLAM approaches.
A. Stereo SLAM
A remarkable early stereo SLAM system was the work of
Paz et al. [5]. Based on Conditionally Independent Divide and
Conquer EKF-SLAM, it was able to operate in larger environments
than other approaches at that time. Most importantly, it
was the first stereo SLAM exploiting both close and far points
(the latter being points whose depth cannot be reliably estimated
due to little disparity in the stereo camera), using an inverse depth
parametrization [6] for far points. They empirically showed that
points can be reliably triangulated if their depth is less than
∼40 times the stereo baseline. In this work we follow this
strategy of treating close and far points differently, as
explained in Section III-A.
Most modern stereo SLAM systems are keyframe-based
[7] and perform BA optimization in a local area to achieve
scalability. The work of Strasdat et al. [8] performs a joint
optimization of BA (point-pose constraints) in an inner win-
dow of keyframes and pose-graph (pose-pose constraints) in
an outer window. By limiting the size of these windows the
method achieves constant time complexity, at the expense of
not guaranteeing global consistency. The RSLAM of Mei et
al. [9] uses a relative representation of landmarks and poses
and performs relative BA in an active area which can be
constrained for constant-time operation. RSLAM is able to close
loops, which allows active areas to be expanded at both sides of a
loop, but global consistency is not enforced. The recent S-PTAM by
Pire et al. [10] performs local BA; however, it lacks large loop
closing. Similar to these approaches, we perform BA in a local
set of keyframes so that the complexity is independent of the
map size and we can operate in large environments. However,
our goal is to build a globally consistent map. When closing
a loop, our system first aligns both sides, similar to RSLAM,
so that the tracking is able to continue localizing using the
old map, and then performs a pose-graph optimization that
minimizes the drift accumulated in the loop, followed by full
BA.
¹ https://github.com/raulmur/ORB_SLAM2
The recent Stereo LSD-SLAM of Engel et al. [11] is a
semi-dense direct approach that minimizes photometric error
in image regions with high gradient. Not relying on features,
the method is expected to be more robust to motion blur or
poorly-textured environments. However, as a direct method, its
performance can be severely degraded by unmodeled effects
like rolling shutter or non-Lambertian reflectance.
B. RGB-D SLAM
One of the earliest and most famed RGB-D SLAM systems
was the KinectFusion of Newcombe et al. [4]. This method
fused all depth data from the sensor into a volumetric dense
model that is used to track the camera pose using ICP. This
system was limited to small workspaces due to its volumetric
representation and the lack of loop closing. Kintinuous by
Whelan et al. [12] was able to operate in large environments
by using a rolling cyclical buffer and included loop closing
using place recognition and pose graph optimization.
Probably the first popular open-source system was the
RGB-D SLAM of Endres et al. [13]. This is a feature-based
system, whose front-end computes frame-to-frame motion by
feature matching and ICP. The back-end performs pose-graph
optimization with loop closure constraints from a heuristic
search. Similarly the back-end of DVO-SLAM by Kerl et al.
[14] optimizes a pose-graph where keyframe-to-keyframe con-
straints are computed from a visual odometry that minimizes
both photometric and depth error. DVO-SLAM also searches
for loop candidates in a heuristic fashion over all previous
frames, instead of relying on place recognition.
The recent ElasticFusion of Whelan et al. [15] builds a
surfel-based map of the environment. This is a map-centric
approach that forgets poses and performs loop closing by applying
a non-rigid deformation to the map, instead of a standard
pose-graph optimization. The detailed reconstruction and
localization accuracy of this system are impressive, but the current
implementation is limited to room-size maps as the complexity
scales with the number of surfels in the map.
As proposed by Strasdat et al. [8], our ORB-SLAM2 uses
depth information to synthesize a stereo coordinate for
extracted features on the image. This way our system is agnostic
to the input being stereo or RGB-D. Unlike all the above
methods, our back-end is based on bundle adjustment and
builds a globally consistent sparse reconstruction. Therefore
our method is lightweight and works with standard CPUs. Our
goal is long-term and globally consistent localization instead
of building the most detailed dense reconstruction. However,
from the highly accurate keyframe poses one could fuse depth
maps and get an accurate reconstruction on-the-fly in a local area,
or post-process the depth maps from all keyframes after a full
BA and get an accurate 3D model of the whole scene.
(a) System Threads and Modules.
(b) Input pre-processing
Fig. 2. ORB-SLAM2 is composed of three main parallel threads: tracking, local mapping and loop closing, which can create a fourth thread to perform
full BA after a loop closure. The tracking thread pre-processes the stereo or RGB-D input so that the rest of the system operates independently of the input
sensor. Although it is not shown in this figure, ORB-SLAM2 also works with a monocular input as in [1].
III. ORB-SLAM2
ORB-SLAM2 for stereo and RGB-D cameras is built on
our monocular feature-based ORB-SLAM [1], whose main
components are summarized here for reader convenience. A
general overview of the system is shown in Fig. 2. The system
has three main parallel threads: 1) the tracking to localize
the camera with every frame by finding feature matches to
the local map and minimizing the reprojection error applying
motion-only BA, 2) the local mapping to manage the local
map and optimize it, performing local BA, 3) the loop closing
to detect large loops and correct the accumulated drift by
performing a pose-graph optimization. This thread launches
a fourth thread to perform full BA after the pose-graph
optimization, to compute the optimal structure and motion
solution.
The system has embedded a Place Recognition module
based on DBoW2 [16] for relocalization, in case of tracking
failure (e.g. an occlusion) or for reinitialization in an already
mapped scene, and for loop detection. The system maintains
a covisibility graph [8] that links any two keyframes observing
common points and a minimum spanning tree connecting
all keyframes. These graph structures allow local
windows of keyframes to be retrieved, so that tracking and local
mapping operate locally, which makes it possible to work in large
environments. They also serve as the structure for the pose-graph
optimization performed when closing a loop.
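As an illustration of the covisibility graph described above, the following minimal sketch builds the graph from keyframe observations and retrieves a local window. The data layout (a dictionary from keyframe id to the set of observed map point ids), the function names and the use of the shared-point count as edge weight are assumptions made for this example, not the released implementation.

```python
from collections import defaultdict
from itertools import combinations


def build_covisibility(observations, min_shared=1):
    """Return {(kf_a, kf_b): shared_points} for keyframes seeing common points.

    observations: dict keyframe id -> set of map point ids (hypothetical layout).
    min_shared:   minimum number of common points to keep an edge (assumption).
    """
    # Invert the observations: for each point, which keyframes see it?
    seen_by = defaultdict(set)
    for keyframe, points in observations.items():
        for point in points:
            seen_by[point].add(keyframe)

    # Every pair of keyframes that shares a point gains one unit of weight.
    weights = defaultdict(int)
    for keyframes in seen_by.values():
        for a, b in combinations(sorted(keyframes), 2):
            weights[(a, b)] += 1
    return {edge: w for edge, w in weights.items() if w >= min_shared}


def local_window(keyframe, graph):
    """Keyframes directly connected to `keyframe` in the covisibility graph."""
    return {b if a == keyframe else a for (a, b) in graph if keyframe in (a, b)}


observations = {0: {1, 2, 3}, 1: {2, 3, 4}, 2: {4, 5}, 3: {9}}
graph = build_covisibility(observations)
window = local_window(1, graph)
```

Here keyframe 3 shares no point with the others, so it has no edges and never enters a local window.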
The system uses the same ORB features [17] for tracking,
mapping and place recognition tasks. These features are robust
to rotation and scale and present good invariance to camera
auto-gain and auto-exposure, and illumination changes. Moreover,
they are fast to extract and match, allowing for real-time
operation, and show good precision/recall performance in
bag-of-words place recognition [18].
In the rest of this section we present how stereo/depth
information is exploited and which elements of the system
are affected. For a detailed description of each system block,
we refer the reader to our monocular publication [1].
A. Monocular, Close Stereo and Far Stereo Keypoints
ORB-SLAM2 as a feature-based method pre-processes the
input to extract features at salient keypoint locations, as shown
in Fig. 2b. The input images are then discarded and all system
operations are based on these features, so that the system is
independent of the sensor being stereo or RGB-D. Our system
handles monocular and stereo keypoints, which are further
classified as close or far.
Stereo keypoints are defined by three coordinates xs =
(uL, vL, uR), where (uL, vL) are the coordinates in the left image
and uR is the horizontal coordinate in the right image. For stereo
cameras, we extract ORB in both images and for every left
ORB we search for a match in the right image. This can
be done very efficiently assuming stereo rectified images,
so that epipolar lines are horizontal. We then generate the
stereo keypoint with the coordinates of the left ORB and the
horizontal coordinate of the right match, which is subpixel
refined by patch correlation. For RGB-D cameras, we extract
ORB features on the RGB image and, as proposed by Strasdat
et al. [8], for each feature with coordinates (uL, vL) we
transform its depth value d into a virtual right coordinate:
$$u_R = u_L - \frac{f_x b}{d} \tag{1}$$
where fx is the horizontal focal length and b is the baseline
between the structured light projector and the infrared camera,
which we approximate to 8cm for Kinect and Asus Xtion.
The uncertainty of the depth sensor is represented by the
uncertainty of the virtual right coordinate. In this way, features
from stereo and RGB-D input are handled equally by the rest
of the system.
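The conversion in Eq. (1) can be sketched as follows. The function name, the handling of invalid depth values and the example focal length are assumptions made for illustration. The 8 cm default baseline is the approximation quoted in the text for Kinect and Asus Xtion.

```python
def virtual_right_coordinate(u_left, depth, fx, baseline=0.08):
    """Eq. (1): u_R = u_L - fx * b / d.

    u_left:   horizontal pixel coordinate of the feature in the RGB image
    depth:    sensor depth d in metres (None or <= 0 is treated as invalid)
    fx:       horizontal focal length in pixels
    baseline: projector-to-infrared-camera baseline b in metres
    Returns None when the depth is invalid, so the caller can treat the
    feature as a monocular keypoint (assumed convention for this sketch).
    """
    if depth is None or depth <= 0.0:
        return None
    return u_left - fx * baseline / depth


# Example with a hypothetical focal length of 525 pixels: a feature at
# u_L = 320 with a depth of 2 m gets a virtual disparity of 21 pixels.
u_right = virtual_right_coordinate(320.0, 2.0, fx=525.0)
```

A feature with valid depth thus becomes a triple (uL, vL, uR), exactly like a matched stereo keypoint.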
A stereo keypoint is classified as close if its associated depth
is less than 40 times the stereo/RGB-D baseline, as suggested
in [5], otherwise it is classified as far. Close keypoints can
be safely triangulated from one frame as depth is accurately
estimated and provide scale, translation and rotation informa-
tion. On the other hand far points provide accurate rotation
information but weaker scale and translation information. We
triangulate far points when they are supported by multiple
views.
Monocular keypoints are defined by two coordinates xm =
(uL, vL) on the left image and correspond to all those ORB
for which a stereo match could not be found or that have
an invalid depth value in the RGB-D case. These points are
only triangulated from multiple views and do not provide
scale information, but contribute to the rotation and translation
estimation.
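The keypoint classification described above can be summarized in a short sketch. The function names and the string labels are hypothetical. The factor of 40 is the threshold stated in the text, and the depth formula follows from the rectified stereo geometry (disparity uL − uR = fx·b / Z).

```python
CLOSE_DEPTH_FACTOR = 40.0  # close if depth < 40 x baseline, as stated in the text


def stereo_depth(u_left, u_right, fx, baseline):
    """Depth from a rectified stereo match; None if the disparity is not positive."""
    disparity = u_left - u_right
    if disparity <= 0.0:
        return None
    return fx * baseline / disparity


def classify_keypoint(depth, baseline):
    """Label a keypoint as 'monocular', 'close' or 'far' (labels are illustrative)."""
    if depth is None:
        return "monocular"  # no stereo match or invalid RGB-D depth
    if depth < CLOSE_DEPTH_FACTOR * baseline:
        return "close"      # can be triangulated from a single frame
    return "far"            # triangulated only when supported by multiple views


# With a 0.5 m baseline the close/far boundary lies at 20 m.
depth = stereo_depth(u_left=410.0, u_right=400.0, fx=700.0, baseline=0.5)
label = classify_keypoint(depth, baseline=0.5)
```

With a baseline of about 54 cm, as in KITTI, the same rule places the boundary at roughly 21.6 m.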
B. System Bootstrapping
One of the main benefits of using stereo or RGB-D cameras
is that, by having depth information from just one frame, we
do not need a specific structure from motion initialization as
in the monocular case. At system startup we create a keyframe
with the first frame, set its pose to the origin, and create an
initial map from all stereo keypoints.
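A minimal sketch of this bootstrapping step is given below, assuming the rectified pinhole stereo model used in the rest of this section. The back-projection is the inverse of the stereo projection function (Z = fx·b / (uL − uR)). The function names, the tuple-based data layout and the camera parameters in the example are hypothetical.

```python
def backproject_stereo(u_left, v_left, u_right, fx, fy, cx, cy, baseline):
    """Back-project a stereo keypoint (uL, vL, uR) to a 3D point in the camera frame."""
    disparity = u_left - u_right
    if disparity <= 0.0:
        return None  # depth not observable from this keypoint
    z = fx * baseline / disparity
    return ((u_left - cx) * z / fx, (v_left - cy) * z / fy, z)


def bootstrap_map(stereo_keypoints, fx, fy, cx, cy, baseline):
    """Create the first keyframe at the origin and an initial map from one frame."""
    identity_rotation = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
    origin = (0.0, 0.0, 0.0)
    points = []
    for u_left, v_left, u_right in stereo_keypoints:
        point = backproject_stereo(u_left, v_left, u_right, fx, fy, cx, cy, baseline)
        if point is not None:
            # The first keyframe defines the world frame, so camera
            # coordinates are already world coordinates.
            points.append(point)
    return (identity_rotation, origin), points


keypoints = [(420.0, 240.0, 395.0), (300.0, 200.0, 300.0)]
pose, initial_map = bootstrap_map(keypoints, fx=500.0, fy=500.0,
                                  cx=320.0, cy=240.0, baseline=0.1)
```

The second keypoint has zero disparity and is therefore skipped, so the initial map contains a single point.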
C. Bundle Adjustment with Monocular and Stereo Constraints
Our system performs BA to optimize the camera pose in the
tracking thread (motion-only BA), to optimize a local window
of keyframes and points in the local mapping thread (local
BA), and after a loop closure to optimize all keyframes and
points (full BA). We use the Levenberg–Marquardt method
implemented in g2o [19].
Motion-only BA optimizes the camera orientation R ∈ SO(3) and
position t ∈ ℝ³, minimizing the reprojection error
between matched 3D points X^i ∈ ℝ³ in world coordinates and
keypoints x^i_(·), either monocular x^i_m ∈ ℝ² or stereo x^i_s ∈ ℝ³,
with i ∈ X the set of all matches:
$$\{R, t\} = \operatorname*{argmin}_{R,t} \sum_{i \in \mathcal{X}} \rho\left( \left\| x^i_{(\cdot)} - \pi_{(\cdot)}\left(R X^i + t\right) \right\|^2_{\Sigma} \right) \tag{2}$$
where ρ is the robust Huber cost function and Σ the covariance
matrix associated to the scale of the keypoint. The projection
functions π(·), monocular πm and rectified stereo πs, are
defined as follows:
$$\pi_m\left(\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}\right) = \begin{bmatrix} f_x \frac{X}{Z} + c_x \\ f_y \frac{Y}{Z} + c_y \end{bmatrix}, \quad \pi_s\left(\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}\right) = \begin{bmatrix} f_x \frac{X}{Z} + c_x \\ f_y \frac{Y}{Z} + c_y \\ f_x \frac{X - b}{Z} + c_x \end{bmatrix} \tag{3}$$
where (fx, fy) is the focal length, (cx, cy) is the principal
point and b the baseline, all known from calibration.
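The projection functions of Eq. (3) and the robust cost of Eq. (2) can be evaluated as in the sketch below. It only computes the cost for a given pose; the actual minimization is done with Levenberg–Marquardt in g2o and is not reproduced here. The isotropic covariance Σ = σ²I, the Huber threshold `delta`, the function names and the layout of `matches` are assumptions made for this example.

```python
import math


def pi_m(p, fx, fy, cx, cy):
    """Monocular projection of a camera-frame point p = (X, Y, Z), Eq. (3)."""
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)


def pi_s(p, fx, fy, cx, cy, b):
    """Rectified stereo projection of a camera-frame point, Eq. (3)."""
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy, fx * (x - b) / z + cx)


def huber(e2, delta):
    """Huber cost of a squared error: quadratic near zero, linear in the tails."""
    e = math.sqrt(e2)
    return e2 if e <= delta else 2.0 * delta * e - delta * delta


def transform(R, t, X):
    """Compute R X + t for a 3x3 rotation R (nested tuples) and 3-vectors t, X."""
    return tuple(sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3))


def motion_only_cost(R, t, matches, fx, fy, cx, cy, b, delta):
    """Value of the cost in Eq. (2) for the pose (R, t).

    matches: list of (X_world, observation, sigma2); a 2-tuple observation is
    treated as monocular and a 3-tuple as stereo. sigma2 is the keypoint
    variance, with an isotropic covariance assumed.
    """
    total = 0.0
    for X_world, obs, sigma2 in matches:
        p_cam = transform(R, t, X_world)
        if len(obs) == 2:
            pred = pi_m(p_cam, fx, fy, cx, cy)
        else:
            pred = pi_s(p_cam, fx, fy, cx, cy, b)
        e2 = sum((o - p) ** 2 for o, p in zip(obs, pred)) / sigma2
        total += huber(e2, delta)
    return total


I3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
cam = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0, b=0.1)
matches = [((0.0, 0.0, 2.0), (321.0, 240.0), 1.0),         # monocular, 1 px off
           ((0.0, 0.0, 2.0), (320.0, 240.0, 295.0), 1.0)]  # stereo, exact
cost = motion_only_cost(I3, (0.0, 0.0, 0.0), matches, delta=2.0, **cam)
```

In this example only the monocular match contributes to the cost, since the stereo observation coincides with its predicted projection.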
Local BA optimizes a set of covisible keyframes KL and
all points seen in those keyframes PL. All other keyframes
KF , not in KL, observing points in PL contribute to the cost
function but remain fixed in the optimization. Defining Xk as
the set of matches between points in PL and keypoints in a
keyframe k, the optimization problem is the following:
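$$\{X^i, R_l, t_l \mid i \in \mathcal{P}_L, l \in \mathcal{K}_L\} = \operatorname*{argmin}_{X^i, R_l, t_l} \sum_{k \in \mathcal{K}_L \cup \mathcal{K}_F} \sum_{j \in \mathcal{X}_k} \rho\left(E_{kj}\right)$$
$$E_{kj} = \left\| x^j_{(\cdot)} - \pi_{(\cdot)}\left(R_k X^j + t_k\right) \right\|^2_{\Sigma} \tag{4}$$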
Full BA is the specific case of local BA, where all
keyframes and points in the map are optimized, except the
origin keyframe that is fixed to eliminate the gauge freedom.
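The sets involved in local BA can be assembled as in the following sketch. It assumes that KL consists of a keyframe together with its neighbours in the covisibility graph, and that observations are stored as a dictionary from keyframe id to the set of observed point ids. Both are assumptions made for illustration, as are the function and variable names.

```python
def local_ba_sets(current_kf, covisible, observations):
    """Return (K_L, P_L, K_F) for a local BA around `current_kf`.

    covisible:    dict keyframe id -> iterable of covisible keyframe ids
    observations: dict keyframe id -> set of observed map point ids
    """
    # K_L: keyframes whose poses are optimized.
    k_l = {current_kf} | set(covisible.get(current_kf, ()))
    # P_L: all points seen in those keyframes, also optimized.
    p_l = set().union(*(observations[k] for k in k_l))
    # K_F: other keyframes observing P_L; they add residuals but stay fixed.
    k_f = {k for k, points in observations.items()
           if k not in k_l and points & p_l}
    return k_l, p_l, k_f


observations = {0: {1, 2}, 1: {2, 3}, 2: {3, 4}, 3: {9}}
covisible = {1: [0]}
k_l, p_l, k_f = local_ba_sets(1, covisible, observations)
```

Keyframe 2 observes point 3, so it constrains the problem as a fixed keyframe, while keyframe 3 shares no point with the local window and is ignored. Full BA corresponds to putting every keyframe except the origin into the optimized set.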
D. Loop Closing and Full BA
Loop closing is performed in two steps: first, a loop has to
be detected and validated, and second, the loop is corrected
by optimizing a pose-graph. In contrast to monocular
ORB-SLAM, where scale drift may occur [20], the stereo/depth
information makes scale observable, so the geometric validation
and pose-graph optimization no longer need to deal
with scale drift and are based on rigid body transformations
instead of similarities.
In ORB-SLAM2 we have incorporated a full BA optimization
after the pose-graph to achieve the optimal solution. This
optimization might be very costly and therefore we perform it
in a separate thread, allowing the system to continue creating
the map and detecting loops. However, this brings the challenge
of merging the bundle adjustment output with the current state
of the map. If a new loop is detected while the optimization
is running, we abort the optimization and proceed to close the
loop, which will launch the full BA optimization again. When
the full BA finishes, we need to merge the updated subset
of keyframes and points optimized by the full BA with the
non-updated keyframes and points that were inserted while
the optimization was running. This is done by propagating
the correction of updated keyframes (i.e. the transformation
from the non-optimized to the optimized pose) to non-updated
keyframes through the spanning tree. Non-updated points
are transformed according to the correction applied to their
reference keyframe.
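The propagation of corrections through the spanning tree can be sketched as follows. Poses are assumed to be 4×4 world-to-camera rigid transformations stored as nested lists, and the root of the tree is assumed to be among the optimized keyframes. These conventions and all names are choices made for this example rather than details taken from the text.

```python
def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rigid_inverse(T):
    """Inverse of a rigid transformation [R t; 0 1], i.e. [R^T  -R^T t; 0 1]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def propagate_corrections(old_poses, optimized, children, root):
    """Return corrected poses for all keyframes reachable from `root`.

    old_poses: id -> pose before full BA (all keyframes)
    optimized: id -> pose after full BA (only keyframes that were optimized)
    children:  id -> list of child ids in the spanning tree
    """
    new_poses = dict(optimized)
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in children.get(parent, ()):
            if child not in new_poses:
                # Keep the relative pose between child and parent and attach
                # it to the corrected parent pose.
                relative = matmul(old_poses[child], rigid_inverse(old_poses[parent]))
                new_poses[child] = matmul(relative, new_poses[parent])
            stack.append(child)
    return new_poses
```

A non-updated point could then be corrected in the same spirit, by expressing it in the old frame of its reference keyframe and mapping it back to the world with the corrected pose.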
E. Keyframe Insertion
ORB-SLAM2 follows the policy introduced in monocular
ORB-SLAM of inserting keyframes very often and culling
redundant ones afterwards. The distinction between close and
far stereo points allows us to introduce a new condition
for keyframe insertion, which can be critical in challenging
environments where a big part of the scene is far from the
stereo sensor, as shown in Fig. 3. In such environments we
need to have a sufficient amount of close points to accurately
estimate translation. Therefore, if the number of tracked close
points drops below τt and the frame could create at least τc
new close stereo points, the system will insert a new keyframe.
We empirically found that τt = 100 and τc = 70 work well
in all our experiments.
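The new insertion condition can be written as a simple predicate. The function and argument names are hypothetical, and the remaining keyframe insertion conditions inherited from monocular ORB-SLAM [1] are not shown.

```python
TAU_T = 100  # minimum number of tracked close points (value from the text)
TAU_C = 70   # minimum number of new close points the frame could create


def close_points_require_keyframe(n_tracked_close, n_creatable_close,
                                  tau_t=TAU_T, tau_c=TAU_C):
    """True if the close-point condition calls for inserting a keyframe."""
    return n_tracked_close < tau_t and n_creatable_close >= tau_c


# Few tracked close points, but the frame offers enough new ones: insert.
insert_now = close_points_require_keyframe(n_tracked_close=60, n_creatable_close=85)
```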
Fig. 3. Tracked points in KITTI 01 [2]. Green points have a depth less than
40 times the stereo baseline, while blue points are further away. In this kind of
sequence it is important to insert keyframes often enough so that the amount
of close points allows for accurate translation estimation. Far points contribute
to estimating orientation but provide weak information for translation and scale.
F. Localization Mode
We incorporate a Localization Mode which can be useful
for lightweight long-term localization in well mapped areas,
as long as there are no significant changes in the environment.
In this mode the local mapping and loop closing threads
are deactivated and the camera is continuously localized by
the tracking, using relocalization if needed. In this mode the
tracking leverages visual odometry matches and matches to
map points. Visual odometry matches are matches between
ORB in the current frame and 3D points created in the previous
frame from the stereo/depth information. These matches make
the localization robust to unmapped regions, but drift can be
accumulated. Map point matches ensure drift-free localization
to the existing map. This mode is demonstrated in the
accompanying video.
IV. EVALUATION
We have evaluated ORB-SLAM2 on three popular datasets
and compared it to other state-of-the-art SLAM systems, always
using the results published by the original authors and
standard evaluation metrics in the literature. We have run
ORB-SLAM2 on an Intel Core i7-4790 desktop computer with
16 GB RAM. In order to account for the non-deterministic
nature of the multi-threading system, we run each sequence
5 times and show median results for the accuracy of the
estimated trajectory. Our open-source implementation includes
the calibration and instructions to run the system in all these
datasets.
A. KITTI Dataset
The KITTI dataset [2] contains stereo sequences recorded
from a car in urban and highway environments. The stereo
sensor has a ∼54cm baseline and works at 10Hz with a
resolution after rectification of 1240 × 376 pixels. Sequences
00, 02, 05, 06, 07 and 09 contain loops. Our ORB-SLAM2
detects all loops and is able to reuse its map afterwards,
except for sequence 09 where the loop happens in very few
frames at the end of the sequence. Table I shows results in
the 11 training sequences, which have public ground-truth,
compared to the state-of-the-art Stereo LSD-SLAM [11], to
our knowledge the only stereo SLAM showing detailed results
for all sequences. We use two different metrics, the absolute
translation RMSE tabs proposed in [3], and the average relative
translation trel and rotation rrel errors proposed in [2]. Our
system outperforms Stereo LSD-SLAM in most sequences,
and achieves in general a relative error lower than 1%. The
sequence 01, see Fig. 3, is the only highway sequence in
the training set and the translation error is slightly worse.
Translation is harder to estimate in this sequence because very
few close points can be tracked, due to high speed and low
frame-rate. However, orientation can be accurately estimated,
achieving an error of 0.21 degrees per 100 meters, as there are
many far points that can be tracked for a long time. Fig. 4 shows
some examples of estimated trajectories.
Compared to the monocular results presented in [1], the
proposed stereo version is able to process the sequence 01
where the monocular system failed. In this highway sequence,
see Fig. 3, close points are in view only for a few frames.
The ability of the stereo version to create points from just
one stereo keyframe instead of the delayed initialization of
the monocular, consisting of finding matches between two
keyframes, is critical in this sequence not to lose tracking.
Moreover, the stereo system estimates the map and trajectory
with metric scale and does not suffer from scale drift, as seen
in Fig. 5.
TABLE I
COMPARISON OF ACCURACY IN THE KITTI DATASET.
            ORB-SLAM2 (stereo)               Stereo LSD-SLAM
Sequence    trel    rrel          tabs       trel    rrel          tabs
            (%)     (deg/100m)    (m)        (%)     (deg/100m)    (m)
00          0.70    0.25          1.3        0.63    0.26          1.0
01          1.39    0.21          10.4       2.36    0.36          9.0
02          0.76    0.23          5.7        0.79    0.23          2.6
03          0.71    0.18          0.6        1.01    0.28          1.2
04          0.48    0.13          0.2        0.38    0.31          0.2
05          0.40    0.16          0.8        0.64    0.18          1.5
06          0.51    0.15          0.8        0.71    0.18          1.3
07          0.50    0.28          0.5        0.56    0.29          0.5
08          1.05    0.32          3.6        1.11    0.31          3.9
09          0.87    0.27          3.2        1.14    0.25          5.6
10          0.60    0.27          1.0        0.72    0.33          1.5
Fig. 4. Estimated trajectory (black) and ground-truth (red) in KITTI 00, 01,
05 and 07.
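The absolute translation RMSE and the median over repeated runs used in this evaluation can be sketched as follows. The sketch assumes that the estimated and ground-truth positions are already associated by timestamp and expressed in a common frame; the trajectory alignment that normally precedes this computation is omitted. The function name and the sample numbers are made up for illustration and are not results from the paper.

```python
import math
import statistics


def translation_rmse(estimated, ground_truth):
    """RMSE of the position error between two associated, aligned trajectories.

    estimated, ground_truth: equally long lists of (x, y, z) positions in metres.
    """
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [sum((e - g) ** 2 for e, g in zip(p, q))
                      for p, q in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical trajectory with a 1 m error at the second pose.
rmse = translation_rmse([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                        [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)])

# Median over five runs of the same sequence (made-up values).
reported = statistics.median([1.3, 1.2, 1.5, 1.4, 1.1])
```

Reporting the median rather than a single run reduces the influence of the non-deterministic thread scheduling on the published number.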
Fig. 5. Estimated trajectory (black) and ground-truth (red) in KITTI 08. Left:
monocular ORB-SLAM [1], right: ORB-SLAM2 (stereo). Monocular ORB-SLAM
suffers from severe scale drift in this sequence, especially at the turns.
In contrast the proposed stereo version is able to estimate the true scale of
the trajectory and map without scale drift.
B. EuRoC Dataset
The recent EuRoC dataset [21] contains 11 stereo sequences
recorded from a micro aerial vehicle (MAV) flying around two
different rooms and a large industrial environment. The stereo
sensor has a ∼11cm baseline and provides WVGA images
at 20Hz. The sequences are classified as easy, medium and
difficult depending on MAV’s speed, illumination and scene
texture. In all sequences the MAV revisits the environment
and ORB-SLAM2 is able to reuse its map, closing loops
when necessary. Table II shows absolute translation RMSE
of ORB-SLAM2 for all sequences, comparing to Stereo
LSD-SLAM, for the results provided in [11]. ORB-SLAM2 achieves
a localization precision of a few centimeters and is more
accurate than Stereo LSD-SLAM. Our tracking gets lost in
some parts of V2_03_difficult due to severe motion blur. As
shown in [22], this sequence can be processed using IMU
information. Fig. 6 shows examples of computed trajectories
compared to the ground-truth.
TABLE II
EUROC DATASET. COMPARISON OF TRANSLATION RMSE (m).
Sequence           ORB-SLAM2 (stereo)    Stereo LSD-SLAM
V1_01_easy         0.035                 0.066
V1_02_medium       0.020                 0.074
V1_03_difficult    0.048                 0.089
V2_01_easy         0.037                 -
V2_02_medium       0.035                 -
V2_03_difficult    X                     -
MH_01_easy         0.035                 -
MH_02_easy         0.018                 -
MH_03_medium       0.028                 -
MH_04_difficult    0.119                 -
MH_05_difficult    0.060                 -
Fig. 6. Estimated trajectory (black) and groundtruth (red) in EuRoC
V1_02_medium, V2_02_medium, MH_03_medium and MH_05_difficult.
C. TUM RGB-D Dataset
The TUM RGB-D dataset [3] contains indoor sequences
from RGB-D sensors grouped in several categories to evaluate
object reconstruction and SLAM/odometry methods under
different texture, illumination and structure conditions. We show
results in a subset of sequences where most RGB-D methods
are usually evaluated. In Table III we compare our accuracy
to the following state-of-the-art methods: ElasticFusion [15],
Kintinuous [12], DVO-SLAM [14] and RGB-D SLAM [13].
Our method is the only one based on bundle adjustment and
outperforms the other approaches in most sequences. As we
already noticed for RGB-D SLAM results in [1], depthmaps
for freiburg2 sequences have a 4% scale bias, probably coming
from miscalibration, that we have compensated in our runs
and could partly explain our significantly better results. Fig.
7 shows the point clouds that result from backprojecting the
sensor depth maps from the computed keyframe poses in four
sequences. The good definition and the straight contours of
desks and posters attest to the high localization accuracy of our
approach.
TABLE III
TUM RGB-D DATASET. COMPARISON OF TRANSLATION RMSE (m).
Sequence      ORB-SLAM2    Elastic-    Kintinuous    DVO      RGBD
              (RGB-D)      Fusion                    SLAM     SLAM
fr1/desk      0.016        0.020       0.037         0.021    0.026
fr1/desk2     0.022        0.048       0.071         0.046    -
fr1/room      0.047        0.068       0.075         0.043    0.087
fr2/desk      0.009        0.071       0.034         0.017    0.057
fr2/xyz       0.004        0.011       0.029         0.018    -
fr3/office    0.010        0.017       0.030         0.035    -
fr3/nst       0.019        0.016       0.031         0.018    -
Fig. 7. Dense pointcloud reconstructions from estimated keyframe poses and sensor depth maps in TUM RGB-D fr3/office, fr1/room, fr2/desk and fr3/nst.
D. Timing Results
In order to complete the evaluation of the proposed system,
we present in Table IV timing results in three sequences with
different image resolutions and sensors. The mean and two
standard deviation ranges are shown for each thread task.
As these sequences contain one single loop, the full BA and
some tasks of the loop closing thread are executed just once
and only a single time measurement is reported. The average
tracking time per frame is below the inverse of the camera
frame-rate for each sequence, meaning that our system is able
to work in real-time. As ORB extraction in stereo images is
parallelized, it can be seen that extracting 1000 ORB features
in the stereo WVGA images of V2_02 takes a similar time to
extracting the same amount of features in the single VGA image
channel of fr3/office.
The number of keyframes in the loop is shown as reference
for the times related to loop closing. While the loop in KITTI
07 contains more keyframes, the covisibility graph built for
the indoor fr3/office is denser and therefore the loop fusion,
pose-graph optimization and full BA tasks are more expensive.
The higher density of the covisibility graph makes the local
map contain more keyframes and points and therefore local
map tracking and local BA are also more expensive.
V. CONCLUSION
We have presented a full SLAM system for monocular,
stereo and RGB-D sensors, able to perform relocalization, loop
closing and reuse its map in real-time on standard CPUs. We
focus on building globally consistent maps for reliable and
long-term localization in a wide range of environments as
demonstrated in the experiments. The proposed localization
mode, together with the relocalization capability of the system,
yields a very robust, zero-drift, and lightweight localization method
for known environments. This mode can be useful for certain
applications, such as tracking the user viewpoint in virtual
reality in a well-mapped space.
The comparison to the state-of-the-art shows that ORB-
SLAM2 achieves in most cases the highest accuracy. In the
KITTI visual odometry benchmark ORB-SLAM2 is currently
the best stereo SLAM solution. Crucially, compared with the
stereo visual odometry methods that have flourished in recent
years, ORB-SLAM2 achieves zero-drift localization in already
mapped areas.
Surprisingly, our RGB-D results demonstrate that if the
most accurate camera localization is desired, bundle adjustment
performs better than direct methods or ICP, with the
additional advantage of being less computationally expensive,
not requiring GPU processing to operate in real-time.
We have released the source code of our system, with
examples and instructions so that it can be easily used by other
researchers. ORB-SLAM2 is, to the best of our knowledge,
the first open-source visual SLAM system that can work
with monocular, stereo and RGB-D inputs. Moreover,
our source code contains an example of an augmented reality
application² using a monocular camera to show the potential
of our solution.
Future extensions might include, to name some examples,
non-overlapping multi-camera, fisheye or omnidirectional
camera support, large scale dense fusion, cooperative mapping
or increased motion blur robustness.
² https://youtu.be/kPwy8yA4CKM