Monocular Visual–Inertial State Estimation With Online Initialization and Camera–IMU Extrinsic Calibration

Zhenfei Yang, Student Member, IEEE, and Shaojie Shen, Member, IEEE

Abstract—There have been increasing demands for developing microaerial vehicles with vision-based autonomy for search and rescue missions in complex environments. In particular, the monocular visual–inertial system (VINS), which consists of only an inertial measurement unit (IMU) and a camera, forms a great lightweight sensor suite due to its low weight and small footprint. In this paper, we address two challenges for rapid deployment of monocular VINS: 1) the initialization problem and 2) the calibration problem. We propose a methodology that is able to initialize velocity, gravity, visual scale, and camera–IMU extrinsic calibration on the fly. Our approach operates in natural environments and does not use any artificial markers. It also does not require any prior knowledge about the mechanical configuration of the system. It is a significant step toward plug-and-play and highly customizable visual navigation for mobile robots. We show through online experiments that our method leads to accurate calibration of the camera–IMU transformation, with errors less than 0.02 m in translation and 1° in rotation. We compare our method with a state-of-the-art marker-based offline calibration method and show superior results. We also demonstrate the performance of the proposed approach in large-scale indoor and outdoor experiments.

Note to Practitioners—This paper presents a methodology for online state estimation in natural environments using only a camera and a low-cost micro-electro-mechanical systems (MEMS) IMU. It focuses on addressing the problems of online estimator initialization, sensor extrinsic calibration, and nonlinear optimization with online refinement of calibration parameters. This paper is particularly useful for applications that have stringent size, weight, and power constraints. It aims for rapid deployment of robot platforms with robust state estimation capabilities with almost no setup, calibration, or initialization overhead. The proposed method can be used in platforms including handheld devices, aerial robots, and other small-scale mobile platforms, with applications in monitoring, inspection, and search and rescue.

Index Terms—Calibration, estimator initialization, sensor fusion, state estimation, visual navigation, visual–inertial systems (VINSs).

Manuscript received January 3, 2016; accepted March 10, 2016. This paper was recommended for publication by Associate Editor A. M. Hsieh and Editor Y. Sun upon evaluation of the reviewers' comments. This work was supported in part by the Hong Kong University of Science and Technology under Grant R9341. This paper is a revised and extended version of "Monocular Visual-Inertial Fusion with Online Initialization and Camera-IMU Calibration," presented at the IEEE International Symposium on Safety, Security, and Rescue Robotics, West Lafayette, IN, USA, October 2015.

The authors are with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong (e-mail: zyangag@connect.ust.hk; eeshaojie@ust.hk).

This paper has supplementary downloadable multimedia material available at http://ieeexplore.ieee.org, provided by the authors. The video presents three experiments demonstrating the performance of our self-calibrating monocular visual-inertial state estimation method: the first details the camera-IMU extrinsic calibration process in a small indoor experiment; the second evaluates the overall system in a large-scale indoor environment with highlights of the online calibration process; the third presents state estimation results in a large-scale outdoor environment using different camera configurations. This material is 52.6 MB in size.

Digital Object Identifier 10.1109/TASE.2016.2550621

I. INTRODUCTION

THERE have been increasing demands for developing high-maneuverability robots, such as microaerial vehicles (MAVs) with vision-based autonomy, for search and rescue missions in confined environments. Such robots and sensor suites should be miniaturized, rapidly deployable, and require minimum maintenance even in hazardous environments. The monocular visual–inertial system (VINS), which consists of only a low-cost MEMS inertial measurement unit (IMU) and a camera, has become a very attractive sensor choice due to its superior size, weight, and power characteristics. In fact, the monocular VINS is the minimum sensor suite that allows both accurate state estimation and sufficient environment awareness.

However, the algorithmic challenges for processing information from a monocular VINS are significantly more involved than for stereo- [2] or RGB-D-based [3] configurations due to the lack of direct observation of visual scale. The performance of state-of-the-art nonlinear monocular VINS estimators [4]–[7] relies heavily on the accuracy of initial values (velocity, attitude, and visual scale) and the calibration of the camera–IMU transformation. In time-critical search and rescue missions, careful initialization (launch) of the robot platform or explicit calibration by professional users is often infeasible. In fact, it is desirable to simply plug sensors onto the MAV, throw it into the air, and have everything operational. This implies that the initialization and the camera–IMU extrinsic calibration procedure should be performed with no prior knowledge about either the dynamical motion or the mechanical configuration of the system.

Our earlier works [7]–[9] focused on initialization and tightly coupled fusion of monocular VINSs, but with the assumption of a known camera–IMU transformation. In this paper, we relax this assumption and propose a method for joint initialization and extrinsic calibration without any prior knowledge about the mechanical configuration of the system.

This paper is a revised version of [10]. Several significant extensions have been made. First, we provide a more thorough discussion of the full monocular VINS pipeline, with additional details on the formulation of IMU preintegration (Section IV) and nonlinear VINS with calibration refinement (Section VI). Second, we compare our approach against Kalibr, a state-of-the-art marker-based offline calibration method [1] (Section VIII-B), and show superior results. We show through online experiments that our method leads to accurate calibration of the camera–IMU transformation with errors of 0.02 m in translation and 1° in rotation. Finally, we present large-scale outdoor experiments performed onboard a quadrotor aerial vehicle with different sensor placements and analyze the results (Section VIII-D). Our approach is a substantial step toward minimum sensing for plug-and-play robotic systems.

We identify the contributions of this paper as fourfold.
• A probabilistic optimization-based initialization procedure that is able to recover all the essential navigation quantities (initial velocity and attitude, visual scale, and camera–IMU calibration) without any prior system knowledge or artificial calibration objects.
• Principal criteria to identify convergence and termination for the initialization procedure.
• A tightly coupled monocular visual–inertial fusion methodology for accurate state estimation with online calibration refinement.
• Online experiments in complex indoor and outdoor environments.

The rest of this paper is organized as follows. In Section II, we discuss the relevant literature. We give an overview of the complete system pipeline in Section III. In Section IV, we present IMU preintegration, a core technique for summarizing and utilizing high-rate IMU measurements without knowing the initial attitude or velocity. We detail our linear initialization and camera–IMU calibration procedure in Section V. A tightly coupled, self-calibrating, nonlinear optimization-based monocular VINS estimator, which is built on top of [7] and [8], is presented in Section VI. We present a two-way marginalization scheme for handling degenerate motions in Section VII, and we discuss implementation details and present experimental results in Section VIII. Finally, this paper is concluded with a discussion of possible future research in Section IX.
II. RELATED WORK

There is a rich body of scholarly work on VINS state estimation with either monocular [4], [5], [11], stereo [6], or RGB-D cameras [3]. We can categorize solutions to VINS as filtering based [4], [5], [11]–[14] or graph optimization/bundle adjustment based [6], [7], [15]. Filtering-based approaches may require fewer computational resources due to the marginalization of past states, but the early fix of linearization points may lead to suboptimal results. On the other hand, graph optimization-based approaches may improve performance via iterative relinearization, at the expense of higher computational demands. In real-world applications, marginalization [16] is usually employed for both filtering- and optimization-based approaches to achieve constant computational complexity. Conditioning is also a popular method within the computer vision community to enforce constant computation [17]. A comparison between filtering- and graph optimization-based approaches [18] demonstrates nearly identical results. However, the platform for verification is equipped only with an optical flow sensor that is insufficient for extended feature tracking. This limits the power of graph-based approaches, as a well-connected graph is never constructed.

Another way to categorize VINS solutions is to consider them as loosely coupled [11], [19] or tightly coupled [4]–[6], [12]–[15], [20]. Loosely coupled approaches usually consist of a standalone vision-only state estimation module, such as PTAM [17] or SVO [21], and a separate IMU fusion module [19]. These approaches do not consider the coupling between visual and inertial information, making them incapable of correcting drifts in the vision-only estimator. Tightly coupled approaches perform systematic fusion of visual and IMU measurements and usually lead to better results [6]. In particular, for the monocular VINS, tightly coupled methods are able to implicitly incorporate the metric scale information from the IMU into scene depth estimation, thus removing the need for explicit visual scale modeling.

However, all the aforementioned VINS solutions rely on accurate initialization of system motions and accurate camera–IMU calibration. This is particularly critical for monocular VINS due to the lack of direct observation of visual scale. There is a wide body of work trying to deal with the velocity, attitude, and visual scale initialization problems for monocular VINS. Pioneering work in this area was proposed by Lupton and Sukkarieh [22], who performed the estimation in the body frame of the first pose in the sliding window. An IMU preintegration technique is proposed to handle multirate sensor measurements. They showed that the nonlinearity of the system mainly arises from rotation drift. Recent results suggest that, by assuming the orientation is known, VINS may be solved in a linear closed form [23]–[27]. It has been shown that both the initial gravity vector and the body frame velocity can be estimated linearly. These results have the significant implication that a good initialization of the VINS problem may actually be unnecessary. In particular, [25] and [26] analytically show conditions under which initial values are solvable.
Fig. 1. Our monocular VINS setup with unknown camera–IMU extrinsic calibration. Colored coordinate axes are plotted (red: x-axis; green: y-axis; blue: z-axis). Note that there are multiple 90° offsets between the camera frame and the IMU frame. The rotation offsets are unknown to the estimator and have to be calibrated online.

However, [23] is limited to using a fixed small number of IMU measurements, which makes it very sensitive to IMU noise. Approaches that utilize multiple IMU measurements in a sliding window [24]–[27] do not scale well to a large number of IMU measurements, since they rely on double integration of accelerometer output over an extended period of time. Moreover, these closed-form approaches do not take the noise characteristics of the system into account, producing suboptimal results.
Fig. 2. Block diagram illustrating the full pipeline of the proposed approach. Each module is discussed in the corresponding section marked in the diagram.

Fig. 3. Example of a sliding window with five IMU states $\mathbf{x}_k$ and two features $f_l$. Dotted lines represent the preintegrated IMU measurements and the visual measurements. Note that there is a constant but unknown camera–IMU extrinsic calibration $\mathbf{x}^b_c$. All aforementioned quantities are jointly estimated in our framework. This sliding window model applies to both the linear initialization (Section V) and the nonlinear optimization (Section VI).

For camera–IMU calibration, Li and Mourikis [5], Weiss [19], and Heng [28] consider incorporating the camera–IMU transformation into the state vector of the nonlinear estimator. However, the convergence of the calibration parameters still depends on the accuracy of the initial values, and the calibration performance is not systematically analyzed in any of these papers. Our work is related to [24], as both aim to jointly initialize the motion of the system as well as the camera–IMU calibration. However, [24] is a geometric method without consideration of sensor noise. In particular, the formulation of IMU measurements in [24] results in unbounded IMU error over time, which leads to degraded performance as more IMU measurements are incorporated. On the other hand, our probabilistic formulation explicitly bounds the error for each measurement using a sliding window approach and is thus able to fuse an extensive number of sensor measurements until good initial values are obtained. Also, [24] only shows results with simulated data, while we present extensive experimental results with real sensor data.

III. OVERVIEW

Our proposed monocular VINS estimator consists of three phases, as illustrated in Fig. 2. The first phase initializes the rotation between the camera and the IMU in a linear fashion (Section V-A). The second phase handles on-the-fly initialization of velocity, attitude, visual scale, and camera–IMU translation with a probabilistic linear sliding window approach (Section V-B), as illustrated in Fig. 3. This phase is an extension of [8] and [9] obtained by relaxing the known camera–IMU calibration assumption. Finally, the third phase, which focuses on high-accuracy nonlinear graph optimization (Fig. 3) for both state estimation and calibration refinement, is detailed in Section VI. Note that the three phases run sequentially and continuously with automatic switching. This means that all the user needs to do is move the monocular VINS sensor suite freely, with sufficient motion, in natural environments. Our estimator is able to automatically identify convergence and switch to the next phase (Sections V-A2 and V-B5). Both the linear and nonlinear estimators utilize a two-way marginalization scheme (Section VII), first proposed in [9], for handling of degenerate motion.

We begin by defining notations.
We consider $(\cdot)^w$ as the earth's inertial frame, $(\cdot)^b$ as the current IMU body frame, and $(\cdot)^{c_k}$ as the camera body frame while taking the $k$th image. We further denote $(\cdot)^{b_k}$ as the IMU body frame while the camera is taking the $k$th image. Note that the IMU usually runs at a higher rate than the camera, and that multiple IMU measurements may exist in the interval $[k, k+1]$. $\mathbf{p}^X_Y$, $\mathbf{v}^X_Y$, and $\mathbf{R}^X_Y$ are the 3-D position, velocity, and rotation of frame $Y$ with respect to frame $X$. Specially, $\mathbf{p}^X_t$ represents the position of the IMU body frame at time $t$ with respect to frame $X$. A similar convention is followed for other quantities.
The camera–IMU transformation is an unknown constant that we denote as $\mathbf{p}^b_c$ and $\mathbf{R}^b_c$. Besides rotation matrices, we also use quaternions ($\mathbf{q} = [q_x, q_y, q_z, q_w]$) to represent rotation; the Hamilton notation is used. $\mathbf{g}^w = [0, 0, g]^T$ is the gravity vector in the world frame, and $\mathbf{g}^{b_k}$ is the earth's gravity vector expressed in the IMU body frame during the $k$th image capture. In addition, we use $\hat{(\cdot)}$ for noisy sensor measurements and preintegrated IMU measurements.

IV. IMU PREINTEGRATION

The IMU preintegration technique was first proposed in [22] and is advanced to consider on-manifold uncertainty propagation in [7] and [8]. Further improvements that incorporate IMU biases and integrate with a full SLAM framework are proposed in [20]. Here, we give an overview of its motivation and usage within our monocular VINS framework.

Given two time instants that correspond to two images, we can write the IMU propagation model for position and velocity in the earth's inertial frame as follows:

$$
\begin{aligned}
\mathbf{p}^w_{b_{k+1}} &= \mathbf{p}^w_{b_k} + \mathbf{v}^w_{b_k}\Delta t + \iint_{t\in[k,k+1]} \big(\mathbf{R}^w_t\,\mathbf{a}^b_t - \mathbf{g}^w\big)\,dt^2 \\
\mathbf{v}^w_{b_{k+1}} &= \mathbf{v}^w_{b_k} + \int_{t\in[k,k+1]} \big(\mathbf{R}^w_t\,\mathbf{a}^b_t - \mathbf{g}^w\big)\,dt
\end{aligned} \tag{1}
$$

where $\mathbf{a}^b_t$ is the instantaneous linear acceleration in the IMU body frame and $\Delta t$ is the time difference between the two image captures. It can be seen that the rotation between the world frame and the body frame is required for IMU state propagation. This global rotation can be determined only with a known initial attitude, which is hard to obtain in many applications. However, as introduced in [7], if the reference frame of the whole system is changed to the frame of the first image capture $b_0$, and the frame for IMU propagation is changed to $b_k$, we can preintegrate the parts of (1) that are related to the linear acceleration $\mathbf{a}$ and the angular velocity $\boldsymbol{\omega}$ as follows:

$$
\begin{aligned}
\boldsymbol{\alpha}^{b_k}_{b_{k+1}} &= \iint_{t\in[k,k+1]} \mathbf{R}^{b_k}_t\,\mathbf{a}^b_t\,dt^2 \\
\boldsymbol{\beta}^{b_k}_{b_{k+1}} &= \int_{t\in[k,k+1]} \mathbf{R}^{b_k}_t\,\mathbf{a}^b_t\,dt \\
\mathbf{R}^{b_k}_{b_{k+1}} &= \int_{t\in[k,k+1]} \mathbf{R}^{b_k}_t\,\lfloor\boldsymbol{\omega}^b_t\rfloor_\times\,dt
\end{aligned} \tag{2}
$$

where $\lfloor\boldsymbol{\omega}^b_t\rfloor_\times$ is the skew-symmetric matrix formed from $\boldsymbol{\omega}^b_t$, and $\mathbf{R}^{b_k}_t$ is the incremental rotation from $b_k$ to the current time $t$, which is available through short-term integration of gyroscope measurements. Therefore, (1) can be rewritten as

$$
\begin{aligned}
\mathbf{p}^{b_0}_{b_{k+1}} &= \mathbf{p}^{b_0}_{b_k} + \mathbf{R}^{b_0}_{b_k}\mathbf{v}^{b_k}_{b_k}\Delta t - \tfrac{1}{2}\mathbf{g}^{b_0}\Delta t^2 + \mathbf{R}^{b_0}_{b_k}\boldsymbol{\alpha}^{b_k}_{b_{k+1}} \\
\mathbf{v}^{b_{k+1}}_{b_{k+1}} &= \mathbf{R}^{b_{k+1}}_{b_k}\big(\mathbf{v}^{b_k}_{b_k} - \mathbf{R}^{b_k}_{b_0}\,\mathbf{g}^{b_0}\Delta t + \boldsymbol{\beta}^{b_k}_{b_{k+1}}\big) \\
\mathbf{R}^{b_0}_{b_{k+1}} &= \mathbf{R}^{b_0}_{b_k}\,\mathbf{R}^{b_k}_{b_{k+1}}
\end{aligned} \tag{3}
$$

It can be seen that the preintegration terms (2) can be obtained solely from IMU measurements within $[k, k+1]$. With this formulation, the dependency on global orientation is removed. Therefore, using the camera–IMU rotation calibration obtained in Section V-A, we are able to formulate the joint problem of monocular VINS initialization and camera–IMU translation calibration as a linear problem that can be solved without any prior knowledge of the system (Section V-B). This preintegration technique also enables summarizing multiple IMU measurements into a standalone measurement term. This new measurement term can be relinearized during the iterative nonlinear optimization (Section VI) to achieve better accuracy. This is in contrast to the existing graph-based VINS approach [6] that performs IMU integration in the global frame. Detailed treatments of uncertainty propagation during IMU preintegration for both the linear (Section V-B) and nonlinear (Section VI) estimators are presented in the corresponding sections.
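To make the preintegration concrete, the following Python sketch accumulates the terms in (2) from the raw IMU samples between two images, using simple first-order Euler integration. It is a minimal illustration, not the paper's implementation: noise handling and covariance propagation are omitted, and all function and variable names are our own.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix such that skew(a) @ b == np.cross(a, b)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def preintegrate(accel, gyro, dt):
    """Accumulate alpha, beta, R of (2) over one image interval [k, k+1].

    accel, gyro: (N, 3) arrays of raw body-frame IMU samples.
    dt: IMU sampling period in seconds.
    Returns the preintegrals alpha^{b_k}_{b_{k+1}}, beta^{b_k}_{b_{k+1}}
    and the incremental rotation R^{b_k}_{b_{k+1}}.
    """
    alpha, beta, R = np.zeros(3), np.zeros(3), np.eye(3)
    for a, w in zip(accel, gyro):
        a_bk = R @ a                      # rotate sample into frame b_k
        alpha += beta * dt + 0.5 * a_bk * dt**2
        beta += a_bk * dt
        # First-order integration of dR/dt = R * skew(omega); a real
        # implementation would renormalize or use the exponential map.
        R = R @ (np.eye(3) + skew(w) * dt)
    return alpha, beta, R
```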
V. ESTIMATOR INITIALIZATION AND CAMERA–IMU EXTRINSIC CALIBRATION

We now detail our online estimator initialization approach to recover all critical states, including velocity, attitude (gravity vector), depth of features, and camera–IMU extrinsic calibration. Our initialization procedure does not require any prior knowledge about the mechanical configuration of the sensor suite. It also does not require the estimator to be started from stationary, making it particularly useful for dynamically launching aerial robots in a search and rescue setting. The initialization and calibration process can be formulated as solving two sets of linear systems, which will be discussed in Sections V-A and V-B, respectively.

A. Linear Initialization of Camera–IMU Rotation

The constant camera–IMU rotation offset can be obtained by aligning two rotation sequences from the IMU and the camera.

1) Linear Rotation Calibration: We assume that sufficient features can be tracked and that the incremental rotation between two images $\mathbf{R}^{c_k}_{c_{k+1}}$ can be estimated using the classic five-point algorithm [29] with RANdom SAmple Consensus (RANSAC)-based outlier rejection. We further denote the corresponding rotation obtained by integrating gyroscope measurements, represented in the IMU frame, as $\mathbf{R}^{b_k}_{b_{k+1}}$. The following equation holds for any $k$:

$$
\mathbf{R}^{b_k}_{b_{k+1}} \cdot \mathbf{R}^b_c = \mathbf{R}^b_c \cdot \mathbf{R}^{c_k}_{c_{k+1}}. \tag{4}
$$

With a quaternion representation for rotation, we can write (4) as

$$
\mathbf{q}^{b_k}_{b_{k+1}} \otimes \mathbf{q}^b_c = \mathbf{q}^b_c \otimes \mathbf{q}^{c_k}_{c_{k+1}}
\;\Rightarrow\;
\big(\mathcal{Q}_1\big(\mathbf{q}^{b_k}_{b_{k+1}}\big) - \mathcal{Q}_2\big(\mathbf{q}^{c_k}_{c_{k+1}}\big)\big) \cdot \mathbf{q}^b_c = \mathbf{Q}^{k+1}_k \cdot \mathbf{q}^b_c = \mathbf{0} \tag{5}
$$

where

$$
\mathcal{Q}_1(\mathbf{q}) = \begin{bmatrix} q_w\mathbf{I}_3 + \lfloor\mathbf{q}_{xyz}\rfloor_\times & \mathbf{q}_{xyz} \\ -\mathbf{q}_{xyz}^T & q_w \end{bmatrix},\qquad
\mathcal{Q}_2(\mathbf{q}) = \begin{bmatrix} q_w\mathbf{I}_3 - \lfloor\mathbf{q}_{xyz}\rfloor_\times & \mathbf{q}_{xyz} \\ -\mathbf{q}_{xyz}^T & q_w \end{bmatrix} \tag{6}
$$

are the matrix representations for left and right quaternion multiplication, $\lfloor\mathbf{q}_{xyz}\rfloor_\times$ is the skew-symmetric matrix formed from the first three elements $\mathbf{q}_{xyz}$ of a quaternion, and $\otimes$ is the quaternion multiplication operator.

With multiple incremental rotations between pairs of consecutive images, we are able to construct the overconstrained linear system

$$
\begin{bmatrix} w^1_0\,\mathbf{Q}^1_0 \\ w^2_1\,\mathbf{Q}^2_1 \\ \vdots \\ w^N_{N-1}\,\mathbf{Q}^N_{N-1} \end{bmatrix} \cdot \mathbf{q}^b_c = \mathbf{Q}_N \cdot \mathbf{q}^b_c = \mathbf{0} \tag{7}
$$

where $N$ is the index of the latest frame and keeps growing until the rotation calibration is completed. Note that the incremental rotation measurements obtained from the five-point algorithm contain outliers due to wrong correspondences or numerical errors under degenerate motions. We use the weight $w^{k+1}_k$, derived from a robust norm, for better outlier handling. As the rotation calibration runs with incoming measurements, we are able to use the previously estimated camera–IMU rotation $\hat{\mathbf{R}}^b_c$ as the initial value to weight the residual in a similar fashion to the Huber norm [30]. Specifically, the residual is defined as the angle norm in the angle-axis representation of the residual rotation matrix

$$
r^{k+1}_k = \operatorname{acos}\Big(\big(\operatorname{tr}\big(\hat{\mathbf{R}}^{b\,-1}_c\,\mathbf{R}^{b_k\,-1}_{b_{k+1}}\,\hat{\mathbf{R}}^b_c\,\mathbf{R}^{c_k}_{c_{k+1}}\big) - 1\big)/2\Big). \tag{8}
$$

The weight is a function of the residual

$$
w^{k+1}_k = \begin{cases} 1, & r^{k+1}_k < \text{threshold} \\ \dfrac{\text{threshold}}{r^{k+1}_k}, & \text{otherwise.} \end{cases} \tag{9}
$$

If there are no sufficient features for estimating the camera rotation, $w^{k+1}_k$ is set to zero. The solution to the above linear system can be found as the right unit singular vector corresponding to the smallest singular value of $\mathbf{Q}_N$.

2) Termination Criteria: Successful calibration of the camera–IMU rotation $\mathbf{R}^b_c$ relies on sufficient rotation excitation. Under sufficient rotation, the null space of $\mathbf{Q}_N$ should be rank one. However, under degenerate motions in one or more axes, the null space of $\mathbf{Q}_N$ may be larger than one. Therefore, by checking whether the second smallest singular value of $\mathbf{Q}_N$, $\sigma^{\min 2}_R$, is large enough, we have a good indicator of whether sufficient rotation excitation is achieved. We set a singular value threshold $\sigma_R$. The camera–IMU rotation calibration process terminates if $\sigma^{\min 2}_R > \sigma_R$. A convergence plot can be found in Fig. 6.
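The rotation calibration reduces to stacking (7) and taking an SVD. The sketch below assumes Hamilton quaternions stored as [x, y, z, w]; the threshold value sigma_R is illustrative only, and all names are our own.

```python
import numpy as np

def quat_mult_matrices(q):
    """Left (Q1) and right (Q2) multiplication matrices of (6) for
    a Hamilton quaternion q = [x, y, z, w]."""
    x, y, z, w = q
    v = np.array([x, y, z])
    V = np.array([[0, -z, y], [z, 0, -x], [-y, x, 0]])  # skew(v)
    Q1, Q2 = np.zeros((4, 4)), np.zeros((4, 4))
    Q1[:3, :3], Q2[:3, :3] = w * np.eye(3) + V, w * np.eye(3) - V
    Q1[:3, 3] = Q2[:3, 3] = v
    Q1[3, :3] = Q2[3, :3] = -v
    Q1[3, 3] = Q2[3, 3] = w
    return Q1, Q2

def calibrate_rotation(imu_quats, cam_quats, weights, sigma_R=0.25):
    """Stack the weighted system (7) and solve for q^b_c by SVD.

    imu_quats / cam_quats: incremental rotations q^{b_k}_{b_{k+1}} and
    q^{c_k}_{c_{k+1}}; weights: per-pair w^{k+1}_k from (9).
    Returns (q_bc, converged), where converged applies the
    second-smallest-singular-value test of Section V-A2.
    """
    rows = []
    for qi, qc, w in zip(imu_quats, cam_quats, weights):
        Q1, _ = quat_mult_matrices(qi)
        _, Q2 = quat_mult_matrices(qc)
        rows.append(w * (Q1 - Q2))
    QN = np.vstack(rows)
    _, s, Vt = np.linalg.svd(QN)
    q_bc = Vt[-1] / np.linalg.norm(Vt[-1])  # smallest-sigma singular vector
    return q_bc, s[-2] > sigma_R
```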
B. Linear Initialization of Velocity, Attitude, Feature Depth, and Camera–IMU Translation

Once the camera–IMU rotation is fixed, we can estimate the camera–IMU translation together with an initialization of velocity, attitude, and feature depth, as well as the IMU poses with respect to the initial reference frame, as in (3).

1) Linear Sliding Window Estimator: We use a tightly coupled sliding window formulation (Fig. 3) for incorporating a large number of IMU and camera measurements with constant computational complexity. The initialization is done in the IMU frame, with the full state vector defined as (the transpose is omitted for simplicity of presentation)

$$
\begin{aligned}
\mathcal{X} &= \big[\mathbf{x}_n, \mathbf{x}_{n+1}, \ldots, \mathbf{x}_{n+N}, \mathbf{p}^b_c, \lambda_m, \lambda_{m+1}, \ldots, \lambda_{m+M}\big] \\
\mathbf{x}_k &= \big[\mathbf{p}^{b_0}_{b_k}, \mathbf{v}^{b_k}_{b_k}, \mathbf{g}^{b_k}\big]
\end{aligned} \tag{10}
$$

where $\mathbf{x}_k$ is the $k$th IMU state, the gravity vector $\mathbf{g}^{b_k}$ determines the attitude (roll and pitch angles), $N$ is the number of IMU states in the sliding window, $M$ is the number of features that have sufficient parallax within the sliding window, $n$ and $m$ are the starting indexes of the sliding window, and $\lambda_l$ is the depth of the $l$th feature from its first observation. $\mathbf{p}^{b_0}_{b_0} = [0, 0, 0]$ is preset. Note that we reuse the sensor measurements that we used for the camera–IMU rotation calibration in this linear initialization phase, but with $\mathbf{R}^b_c$ fixed as a constant.
We also directly use the incremental ($\mathbf{R}^{b_k}_{b_{k+1}}$) and relative ($\mathbf{R}^{b_0}_{b_{k+1}}$) rotations obtained from short-term integration of gyroscope measurements. As this linear initialization can usually be done in only a few seconds, using the IMU rotation directly will not cause significant drift.

The linear initialization is done with maximum likelihood estimation, minimizing the sum of the Mahalanobis norms of all measurement errors from the IMU and the monocular camera within the sliding window:

$$
\min_{\mathcal{X}} \Big\{ \|\mathbf{r}_p - \mathbf{H}_p\mathcal{X}\|^2 + \sum_{k\in\mathcal{B}} \big\|\hat{\mathbf{z}}^{b_k}_{b_{k+1}} - \mathbf{H}^{b_k}_{b_{k+1}}\mathcal{X}\big\|^2_{\mathbf{P}^{b_k}_{b_{k+1}}} + \sum_{(l,j)\in\mathcal{C}} \big\|\hat{\mathbf{z}}^{c_j}_l - \mathbf{H}^{c_j}_l\mathcal{X}\big\|^2_{{}^{\lambda}\mathbf{P}^{c_j}_l} \Big\} \tag{11}
$$

where $\mathcal{B}$ is the set of all IMU measurements, $\mathcal{C}$ is the set of all observations between any feature and any camera pose, and $\mathbf{H}^{b_k}_{b_{k+1}}$ and $\mathbf{H}^{c_j}_l$ are the corresponding measurement matrices. $\{\mathbf{r}_p, \mathbf{H}_p\}$ is the (optional) prior, which will be discussed in Section V-B4. Since the incremental and relative rotations are known, (11) can be solved in a noniterative linear fashion.

2) Linear IMU Measurement Model: The linear IMU measurement $\{\hat{\mathbf{z}}^{b_k}_{b_{k+1}}, \mathbf{H}^{b_k}_{b_{k+1}}\}$ between consecutive frames $k$ and $k+1$ is derived from (3), with the exception that all rotations ($\mathbf{R}^{b_0}_{b_k}$ and $\mathbf{R}^{b_k}_{b_{k+1}}$) are considered known. We also introduce the propagation of body frame gravity vectors:

$$
\hat{\mathbf{z}}^{b_k}_{b_{k+1}} =
\begin{bmatrix} \hat{\boldsymbol{\alpha}}^{b_k}_{b_{k+1}} \\ \hat{\boldsymbol{\beta}}^{b_k}_{b_{k+1}} \\ \hat{\mathbf{0}} \end{bmatrix}
= \mathbf{H}^{b_k}_{b_{k+1}}\mathcal{X} + \mathbf{n}^{b_k}_{b_{k+1}}
= \begin{bmatrix}
\mathbf{R}^{b_k}_{b_0}\big(\mathbf{p}^{b_0}_{b_{k+1}} - \mathbf{p}^{b_0}_{b_k}\big) - \mathbf{v}^{b_k}_{b_k}\Delta t + \mathbf{g}^{b_k}\dfrac{\Delta t^2}{2} \\
\mathbf{R}^{b_k}_{b_{k+1}}\mathbf{v}^{b_{k+1}}_{b_{k+1}} - \mathbf{v}^{b_k}_{b_k} + \mathbf{g}^{b_k}\Delta t \\
\mathbf{g}^{b_{k+1}} - \mathbf{R}^{b_{k+1}}_{b_k}\mathbf{g}^{b_k}
\end{bmatrix} \tag{12}
$$

The covariance $\mathbf{P}^{b_k}_{b_{k+1}}$ of the linear IMU measurement has the following form:

$$
\mathbf{P}^{b_k}_{b_{k+1}} = \begin{bmatrix} {}^{\alpha\beta}\mathbf{P}^{b_k}_{b_{k+1}} & \mathbf{0} \\ \mathbf{0} & {}^{g}\mathbf{P}^{b_k}_{b_{k+1}} \end{bmatrix}. \tag{13}
$$

Given known rotations, the joint covariance matrix ${}^{\alpha\beta}\mathbf{P}^{b_k}_{b_{k+1}}$ is independent of the gravity covariance ${}^{g}\mathbf{P}^{b_k}_{b_{k+1}}$. Consider the continuous-time state-space model of the preintegrated IMU measurements derived from the first two equations in (2):

$$
\begin{bmatrix} \dot{\boldsymbol{\alpha}}^{b_k}_t \\ \dot{\boldsymbol{\beta}}^{b_k}_t \end{bmatrix}
= \begin{bmatrix} \mathbf{0} & \mathbf{I} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}
\begin{bmatrix} \boldsymbol{\alpha}^{b_k}_t \\ \boldsymbol{\beta}^{b_k}_t \end{bmatrix}
+ \begin{bmatrix} \mathbf{0} \\ \mathbf{R}^{b_k}_t \end{bmatrix} \mathbf{a}^b_t
= \mathbf{F}_t \cdot \begin{bmatrix} \boldsymbol{\alpha}^{b_k}_t \\ \boldsymbol{\beta}^{b_k}_t \end{bmatrix} + \mathbf{G}_t \cdot \mathbf{a}^b_t. \tag{14}
$$

${}^{\alpha\beta}\mathbf{P}^{b_k}_{b_{k+1}}$ can be calculated recursively using first-order discrete-time propagation within the time interval $[k, k+1]$, with initial covariance ${}^{\alpha\beta}\mathbf{P}^{b_k}_{b_k} = \mathbf{0}$:

$$
{}^{\alpha\beta}\mathbf{P}^{b_k}_{t+\delta t} = (\mathbf{I} + \mathbf{F}_t\,\delta t)\,{}^{\alpha\beta}\mathbf{P}^{b_k}_t\,(\mathbf{I} + \mathbf{F}_t\,\delta t)^T + (\mathbf{G}_t\,\delta t)\,\mathbf{Q}_t\,(\mathbf{G}_t\,\delta t)^T \tag{15}
$$

where $\mathbf{Q}_t$ is the covariance of the additive Gaussian noise of the accelerometer measurements

$$
\hat{\mathbf{a}}^b_t = \mathbf{a}^b_t + \mathbf{n}_{a_t},\qquad \mathbf{n}_{a_t} \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}_t). \tag{16}
$$
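The recursion (15) maps directly to code. Below is a minimal sketch under the model (14), with our own names; Q_acc plays the role of Q_t from (16), and one incremental rotation R^{b_k}_t is supplied per IMU sample.

```python
import numpy as np

def propagate_alpha_beta_cov(R_bk_t, Q_acc, dt, P=None):
    """First-order discrete-time propagation (15) of the 6x6 covariance
    of (alpha, beta) over one image interval.

    R_bk_t: list of 3x3 incremental rotations R^{b_k}_t, one per sample.
    Q_acc: 3x3 accelerometer noise covariance.
    dt: IMU sampling period; P: initial covariance (zero at t = k).
    """
    P = np.zeros((6, 6)) if P is None else P
    F = np.zeros((6, 6))
    F[:3, 3:] = np.eye(3)            # d(alpha)/dt = beta, from (14)
    Phi = np.eye(6) + F * dt         # I + F*dt (F is constant here)
    for R in R_bk_t:
        G = np.zeros((6, 3))
        G[3:, :] = R                 # noise enters beta through R^{b_k}_t
        Gd = G * dt
        P = Phi @ P @ Phi.T + Gd @ Q_acc @ Gd.T
    return P
```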
3) Linear Camera Measurement Model: The linear camera measurement model $\{\hat{\mathbf{z}}^{c_j}_l, \mathbf{H}^{c_j}_l\}$ for the observation of the $l$th feature in the $j$th image is defined as

$$
\hat{\mathbf{z}}^{c_j}_l = \hat{\mathbf{0}} = \mathbf{H}^{c_j}_l\mathcal{X} + {}^{\lambda}\mathbf{n}^{c_j}_l
= \mathbf{M} \cdot f^{b\,-1}_c\!\left( f^{b_0\,-1}_{b_j}\!\left( f^{b_0}_{b_i}\!\left( f^{b}_c\!\left( \lambda_l \begin{bmatrix} u^{c_i}_l \\ v^{c_i}_l \\ 1 \end{bmatrix} \right)\right)\right)\right) \tag{17}
$$

where $\mathbf{M}$ is defined as

$$
\mathbf{M} = \begin{bmatrix} -1 & 0 & \hat{u}^{c_j}_l \\ 0 & -1 & \hat{v}^{c_j}_l \end{bmatrix} \tag{18}
$$

and $[u^{c_i}_l, v^{c_i}_l]$ is the noiseless first observation of the $l$th feature, which happened in the $i$th image, and $[\hat{u}^{c_j}_l, \hat{v}^{c_j}_l]$ is the observation of the same feature in the $j$th image. The function $f^m_n(\cdot)$, which transforms a 3-D point $\mathbf{r}$ from frame $n$ to frame $m$, and its inverse $f^{m\,-1}_n(\cdot)$ are defined as

$$
f^m_n(\mathbf{r}) = \mathbf{R}^m_n \cdot \mathbf{r} + \mathbf{p}^m_n,\qquad
f^{m\,-1}_n(\mathbf{r}) = \mathbf{R}^{m\,-1}_n \cdot \big(\mathbf{r} - \mathbf{p}^m_n\big) \tag{19}
$$

where all rotation matrices $\mathbf{R}^X_Y$ are known quantities. ${}^{\lambda}\mathbf{n}^{c_j}_l$ is the additive Gaussian measurement noise for the linear camera model, with the covariance matrix in the following form:

$$
{}^{\lambda}\mathbf{P}^{c_j}_l = \big(\lambda^{c_j}_l\big)^2\,\mathbf{P}^{c_j}_l \tag{20}
$$

where $\mathbf{P}^{c_j}_l$ is the feature observation noise in the normalized image plane. Although we can only initialize the unknown $\lambda^{c_j}_l$ as the average scene depth, we find in practice that the solution is insensitive to the initial value of $\lambda^{c_j}_l$ as long as it is set to be larger than the actual depth.

4) Solution to the Linear Estimator: The linear cost function (11) can be rearranged into the following form:

$$
(\boldsymbol{\Lambda}_p + \boldsymbol{\Lambda}_B + \boldsymbol{\Lambda}_C)\,\mathcal{X} = \mathbf{b}_p + \mathbf{b}_B + \mathbf{b}_C \tag{21}
$$

where $\{\boldsymbol{\Lambda}_B, \mathbf{b}_B\}$ and $\{\boldsymbol{\Lambda}_C, \mathbf{b}_C\}$ are the information matrices and vectors for the IMU and visual measurements, respectively, and $\{\boldsymbol{\Lambda}_p = \mathbf{H}_p^T\mathbf{H}_p, \mathbf{b}_p = \mathbf{H}_p^T\mathbf{r}_p\}$ is the (optional) prior. Due to the known incremental and relative rotations, the cost is linear with respect to the states, and the system in (21) has a unique solution, even though the prior $\{\boldsymbol{\Lambda}_p, \mathbf{b}_p\}$ is only used to lock the first position ($\mathbf{p}^{b_0}_{b_0} = [0, 0, 0]$). This suggests that our method is able to recover all quantities in the full state vector, including the camera–IMU translation, without any initial guess of those values.

5) Termination Criteria: The covariance matrix (the inverse of the information matrix) $(\boldsymbol{\Lambda}_p + \boldsymbol{\Lambda}_B + \boldsymbol{\Lambda}_C)^{-1}$ naturally tells the uncertainty of the linear initialization estimator. The block that corresponds to the camera–IMU translation in the covariance matrix represents the uncertainty of the calibration parameters. We use the maximum singular value $\sigma^{\max}_p$ of this block as the convergence indicator and terminate the linear initialization process if $\sigma^{\max}_p < \sigma_p$, where $\sigma_p$ is a threshold. A convergence plot can be found in Fig. 6.

As computing the matrix inverse is much slower than solving the linear system (21) using Cholesky decomposition, we run the termination check at a slower rate in another thread. This slightly delays the termination time but does not harm overall performance. After this point, the whole initialization and calibration process is completed.
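A sketch of the solution and termination logic of Sections V-B4 and V-B5, assuming NumPy/SciPy; the translation-block slice trans_idx and the threshold value sigma_p are illustrative. Solving (21) by Cholesky factorization is the fast path; the covariance check, which needs the matrix inverse, is the slow path that the paper runs in a separate thread.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_linear_init(Lam_p, b_p, Lam_B, b_B, Lam_C, b_C,
                      trans_idx=slice(0, 3), sigma_p=0.01):
    """Solve (21) and run the Section V-B5 convergence check (a sketch)."""
    Lam = Lam_p + Lam_B + Lam_C
    b = b_p + b_B + b_C
    X = cho_solve(cho_factor(Lam), b)        # fast path: every update

    # Slow path: covariance of the camera-IMU translation block.
    cov_t = np.linalg.inv(Lam)[trans_idx, trans_idx]
    sigma_max = np.linalg.svd(cov_t, compute_uv=False)[0]
    return X, sigma_max < sigma_p            # converged if below threshold
```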
VI. TIGHTLY COUPLED NONLINEAR OPTIMIZATION WITH CALIBRATION REFINEMENT

After state initialization and obtaining the camera–IMU calibration (Section V), we proceed with a sliding window nonlinear estimator, as illustrated in Fig. 3, for high-accuracy state estimation and calibration refinement. This is an extension of [7] and [8] obtained by including the camera–IMU calibration in the nonlinear optimization.

Since a large number of parameters in the nonlinear optimization share the same physical meaning as those in the linear initialization (Section V), here we introduce a slight abuse of notation by reusing symbols to represent state vectors ($\mathcal{X}$), Jacobian matrices ($\mathbf{H}, \mathbf{F}, \mathbf{G}$), covariance matrices ($\mathbf{P}, \mathbf{Q}$), and information matrices ($\boldsymbol{\Lambda}$).

A. Formulation

The definition of the full state is similar to the linear case, with the exception that the full 6-DOF camera–IMU transformation $\mathbf{x}^b_c$ is included in the state vector. The gravity vector is also replaced with the quaternion $\mathbf{q}^{b_0}_{b_k}$ for joint optimization of IMU translation and rotation (the transpose is again omitted):

$$
\begin{aligned}
\mathcal{X} &= \big[\mathbf{x}_n, \mathbf{x}_{n+1}, \ldots, \mathbf{x}_{n+N}, \mathbf{x}^b_c, \lambda_m, \ldots, \lambda_{m+M}\big] \\
\mathbf{x}_k &= \big[\mathbf{p}^{b_0}_{b_k}, \mathbf{v}^{b_k}_{b_k}, \mathbf{q}^{b_0}_{b_k}\big] \\
\mathbf{x}^b_c &= \big[\mathbf{p}^b_c, \mathbf{q}^b_c\big].
\end{aligned} \tag{22}
$$
We minimize the sum of the Mahalanobis norms of all measurement residuals to obtain a maximum a posteriori estimate:

$$
\min_{\mathcal{X}} \Big\{ \|\mathbf{r}_p - \mathbf{H}_p\mathcal{X}\|^2 + \sum_{k\in\mathcal{B}} \big\|\mathbf{r}_{\mathcal{B}}\big(\hat{\mathbf{z}}^{b_k}_{b_{k+1}}, \mathcal{X}\big)\big\|^2_{\mathbf{P}^{b_k}_{b_{k+1}}} + \sum_{(l,j)\in\mathcal{C}} \big\|\mathbf{r}_{\mathcal{C}}\big(\hat{\mathbf{z}}^{c_j}_l, \mathcal{X}\big)\big\|^2_{\mathbf{P}^{c_j}_l} \Big\} \tag{23}
$$

where $\mathbf{r}_{\mathcal{B}}(\hat{\mathbf{z}}^{b_k}_{b_{k+1}}, \mathcal{X})$ and $\mathbf{r}_{\mathcal{C}}(\hat{\mathbf{z}}^{c_j}_l, \mathcal{X})$ are the measurement residuals for the IMU and the camera, respectively. The corresponding measurement models are defined in Sections VI-B and VI-C.

We use an error-state representation [6] to linearize the nonlinear system (23) and solve it with the Gauss–Newton algorithm, using the Huber norm [30] for robust outlier rejection. The residuals for linear components such as position, velocity, and feature depth can be easily defined as an addition to the latest state estimates:

$$
\mathbf{p} = \hat{\mathbf{p}} + \delta\mathbf{p},\qquad \mathbf{v} = \hat{\mathbf{v}} + \delta\mathbf{v},\qquad \lambda = \hat{\lambda} + \delta\lambda. \tag{24}
$$

The residual for rotation is instead modeled as a perturbation in the tangent space of the rotation manifold. The error quaternion $\delta\mathbf{q}$ is defined as the small difference between the estimated and the true quaternions:

$$
\mathbf{q} = \hat{\mathbf{q}} \otimes \delta\mathbf{q},\qquad \delta\mathbf{q} \approx \begin{bmatrix} \tfrac{1}{2}\delta\boldsymbol{\theta} \\ 1 \end{bmatrix} \tag{25}
$$

where $\otimes$ is the quaternion multiplication operator. The 3-D error vector $\delta\boldsymbol{\theta}$ is the minimal representation of the rotation residual. We also have the equivalent formulation in the form of a rotation matrix, which provides a simple linearization for the rotation residual:

$$
\mathbf{R} \approx \hat{\mathbf{R}} \cdot (\mathbf{I} + \lfloor\delta\boldsymbol{\theta}\rfloor_\times). \tag{26}
$$

The full error-state vector then becomes

$$
\begin{aligned}
\delta\mathcal{X} &= \big[\delta\mathbf{x}_n, \ldots, \delta\mathbf{x}_{n+N}, \delta\mathbf{x}^b_c, \delta\lambda_m, \ldots, \delta\lambda_{m+M}\big] \\
\delta\mathbf{x}_k &= \big[\delta\mathbf{p}^{b_0}_{b_k}, \delta\mathbf{v}^{b_k}_{b_k}, \delta\boldsymbol{\theta}^{b_0}_{b_k}\big] \\
\delta\mathbf{x}^b_c &= \big[\delta\mathbf{p}^b_c, \delta\boldsymbol{\theta}^b_c\big].
\end{aligned} \tag{27}
$$

In each Gauss–Newton iteration, (23) is linearized at the current state estimate $\hat{\mathcal{X}}$ with respect to the error state $\delta\mathcal{X}$:

$$
\min_{\delta\mathcal{X}} \Big\{ \|\mathbf{r}_p - \mathbf{H}_p\mathcal{X}\|^2 + \sum_{k\in\mathcal{B}} \big\|\mathbf{r}_{\mathcal{B}}\big(\hat{\mathbf{z}}^{b_k}_{b_{k+1}}, \hat{\mathcal{X}}\big) + \mathbf{H}^{b_k}_{b_{k+1}}\delta\mathcal{X}\big\|^2_{\mathbf{P}^{b_k}_{b_{k+1}}} + \sum_{(l,j)\in\mathcal{C}} \big\|\mathbf{r}_{\mathcal{C}}\big(\hat{\mathbf{z}}^{c_j}_l, \hat{\mathcal{X}}\big) + \mathbf{H}^{c_j}_l\delta\mathcal{X}\big\|^2_{\mathbf{P}^{c_j}_l} \Big\} \tag{28}
$$

where $\mathbf{H}^{b_k}_{b_{k+1}}$ and $\mathbf{H}^{c_j}_l$ are the Jacobians of the IMU and visual measurements, respectively. Minimizing the linearized cost function (28) is equivalent to solving the following linear system:

$$
(\boldsymbol{\Lambda}_p + \boldsymbol{\Lambda}_B + \boldsymbol{\Lambda}_C)\,\delta\mathcal{X} = \mathbf{b}_p + \mathbf{b}_B + \mathbf{b}_C \tag{29}
$$

where $\boldsymbol{\Lambda}_p$, $\boldsymbol{\Lambda}_B$, and $\boldsymbol{\Lambda}_C$ are the information matrices from the prior, IMU, and visual measurements, respectively. Incremental construction of the prior $\{\boldsymbol{\Lambda}_p = \mathbf{H}_p^T\mathbf{H}_p, \mathbf{b}_p = \mathbf{H}_p^T\mathbf{r}_p\}$ is discussed in Section VII. Note that (29) is different from (21) in the sense that (29) solves for the increments for the nonlinear optimization, while (21) directly recovers the initial values.

The error state is updated as

$$
\hat{\mathcal{X}} = \hat{\mathcal{X}} \oplus \delta\mathcal{X} \tag{30}
$$

where $\oplus$ is the compound operator that takes the form of simple addition for position, velocity, and feature depth as in (24), and of quaternion multiplication for rotations as in (25).
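A sketch of the compound update (30) for a single IMU state, assuming Hamilton quaternions stored as [x, y, z, w]; function names are our own.

```python
import numpy as np

def quat_mult(q1, q2):
    """Hamilton product of quaternions stored as [x, y, z, w]."""
    x1, y1, z1, w1 = q1
    x2, y2, z2, w2 = q2
    return np.array([
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
    ])

def boxplus(p, v, q, dp, dv, dtheta):
    """Apply (30) to one IMU state: additive updates for p and v as in
    (24); quaternion multiplication with the small-angle error
    quaternion of (25) for the rotation."""
    dq = np.append(0.5 * dtheta, 1.0)
    q_new = quat_mult(q, dq)
    q_new /= np.linalg.norm(q_new)      # renormalize after the update
    return p + dp, v + dv, q_new
```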
B. IMU Measurement Model

Following (3), the residual of a preintegrated IMU measurement is

$$
\mathbf{r}_{\mathcal{B}}\big(\hat{\mathbf{z}}^{b_k}_{b_{k+1}}, \mathcal{X}\big) =
\begin{bmatrix} \delta\boldsymbol{\alpha}^{b_k}_{b_{k+1}} \\ \delta\boldsymbol{\beta}^{b_k}_{b_{k+1}} \\ \delta\boldsymbol{\theta}^{b_k}_{b_{k+1}} \end{bmatrix}
= \begin{bmatrix}
\mathbf{R}^{b_k}_{b_0}\big(\mathbf{p}^{b_0}_{b_{k+1}} - \mathbf{p}^{b_0}_{b_k} + \mathbf{g}^{b_0}\Delta t^2/2\big) - \mathbf{v}^{b_k}_{b_k}\Delta t - \hat{\boldsymbol{\alpha}}^{b_k}_{b_{k+1}} \\
\mathbf{R}^{b_k}_{b_0}\big(\mathbf{R}^{b_0}_{b_{k+1}}\mathbf{v}^{b_{k+1}}_{b_{k+1}} + \mathbf{g}^{b_0}\Delta t\big) - \mathbf{v}^{b_k}_{b_k} - \hat{\boldsymbol{\beta}}^{b_k}_{b_{k+1}} \\
2\big[\hat{\mathbf{q}}^{b_k\,-1}_{b_{k+1}} \otimes \mathbf{q}^{b_0\,-1}_{b_k} \otimes \mathbf{q}^{b_0}_{b_{k+1}}\big]_{xyz}
\end{bmatrix} \tag{31}
$$

where $[\mathbf{q}]_{xyz}$ extracts the vector part of the quaternion $\mathbf{q}$, which forms the rotation error vector as in (25), and $[\hat{\boldsymbol{\alpha}}^{b_k}_{b_{k+1}}, \hat{\boldsymbol{\beta}}^{b_k}_{b_{k+1}}, \hat{\mathbf{q}}^{b_k}_{b_{k+1}}]^T$ is the preintegrated IMU measurement obtained following (2), without knowing the initial velocity and attitude, using only noisy accelerometer and gyroscope measurements:

$$
\hat{\mathbf{a}}^b_t = \mathbf{a}^b_t + \mathbf{n}_{a_t},\qquad
\hat{\boldsymbol{\omega}}^b_t = \boldsymbol{\omega}^b_t + \mathbf{n}_{\omega_t},\qquad
\begin{bmatrix}\mathbf{n}_{a_t}\\ \mathbf{n}_{\omega_t}\end{bmatrix} \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}_t). \tag{32}
$$

Taking the derivative of the residual $\mathbf{r}_{\mathcal{B}}(\hat{\mathbf{z}}^{b_k}_{b_{k+1}}, \mathcal{X})$ with respect to the error-state vector $\delta\mathcal{X}$ gives the Jacobian of the IMU measurements $\mathbf{H}^{b_k}_{b_{k+1}}$ in (28).

To obtain the covariance matrix $\mathbf{P}^{b_k}_{b_{k+1}}$ of the preintegrated IMU measurement, we form the linearized continuous-time dynamics of the error state of the IMU measurement using (2) and (26):

$$
\begin{bmatrix} \delta\dot{\boldsymbol{\alpha}}^{b_k}_t \\ \delta\dot{\boldsymbol{\beta}}^{b_k}_t \\ \delta\dot{\boldsymbol{\theta}}^{b_k}_t \end{bmatrix}
= \begin{bmatrix} \mathbf{0} & \mathbf{I} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & -\hat{\mathbf{R}}^{b_k}_t\lfloor\hat{\mathbf{a}}^b_t\rfloor_\times \\ \mathbf{0} & \mathbf{0} & -\lfloor\hat{\boldsymbol{\omega}}^b_t\rfloor_\times \end{bmatrix}
\begin{bmatrix} \delta\boldsymbol{\alpha}^{b_k}_t \\ \delta\boldsymbol{\beta}^{b_k}_t \\ \delta\boldsymbol{\theta}^{b_k}_t \end{bmatrix}
+ \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ -\hat{\mathbf{R}}^{b_k}_t & \mathbf{0} \\ \mathbf{0} & -\mathbf{I} \end{bmatrix}
\begin{bmatrix} \mathbf{n}_{a_t} \\ \mathbf{n}_{\omega_t} \end{bmatrix}
= \mathbf{F}_t \cdot \delta\mathbf{z}^{b_k}_t + \mathbf{G}_t \cdot \mathbf{n}_t. \tag{33}
$$

The covariance matrix $\mathbf{P}^{b_k}_{b_{k+1}}$ can then be calculated by first-order discrete-time propagation within the time interval $[k, k+1]$, with initial covariance $\mathbf{P}^{b_k}_{b_k} = \mathbf{0}$:

$$
\mathbf{P}^{b_k}_{t+\delta t} = (\mathbf{I} + \mathbf{F}_t\,\delta t)\,\mathbf{P}^{b_k}_t\,(\mathbf{I} + \mathbf{F}_t\,\delta t)^T + (\mathbf{G}_t\,\delta t)\,\mathbf{Q}_t\,(\mathbf{G}_t\,\delta t)^T. \tag{34}
$$

Note how this propagation (33) differs from the dynamics of the linear IMU model (14): here we also consider the error propagation in the rotation component, as well as the cross correlation between rotation and translation.
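To illustrate (31), the sketch below evaluates the preintegrated IMU residual with rotations passed as 3x3 matrices instead of quaternions; the rotation error is extracted with the small-angle (vee of the antisymmetric part) approximation, and all names are our own.

```python
import numpy as np

def imu_residual(p_k, v_k, R_b0_bk, p_k1, v_k1, R_b0_bk1,
                 g_b0, dt, alpha_hat, beta_hat, R_hat):
    """Evaluate (31) for one frame pair; rotations as 3x3 matrices (sketch)."""
    R_bk_b0 = R_b0_bk.T                       # R^{b_k}_{b_0}
    r_alpha = R_bk_b0 @ (p_k1 - p_k + 0.5 * g_b0 * dt**2) \
        - v_k * dt - alpha_hat
    r_beta = R_bk_b0 @ (R_b0_bk1 @ v_k1 + g_b0 * dt) - v_k - beta_hat
    # Rotation residual: dR = R_hat^{-1} R^{b_k}_{b_{k+1}}, small-angle vee.
    dR = R_hat.T @ (R_bk_b0 @ R_b0_bk1)
    r_theta = 0.5 * np.array([dR[2, 1] - dR[1, 2],
                              dR[0, 2] - dR[2, 0],
                              dR[1, 0] - dR[0, 1]])
    return np.concatenate([r_alpha, r_beta, r_theta])
```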
C. Camera Measurement Model

The camera measurement model can be formulated similarly to the linear initialization (Section V-B), but with the residual being the reprojection error with covariance matrix $\mathbf{P}^{c_j}_l$, owing to the feature depth initialization presented in Section V-B:

$$
\mathbf{r}_{\mathcal{C}}\big(\hat{\mathbf{z}}^{c_j}_l, \mathcal{X}\big) =
\begin{bmatrix} f^{c_j}_{l,x}/f^{c_j}_{l,z} - \hat{u}^{c_j}_l \\ f^{c_j}_{l,y}/f^{c_j}_{l,z} - \hat{v}^{c_j}_l \end{bmatrix},\qquad
\mathbf{f}^{c_j}_l = \big[f^{c_j}_{l,x},\, f^{c_j}_{l,y},\, f^{c_j}_{l,z}\big]^T
= f^{b\,-1}_c\!\left( f^{b_0\,-1}_{b_j}\!\left( f^{b_0}_{b_i}\!\left( f^{b}_c\!\left( \lambda_l \begin{bmatrix} u^{c_i}_l \\ v^{c_i}_l \\ 1 \end{bmatrix} \right)\right)\right)\right) \tag{35}
$$

where $f^b_c(\cdot)$ and its inverse are defined in (19). The Jacobian of the visual measurement $\mathbf{H}^{c_j}_l$ in (28) is obtained by taking the derivative of the residual $\mathbf{r}_{\mathcal{C}}(\hat{\mathbf{z}}^{c_j}_l, \mathcal{X})$ with respect to the error-state vector $\delta\mathcal{X}$.

VII. TWO-WAY MARGINALIZATION

In order to bound the computational complexity of graph optimization-based methods, marginalization is usually used. We selectively marginalize out IMU states $\mathbf{x}_k$ and features $\lambda_l$ from the sliding window for both the linear (Section V) and nonlinear (Section VI) optimization. Due to the well-known acceleration excitation requirement [13], [14] for scale observability in monocular VINS, a naive strategy that always marginalizes the oldest state may result in an unobservable scale under degenerate motions, such as hovering or constant-velocity motion. To avoid this, we employ the two-way marginalization scheme originally proposed in [8] and [9] to selectively remove recent or old IMU states based on a scene parallax test.

We add a new IMU state to the sliding window if the time between two IMU states $\Delta t$ is larger than a threshold. We do not have a notion of spatial keyframes as in vision-only approaches [17], due to the requirement of bounding the uncertainty for every preintegrated IMU measurement. We then select whether to remove the oldest or the most recent IMU state based on a parallax test. As shown in Fig. 4, a recent state is identified as fixed only if it has sufficient parallax to the previous fixed state; otherwise, it will be removed in the next marginalization. We refer readers to [9] for details of this selection process.

Fig. 4. Example with seven IMU states $\mathbf{x}_k$ and three features $f_l$. (a) Structure of the full state before, during, and after marginalizing a recent IMU state $\mathbf{x}_5$ after a newer IMU state $\mathbf{x}_6$ is added (small parallax: hovering or slow motion). Visual and IMU measurements related to IMU state $\mathbf{x}_5$ (denoted by transparent edges) are summarized into a new prior $\boldsymbol{\Lambda}_p$ in (36). Also, feature $f_2$ is removed because it has no valid observation (under the assumption that the first observation is noiseless, as described in Section V-B3). (b) Similar marginalization process for the oldest IMU state (large parallax: fast motion), where IMU state $\mathbf{x}_0$ and feature $f_0$ are removed and all involved measurements are used to construct a new prior.

We construct a new prior based on all measurements related to the removed states:

$$
\boldsymbol{\Lambda}_p = \boldsymbol{\Lambda}_p + \sum_{k\in\mathcal{B}^-} \mathbf{H}^{b_k\,T}_{b_{k+1}}\big(\mathbf{P}^{b_k}_{b_{k+1}}\big)^{-1}\mathbf{H}^{b_k}_{b_{k+1}} + \sum_{(l,j)\in\mathcal{C}^-} \mathbf{H}^{c_j\,T}_l\big(\mathbf{P}^{c_j}_l\big)^{-1}\mathbf{H}^{c_j}_l \tag{36}
$$

where $\mathcal{B}^-$ and $\mathcal{C}^-$ are the sets of removed IMU and camera measurements, respectively. The marginalization is carried out using the Schur complement [16].
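The prior construction and state removal can be sketched as a dense Schur complement over the information form; real implementations exploit the sparsity of the problem, and the names here are our own.

```python
import numpy as np

def marginalize(Lam, b, keep, drop):
    """Remove states via the Schur complement to form a new prior.

    Lam, b: information matrix/vector over all states.
    keep, drop: integer index arrays for retained and removed states.
    Returns the prior information (Lam_p, b_p) over the kept states.
    """
    A = Lam[np.ix_(keep, keep)]
    B = Lam[np.ix_(keep, drop)]
    D = Lam[np.ix_(drop, drop)]
    D_inv = np.linalg.inv(D)            # small block: dropped states only
    Lam_p = A - B @ D_inv @ B.T
    b_p = b[keep] - B @ D_inv @ b[drop]
    return Lam_p, b_p
```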
Intuitively, two-way marginalization keeps removing recent IMU states if the platform has little or no motion. Keeping older IMU states in this case preserves the acceleration information that is necessary to recover the visual scale. The camera–IMU calibration also benefits from this marginalization scheme, because it naturally accumulates all measurements to refine the calibration parameters.

VIII. EXPERIMENTAL RESULTS

A. Implementation Details

As shown in Fig. 1, our monocular VINS sensor suite consists of an mvBlueFOX-MLC200w grayscale camera with