2016 IEEE Conference on Computer Vision and Pattern Recognition

Structure-from-Motion Revisited

Johannes L. Schönberger¹,²*, Jan-Michael Frahm¹
¹University of North Carolina at Chapel Hill
²Eidgenössische Technische Hochschule Zürich
jsch@inf.ethz.ch, jmf@cs.unc.edu

Abstract

Incremental Structure-from-Motion is a prevalent strategy for 3D reconstruction from unordered image collections. While incremental reconstruction systems have tremendously advanced in all regards, robustness, accuracy, completeness, and scalability remain the key problems towards building a truly general-purpose pipeline. We propose a new SfM technique that improves upon the state of the art to make a further step towards this ultimate goal. The full reconstruction pipeline is released to the public as an open-source implementation.

1. Introduction

Structure-from-Motion (SfM) from unordered images has seen tremendous evolution over the years. The early self-calibrating metric reconstruction systems [42, 6, 19, 16, 46] served as the foundation for the first systems on unordered Internet photo collections [47, 53] and urban scenes [45]. Inspired by these works, increasingly large-scale reconstruction systems have been developed for hundreds of thousands [1] and millions [20, 62, 51, 50] to recently a hundred million Internet photos [30]. A variety of SfM strategies have been proposed, including incremental [53, 1, 20, 62], hierarchical [23], and global approaches [14, 61, 56]. Arguably, incremental SfM is the most popular strategy for reconstruction of unordered photo collections. Despite its widespread use, we have still not accomplished the design of a truly general-purpose SfM system. While the existing systems have advanced the state of the art tremendously, robustness, accuracy, completeness, and scalability remain the key problems in incremental SfM that prevent its use as a general-purpose method. In this paper, we propose a new SfM algorithm to approach this ultimate goal.
The new method is evaluated on a variety of challenging datasets, and the code is contributed to the research community as an open-source implementation named COLMAP, available at https://github.com/colmap/colmap.

∗This work was done at the University of North Carolina at Chapel Hill.

Figure 1. Result of Rome with 21K registered out of 75K images.

2. Review of Structure-from-Motion

SfM is the process of reconstructing 3D structure from its projections into a series of images taken from different viewpoints. Incremental SfM (denoted as SfM in this paper) is a sequential processing pipeline with an iterative reconstruction component (Fig. 2). It commonly starts with feature extraction and matching, followed by geometric verification. The resulting scene graph serves as the foundation for the reconstruction stage, which seeds the model with a carefully selected two-view reconstruction, before incrementally registering new images, triangulating scene points, filtering outliers, and refining the reconstruction using bundle adjustment (BA). The following sections elaborate on this process, define the notation used throughout the paper, and introduce related work.

2.1. Correspondence Search

The first stage is correspondence search, which finds scene overlap in the input images I = {I_i | i = 1...N_I} and identifies projections of the same points in overlapping images. The output is a set of geometrically verified image pairs C̄ and a graph of image projections for each point.

Feature Extraction. For each image I_i, SfM detects sets F_i = {(x_j, f_j) | j = 1...N_{F_i}} of local features at location x_j ∈ R^2 represented by an appearance descriptor f_j. The features should be invariant under radiometric and geometric changes so that SfM can uniquely recognize them in multiple images [41]. SIFT [39], its derivatives [59], and more recently learned features [9] are the gold standard in terms of robustness.
Alternatively, binary features provide better efficiency at the cost of reduced robustness [29].
Figure 2. Incremental Structure-from-Motion pipeline: images feed correspondence search (feature extraction, matching, geometric verification), whose output seeds the iterative reconstruction stage (initialization, image registration, triangulation, bundle adjustment, outlier filtering).

Matching. Next, SfM discovers images that see the same scene part by leveraging the features F_i as an appearance description of the images. The naïve approach tests every image pair for scene overlap; it searches for feature correspondences by finding the most similar feature in image I_a for every feature in image I_b, using a similarity metric comparing the appearance f_j of the features. This approach has computational complexity O(N_I² N_{F_i}²) and is prohibitive for large image collections. A variety of approaches tackle the problem of scalable and efficient matching [1, 20, 37, 62, 28, 49, 30]. The output is a set of potentially overlapping image pairs C = {{I_a, I_b} | I_a, I_b ∈ I, a < b} and their associated feature correspondences M_ab ∈ F_a × F_b.

Geometric Verification. The third stage verifies the potentially overlapping image pairs C. Since matching is based solely on appearance, it is not guaranteed that corresponding features actually map to the same scene point. Therefore, SfM verifies the matches by trying to estimate a transformation that maps feature points between images using projective geometry. Depending on the spatial configuration of an image pair, different mappings describe their geometric relation. A homography H describes the transformation of a purely rotating camera or of a moving camera capturing a planar scene [26]. Epipolar geometry [26] describes the relation for a moving camera through the essential matrix E (calibrated) or the fundamental matrix F (uncalibrated), and can be extended to three views using the trifocal tensor [26]. If a valid transformation maps a sufficient number of features between the images, they are considered geometrically verified.
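To make the naïve matching step concrete, here is a minimal sketch, assuming squared-L2 descriptor distance and Lowe's ratio test (the 0.8 threshold is a common SIFT default, not a value prescribed by this paper):

```python
import numpy as np

def match_features(desc_a, desc_b, ratio=0.8):
    """Brute-force matching of descriptor sets desc_a (Na, D) and
    desc_b (Nb, D); returns a list of (index_in_a, index_in_b) pairs."""
    # All pairwise squared L2 distances, shape (Na, Nb).
    d2 = ((desc_a[:, None, :] - desc_b[None, :, :]) ** 2).sum(axis=2)
    matches = []
    for i in range(d2.shape[0]):
        order = np.argsort(d2[i])
        best, second = order[0], order[1]
        # Ratio test: accept only if clearly better than the runner-up.
        if d2[i, best] < (ratio ** 2) * d2[i, second]:
            matches.append((i, int(best)))
    return matches
```

Running this for every image pair is precisely the O(N_I² N_{F_i}²) procedure described above; scalable pipelines replace the exhaustive loop with approximate search.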
Since the correspondences from matching are often outlier-contaminated, robust estimation techniques, such as RANSAC [18], are required. The output of this stage is a set of geometrically verified image pairs C̄, their associated inlier correspondences M̄_ab, and optionally a description of their geometric relation G_ab. To decide on the appropriate relation, decision criteria like GRIC [57] or methods like QDEGSAC [21] can be used. The output of this stage is a so-called scene graph [54, 37, 48, 30] with images as nodes and verified pairs of images as edges.

2.2. Incremental Reconstruction

The input to the reconstruction stage is the scene graph. The outputs are pose estimates P = {P_c ∈ SE(3) | c = 1...N_P} for registered images and the reconstructed scene structure as a set of points X = {X_k ∈ R^3 | k = 1...N_X}.

Initialization. SfM initializes the model with a carefully selected two-view reconstruction [7, 52]. Choosing a suitable initial pair is critical, since the reconstruction may never recover from a bad initialization. Moreover, the robustness, accuracy, and performance of the reconstruction depend on the seed location of the incremental process. Initializing from a dense location in the image graph with many overlapping cameras typically results in a more robust and accurate reconstruction due to increased redundancy. In contrast, initializing from a sparser location results in lower runtimes, since BAs deal with overall sparser problems accumulated over the reconstruction process.

Image Registration. Starting from a metric reconstruction, new images can be registered to the current model by solving the Perspective-n-Point (PnP) problem [18] using feature correspondences to triangulated points in already registered images (2D-3D correspondences). The PnP problem involves estimating the pose P_c and, for uncalibrated cameras, the intrinsic parameters. The set P is thus extended by the pose P_c of the newly registered image.
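As an illustration of 2D-3D registration, a linear DLT resectioning sketch; this is a simpler stand-in for the minimal PnP solvers cited in the text (the noise-free setting and function names are ours, and a robust pipeline would wrap such a solver in RANSAC):

```python
import numpy as np

def resection_dlt(points3d, points2d):
    """Estimate a 3x4 projection matrix P from >= 6 outlier-free
    2D-3D correspondences via the Direct Linear Transform."""
    A = []
    for (X, Y, Z), (u, v) in zip(points3d, points2d):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    # P (up to scale) is the null vector of A: the right singular
    # vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 4)

def project(P, X):
    """Pinhole projection of a 3D point with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]
```

The scale (and sign) ambiguity of the recovered P cancels in the projective division, so reprojections are unaffected.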
Since the 2D-3D correspondences are often outlier-contaminated, the pose for calibrated cameras is usually estimated using RANSAC and a minimal pose solver, e.g. [22, 34]. For uncalibrated cameras, various minimal solvers, e.g. [10], or sampling-based approaches, e.g. [31], exist. We propose a novel robust next best image selection method for accurate pose estimation and reliable triangulation in Sec. 4.2.

Triangulation. A newly registered image must observe existing scene points. In addition, it may also increase scene coverage by extending the set of points X through triangulation. A new scene point X_k can be triangulated and added to X as soon as at least one more image, also covering the new scene part but from a different viewpoint, is registered. Triangulation is a crucial step in SfM, as it increases the stability of the existing model through redundancy [58] and it enables registration of new images by providing additional 2D-3D correspondences. A large number of methods for multi-view triangulation exist [27, 5, 25, 35, 40, 3, 44, 32]. These methods suffer from limited robustness or high computational cost for use in SfM, which we address by proposing a robust and efficient triangulation method in Sec. 4.3.

Bundle Adjustment. Image registration and triangulation are separate procedures, even though their products are highly correlated – uncertainties in the camera pose propagate to triangulated points and vice versa, and additional triangulations may improve the initial camera pose through increased redundancy. Without further refinement, SfM usually drifts quickly to a non-recoverable state. BA [58] is the joint non-linear refinement of camera parameters P_c and point parameters X_k that minimizes the reprojection error

    E = Σ_j ρ_j( ‖π(P_c, X_k) − x_j‖₂² )    (1)

using a function π that projects scene points to image space and a loss function ρ_j to potentially down-weight outliers.
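A minimal evaluation of the objective in Eq. (1), shown with the Cauchy robustifier that Sec. 4.4 adopts for local BA; the pinhole π and the data layout are illustrative assumptions:

```python
import numpy as np

def cauchy_loss(r2, scale=1.0):
    """Cauchy robust loss on a squared residual; down-weights outliers."""
    return scale ** 2 * np.log1p(r2 / scale ** 2)

def reprojection_cost(poses, points, observations):
    """Evaluate the BA objective of Eq. (1).

    poses: list of 3x4 world-to-camera projection matrices P_c.
    points: (N, 3) array of scene points X_k.
    observations: list of (camera_index, point_index, observed_xy).
    """
    E = 0.0
    for c, k, x_obs in observations:
        x_proj = poses[c] @ np.append(points[k], 1.0)
        x_proj = x_proj[:2] / x_proj[2]  # pinhole projection pi(P_c, X_k)
        E += cauchy_loss(np.sum((x_proj - x_obs) ** 2))
    return E
```

A BA solver minimizes this cost jointly over all poses and points; the sketch only evaluates it for a given state.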
Levenberg-Marquardt [58, 26] is the method of choice for solving BA problems. The special structure of parameters in BA problems motivates the Schur complement trick [8], in which one first solves the reduced camera system and then updates the points via back-substitution. This scheme is commonly more efficient, since the number of cameras is usually smaller than the number of points. There are two choices for solving the system: exact and inexact step algorithms. Exact methods solve the system by storing and factoring it as a dense or sparse matrix [13, 38] with a space complexity of O(N_P²) and a time complexity of O(N_P³). Inexact methods approximately solve the system, usually by using an iterative solver, e.g. preconditioned conjugate gradients (PCG), which has O(N_P) time and space complexity [4, 63]. Direct algorithms are the method of choice for up to a few hundred cameras, but they are too expensive in large-scale settings. While sparse direct methods reduce the complexity by a large factor for sparse problems, they are prohibitive for large unstructured photo collections due to typically much denser connectivity graphs [54, 4]. In this case, indirect algorithms are the method of choice. Especially for Internet photos, BA spends significant time on optimizing many near-duplicate images. In Sec. 4.5, we propose a method to identify and parameterize highly overlapping images for efficient BA of dense collections.

3. Challenges

While the current state-of-the-art SfM algorithms can handle the diverse and complex distribution of images in large-scale Internet photo collections, they frequently fail to produce fully satisfactory results in terms of completeness and robustness. Oftentimes, the systems fail to register a large fraction of images that empirically should be registrable [20, 30], or the systems produce broken models due to mis-registrations or drift.
First, this may be caused by correspondence search producing an incomplete scene graph, e.g. due to matching approximations, and therefore providing neither the necessary connectivity for complete models nor sufficient redundancy for reliable estimation. Second, this may be caused by the reconstruction stage failing to register images due to missing or inaccurate scene structure – image registration and triangulation have a symbiotic relationship in that images can only be registered to existing scene structure and scene structure can only be triangulated from registered images [64]. Maximizing the accuracy and completeness of both at each step during the incremental reconstruction is a key challenge in SfM. In this paper, we address this challenge and significantly improve results over the current state of the art (Sec. 5) in terms of completeness, robustness, and accuracy while boosting efficiency.

4. Contributions

This section presents a new algorithm that improves on the main challenges in SfM. First, we introduce a geometric verification strategy that augments the scene graph with information that subsequently improves the robustness of the initialization and triangulation components. Second, a next best view selection method maximizing the robustness and accuracy of the incremental reconstruction process. Third, a robust triangulation method that produces significantly more complete scene structure than the state of the art at reduced computational cost. Fourth, an iterative BA, re-triangulation, and outlier filtering strategy that significantly improves completeness and accuracy by mitigating drift effects. Finally, a more efficient BA parameterization for dense photo collections through redundant view mining. This results in a system that clearly outperforms the current state of the art in terms of robustness and completeness while preserving its efficiency.
We contrast our contributions to the current state-of-the-art systems Bundler (open-source) [52] and VisualSFM (closed-source) [62]. The proposed system is released as an open-source implementation.

4.1. Scene Graph Augmentation

We propose a multi-model geometric verification strategy to augment the scene graph with the appropriate geometric relation. First, we estimate a fundamental matrix. If at least N_F inliers are found, we consider the image pair as geometrically verified. Next, we classify the transformation by determining the number of homography inliers N_H for the same image pair. To approximate model selection methods like GRIC, we assume a moving camera in a general scene if N_H / N_F < ε_HF. For calibrated images, we also estimate an essential matrix and its number of inliers N_E. If N_E / N_F > ε_EF, we assume correct calibration. In case of correct calibration and N_H / N_F ≥ ε_HF, we decompose the essential matrix, triangulate points from inlier correspondences, and determine the median triangulation angle α_m. Using α_m, we distinguish between the case of pure rotation (panoramic) and planar scenes (planar). Furthermore, a frequent problem in Internet photos is watermarks, timestamps, and frames (WTFs) [60, 30] that incorrectly link images of different landmarks. We detect such image pairs by estimating a similarity transformation with N_S inliers at the image borders. Any image pair with N_S / N_F > ε_SF ∨ N_S / N_E > ε_SE is considered a WTF and not inserted into the scene graph. For valid pairs, we label the scene graph with the model type (general, panoramic, planar) alongside the inliers of the model with maximum support N_H, N_E, or N_F. The model type is leveraged to seed the reconstruction only from non-panoramic and preferably calibrated image pairs. An already augmented scene graph makes it possible to efficiently find an optimal initialization for a robust reconstruction process. In addition, we do not triangulate from panoramic image pairs, to avoid degenerate points and thereby improve the robustness of triangulation and subsequent image registrations.

Figure 3. Scores for different numbers of points (left and right) with different distributions (top and bottom) in the image for L = 3.

4.2. Next Best View Selection

Next best view planning has been studied in the fields of computer vision, photogrammetry, and robotics [12]. Choosing the next best view in robust SfM aims to minimize the reconstruction error [17, 24]. Here, we propose an efficient next best view strategy following an uncertainty-driven approach that maximizes reconstruction robustness. Choosing the next best view is critical, as every decision impacts the remaining reconstruction. A single bad decision may lead to a cascade of camera mis-registrations and faulty triangulations. In addition, choosing the next best view greatly impacts both the quality of pose estimates and the completeness and accuracy of triangulation. An accurate pose estimate is essential for robust SfM, as point triangulations may fail if the pose is inaccurate. The decision about choosing the next best view is challenging, since for Internet photo collections there is usually no a priori information about scene coverage and camera parameters, and therefore the decision is based entirely on information derived from appearance [17], two-view correspondences, and the incrementally reconstructed scene [53, 24].

A popular strategy is to choose the image that sees the most triangulated points [52] with the aim of minimizing the uncertainty in camera resection. Haner et al. [24] propose an uncertainty-driven approach that minimizes the reconstruction error. Usually, the camera that sees the largest number of triangulated points is chosen, except when the configuration of observations is not well-conditioned. To this end, Lepetit et al.
[34] experimentally show that the accuracy of the camera pose using PnP depends on the number of observations and their distribution in the image. For Internet photos, the standard PnP problem is extended to the estimation of intrinsic parameters in the case of missing or inaccurate prior calibration. A large number of 2D-3D correspondences provides this estimation with redundancy [34], while a uniform distribution of points avoids bad configurations and enables reliable estimation of intrinsics [41].

The candidates for the next best view are the not yet registered images that see at least N_t > 0 triangulated points. Keeping track of this statistic can be efficiently implemented using a graph of feature tracks. For Internet datasets, this graph can be very dense, since many images may see the same structure. Hence, there are many candidate views to choose from at each step in the reconstruction. Exhaustive covariance propagation as proposed by Haner et al. is not feasible, as the covariance would need to be computed and analyzed for each candidate at each step. Our proposed method approximates their uncertainty-driven approach using an efficient multi-resolution analysis.

We must simultaneously keep track of the number of visible points and their distribution in each candidate image. More visible points and a more uniform distribution of these points should result in a higher score S [31], such that images with a better-conditioned configuration of visible points are registered first. To achieve this goal, we discretize the image into a fixed-size grid with K_l bins in both dimensions. Each cell takes one of two states: empty and full. Whenever a point within an empty cell becomes visible during the reconstruction, the cell's state changes to full and the score S_i of the image is increased by a weight w_l. With this scheme, we quantify the number of visible points.
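The grid-based score, together with its multi-resolution extension described next (K_l = 2^l, w_l = K_l²), can be sketched as follows; for brevity the score is recomputed from scratch rather than updated online as in the actual data structure:

```python
def view_score(points, L=3):
    """Multi-resolution view-selection score (cf. Sec. 4.2).

    points: iterable of (x, y) visible-point locations normalized to
    [0, 1)^2. At each pyramid level l, the image is split into
    K_l = 2^l bins per dimension; every distinct occupied cell
    contributes the level weight w_l = K_l^2 to the score.
    """
    score = 0
    for l in range(1, L + 1):
        K = 2 ** l
        # Occupied cells at this resolution; each counts only once.
        cells = {(int(x * K), int(y * K)) for x, y in points}
        score += (K ** 2) * len(cells)  # w_l = K_l^2 per occupied cell
    return score
```

Because a cell contributes only once per level, a uniform spread of points occupies more cells and scores higher than the same number of points clustered in one image region.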
Since cells only contribute to the overall score once, we favor a more uniform distribution over the case when the points are clustered in one part of the image (i.e. only a few cells contain all visible points). However, if the number of visible points is N_t ≪ K_l², this scheme may not capture the distribution of points well, as every point is likely to fall into a separate cell. Consequently, we extend the previously described approach to a multi-resolution pyramid with l = 1...L levels by partitioning the image using higher resolutions K_l = 2^l at each successive level. The score is accumulated over all levels with a resolution-dependent weight w_l = K_l². This data structure and its score can be efficiently updated online. Fig. 3 shows scores for different configurations, and Sec. 5 demonstrates improved reconstruction robustness and accuracy using this strategy.

4.3. Robust and Efficient Triangulation

Especially for sparsely matched image collections, exploiting transitive correspondences boosts triangulation completeness and accuracy, and hence improves subsequent image registrations. Approximate matching techniques usually favor image pairs similar in appearance, and as a result two-view correspondences often stem from image pairs with a small baseline. Leveraging transitivity establishes correspondences between images with larger baselines and thus enables more accurate triangulation. Hence, we form feature tracks by concatenating two-view correspondences.

A variety of approaches have been proposed for multi-view triangulation from noisy image observations [27, 40, 5]. While some of the proposed methods are robust to a certain degree of outlier contamination [25, 35, 3, 44, 32],
to the best of our knowledge, none of the approaches can handle the high outlier ratio often present in feature tracks (Fig. 6). We refer to Kang et al. [32] for a detailed overview of existing multi-view triangulation methods. In this section, we propose an efficient, sampling-based triangulation method that can robustly estimate all points within an outlier-contaminated feature track.

Feature tracks often contain a large number of outliers due to erroneous two-view verification of ambiguous matches along the epipolar line. A single mismatch merges the tracks of two or more independent points. For example, falsely merging four feature tracks of equal length results in an outlier ratio of 75%. In addition, inaccurate camera poses may invalidate track elements due to high reprojection errors. Hence, for robust triangulation, it is necessary to find a consensus set of track elements before performing a refinement using multiple views. Moreover, to recover the potentially multiple points of a feature track from a faulty merge, a recursive triangulation scheme is necessary.

Bundler samples all pairwise combinations of track elements, performs two-view triangulation, and then checks if at least one solution has a sufficient triangulation angle. If a well-conditioned solution is found, multi-view triangulation is performed on the whole track, and it is accepted if all observations pass the cheirality constraint [26]. This method is not robust to outliers, as it is not possible to recover independent points merged into one track. Also, it has significant computational cost due to exhaustive pairwise triangulation. Our method overcomes both limitations.

To handle arbitrary levels of outlier contamination, we formulate the problem of multi-view triangulation using RANSAC. We consider the feature track T = {T_n | n = 1...N_T} as a set of measurements with an a priori unknown ratio ε of inliers.
A measurement T_n consists of the normalized image observation x̄_n ∈ R^2 and the corresponding camera pose P_n ∈ SE(3) defining the projection from world to camera frame P = [R^T | −R^T t] with R ∈ SO(3) and t ∈ R^3. Our objective is to maximize the support of measurements conforming with a well-conditioned two-view triangulation

    X_ab ∼ τ(x̄_a, x̄_b, P_a, P_b)  with  a ≠ b,    (2)

where τ is any chosen triangulation method (in our case the DLT method [26]) and X_ab is the triangulated point. Note that we do not triangulate from panoramic image pairs (Sec. 4.1) to avoid erroneous triangulation angles due to inaccurate pose estimates. A well-conditioned model satisfies two constraints. First, a sufficient triangulation angle α

    cos α = (t_a − X_ab) / ‖t_a − X_ab‖₂ · (t_b − X_ab) / ‖t_b − X_ab‖₂.    (3)

Second, positive depths d_a and d_b w.r.t. the views P_a and P_b (cheirality constraint), with the depth being defined as

    d = [p_31  p_32  p_33  p_34] · [X_ab^T  1]^T,    (4)

where p_mn denotes the element in row m and column n of P. A measurement T_n is considered to conform with the model if it has positive depth d_n and if its reprojection error

    e_n = ‖ x̄_n − [x/z, y/z]^T ‖₂  with  [x  y  z]^T = P_n [X_ab^T  1]^T    (5)

is smaller than a certain threshold t. RANSAC maximizes the support in an iterative manner, and it generally samples the minimal set of size two uniformly at random. However, since it is likely to sample the same minimal set multiple times for small N_T, we define our random sampler to only generate unique samples. To ensure with confidence η that at least one outlier-free minimal set has been sampled, RANSAC must run for at least K iterations. Since the a priori inlier ratio ε is unknown, we set it to a small initial value ε_0 and adapt K whenever we find a larger consensus set (adaptive stopping criterion). Because a feature track may contain multiple independent points, we run this procedure recursively by removing the consensus set from the remaining measurements.
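The scheme above can be sketched as follows. This is a simplified sketch: poses are passed directly as 3x4 world-to-camera matrices, τ is the DLT as in the paper, but the triangulation-angle test is omitted and a plain uniform sampler replaces the unique-sample variant; all parameter values are illustrative.

```python
import numpy as np

def triangulate_dlt(x_a, x_b, P_a, P_b):
    """Two-view DLT triangulation from normalized observations (x, y)."""
    A = np.array([
        x_a[0] * P_a[2] - P_a[0],
        x_a[1] * P_a[2] - P_a[1],
        x_b[0] * P_b[2] - P_b[0],
        x_b[1] * P_b[2] - P_b[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]
    if abs(X_h[3]) < 1e-12:  # degenerate configuration
        return np.full(3, np.inf)
    return X_h[:3] / X_h[3]

def reproj_error(x, P, X):
    p = P @ np.append(X, 1.0)
    if not p[2] > 0:  # cheirality: point must lie in front of the camera
        return np.inf
    return np.linalg.norm(x - p[:2] / p[2])

def ransac_triangulate(track, thresh=0.01, eta=0.999, max_iters=100, seed=0):
    """Recursive RANSAC triangulation of a feature track (cf. Sec. 4.3).

    track: list of (observation_xy, P) with P a 3x4 world-to-camera
    matrix. Returns a list of (point, inlier_indices) pairs, one per
    recovered independent point.
    """
    rng = np.random.default_rng(seed)
    remaining = list(range(len(track)))
    points = []
    while len(remaining) >= 3:  # recursion stops below three elements
        best_in, best_X = [], None
        K, k = max_iters, 0
        while k < K:
            a, b = rng.choice(remaining, size=2, replace=False)
            X = triangulate_dlt(track[a][0], track[b][0],
                                track[a][1], track[b][1])
            inliers = [n for n in remaining
                       if reproj_error(track[n][0], track[n][1], X) < thresh]
            if len(inliers) > len(best_in):
                best_in, best_X = inliers, X
                eps = len(best_in) / len(remaining)
                if eps < 1.0:  # adaptive stopping criterion
                    K = min(max_iters,
                            np.ceil(np.log(1 - eta) / np.log(1 - eps ** 2)))
                else:
                    K = 0
            k += 1
        if len(best_in) < 3:
            break
        points.append((best_X, best_in))
        # Recurse on the measurements outside the consensus set.
        remaining = [n for n in remaining if n not in best_in]
    return points
```

On a track that falsely merges two independent points, the recursion peels off one consensus set per point instead of discarding the whole track.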
The recursion stops if the size of the latest consensus set is smaller than three. The evaluations in Sec. 5 demonstrate increased triangulation completeness at reduced computational cost for the proposed method.

4.4. Bundle Adjustment

To mitigate accumulated errors, we perform BA after image registration and triangulation. Usually, there is no need to perform global BA after each step, since incremental SfM only affects the model locally. Hence, we perform local BA on the set of most-connected images after each image registration. Analogous to VisualSFM, we perform global BA only after growing the model by a certain percentage, resulting in an amortized linear runtime of SfM.

Parameterization. To account for potential outliers, we employ the Cauchy function as the robust loss function ρ_j in local BA. For problems up to a few hundred cameras, we use a sparse direct solver, and for larger problems, we rely on PCG. We use Ceres Solver [2] and provide the option to share camera models of arbitrary complexity among any combination of images. For unordered Internet photos, we rely on a simple camera model with one radial distortion parameter, as the estimation relies on pure self-calibration.

Filtering. After BA, some observations do not conform with the model. Accordingly, we filter observations with large reprojection errors [53, 62]. Moreover, for each point, we check for well-conditioned geometry by enforcing a minimum triangulation angle over all pairs of viewing rays
[53]. After global BA, we also check for degenerate cameras, e.g. those caused by panoramas or artificially enhanced images. Typically, those cameras only have outlier observations or their intrinsics converge to a bogus minimum. Hence, we do not constrain the focal length and distortion parameters to an a priori fixed range but let them optimize freely in BA. Since principal point calibration is an ill-posed problem [15], we keep it fixed at the image center for uncalibrated cameras. Cameras with an abnormal field of view or a large distortion coefficient magnitude are considered incorrectly estimated and filtered after global BA.

Re-Triangulation. Analogous to VisualSFM, we perform re-triangulation (RT) to account for drift effects prior to global BA (pre-BA RT). However, BA often significantly improves camera and point parameters. Hence, we propose to extend the very effective pre-BA RT with an additional post-BA RT step. The purpose of this step is to improve the completeness of the reconstruction (compare Sec. 4.3) by continuing the tracks of points that previously failed to triangulate, e.g. due to inaccurate poses. Instead of increasing the triangulation thresholds, we only continue tracks with observations whose errors are below the filtering thresholds. In addition, we attempt to merge tracks and thereby provide increased redundancy for the next BA step.

Iterative Refinement. Bundler and VisualSFM perform a single instance of BA and filtering. Due to drift or the pre-BA RT, usually a significant portion of the observations in BA are outliers and subsequently filtered. Since BA is severely affected by outliers, a second step of BA can significantly improve the results. We therefore propose to perform BA, RT, and filtering in an iterative optimization until the number of filtered observations and post-BA RT points diminishes. In most cases, the results improve dramatically after the second iteration, and the optimization converges. Sec.
5 demonstrates that the proposed iterative refinement significantly boosts reconstruction completeness.

4.5. Redundant View Mining

BA is a major performance bottleneck in SfM. In this section, we propose a method that exploits the inherent characteristics of incremental SfM and dense photo collections for a more efficient parameterization of BA by clustering redundant cameras into groups of high scene overlap.

Internet photo collections usually have a highly non-uniform visibility pattern due to the varying popularity of points of interest. Moreover, unordered collections are usually clustered into fragments of points that are co-visible in many images. A number of previous works exploit this fact to improve the efficiency of BA, including Kushal et al. [33], who analyze the visibility pattern for efficient preconditioning of the reduced camera system. Ni et al. [43] partition the cameras and points into submaps, which are connected through separator variables, by posing the partitioning as a graph cut problem on the graph of connected camera and point parameters. BA then alternates between fixing the cameras and points and refining the separator variables, and vice versa. Another approach by Carlone et al. [11] collapses multiple low-rank points into a single factor imposing a high-rank constraint on the cameras, providing a computational advantage when cameras share many points.

Our proposed method is motivated by these previous works. Similar to Ni et al., we partition the problem into submaps whose internal parameters are factored out. We have three main contributions. First, an efficient camera grouping scheme leveraging the inherent properties of SfM and replacing the expensive graph cut employed by Ni et al. Second, instead of clustering many cameras into one submap, we partition the scene into many small, highly overlapping camera groups.
The cameras within each group are collapsed into a single camera, thereby reducing the cost of solving the reduced camera system. Third, as a consequence of the much smaller, highly overlapping camera groups, we eliminate the alternation scheme of Ni et al. by skipping the separator variable optimization.

SfM groups images and points into two sets based on whether their parameters were affected by the latest incremental model extension. For large problems, most of the scene remains unaffected, since the model usually extends locally. BA naturally optimizes more for the newly extended parts, while other parts only improve in case of drift [62]. Moreover, Internet photo collections often have an uneven camera distribution with many redundant viewpoints. Motivated by these observations, we partition the unaffected scene parts into groups G = {G_r | r = 1...N_G} of highly overlapping images and parameterize each group G_r as a single camera. Images affected by the latest extension are grouped independently to allow for an optimal refinement of their parameters; note that this results in the standard BA parameterization (Eq. 1). For unaffected images, we create groups of cardinality N_{G_r}. We consider an image as affected if it was added during the latest model extension or if more than a certain ratio of its observations have a reprojection error larger than a certain pixel threshold (to refine re-triangulated cameras). Images within a group should be as redundant as possible [43], and the number of co-visible points between images is a measure to describe their degree of mutual interaction [33]. For a scene with N_X points, each image can be described by a binary visibility vector v_i ∈ {0, 1}^{N_X}, where the n-th entry of v_i is 1 if point X_n is visible in image i and 0 otherwise. The degree of interaction between images a and b is calculated using bitwise operations on their vectors

    V_ab = ‖v_a ∧ v_b‖ / ‖v_a ∨ v_b‖,    (6)

where ‖·‖ counts the number of 1-entries. To build groups, we sort the images as Ī = {I_i | ‖v_i‖ ≥ ‖v_{i+1}‖}.
We initialize a group G_r by removing the first image I_a from Ī and finding the image I_b that maximizes V_ab.
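As a concrete illustration, the overlap score of Eq. (6) and the greedy grouping described here can be sketched as follows. This is a simplified sketch: the function names and the parameters V and S (the overlap threshold and maximum group size from the text) are our own rendering, and the spatial nearest-neighbor search heuristic is omitted.

```python
import numpy as np

def overlap(va, vb):
    """Eq. (6): V_ab = |v_a AND v_b| / |v_a OR v_b| on binary visibility vectors."""
    union = np.count_nonzero(va | vb)
    return np.count_nonzero(va & vb) / union if union else 0.0

def build_groups(visibility, V=0.4, S=10):
    """Greedy grouping sketch; visibility is an (images x points) boolean matrix.

    Images are sorted by decreasing number of visible points. Each group is
    seeded with the first remaining image and grown with the image of maximum
    overlap while V_ab > V and the group holds fewer than S images.
    """
    remaining = list(np.argsort(-visibility.sum(axis=1)))
    groups = []
    while remaining:
        seed = remaining.pop(0)
        group = [seed]
        while remaining and len(group) < S:
            scores = [overlap(visibility[seed], visibility[b]) for b in remaining]
            best = int(np.argmax(scores))
            if scores[best] <= V:
                break  # otherwise, a new group is initialized
            group.append(remaining.pop(best))
        groups.append(group)
    return groups
```

In practice the candidate set for each seed would be restricted to nearby images (the K_r nearest-neighbor heuristic described next), which avoids the quadratic scan over all remaining images.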
Figure 4. Next best view scores for Gaussian distributed points xj ∈ [0, 1] × [0, 1] with mean μ and std. dev. σ. Score S w.r.t. uniformity (left) and number of points for μ = 0.5 (right).

If V_ab > V and |G_r| < S, image I_b is removed from Ī and added to group G_r. Otherwise, a new group is initialized. To reduce the time of finding I_b, we employ the heuristic of limiting the search to the K_r spatial nearest neighbors with a common viewing direction in the range of ±β degrees, motivated by the fact that those images have a high likelihood of sharing many points. Each image within a group is then parameterized w.r.t. a common group-local coordinate frame. The BA cost function (Eq. 1) for grouped images is

    E_g = Σ_j ρ_j( ||π_g(G_r, P_c, X_k) − x_j||²₂ )    (7)

using the extrinsic group parameters G_r ∈ SE(3) and fixed P_c. The projection matrix of an image in group r is then defined as the concatenation of the group and image pose P_cr = P_c G_r. The overall cost Ē is the sum of the grouped and ungrouped cost contributions. For efficient concatenation of the rotational components of G_r and P_i, we rely on quaternions. A larger group size leads to a greater performance benefit due to the smaller relative overhead of computing π_g over π. Note that even for a group size of two, we observe a computational benefit.
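For illustration, the pose concatenation P_cr = P_c G_r with quaternion rotations might look as follows. This is a minimal sketch under our own conventions (quaternions stored as (w, x, y, z), poses as rotation plus translation); the function names are not from the released implementation.

```python
import numpy as np

def quat_mul(q1, q2):
    # Hamilton product of quaternions stored as (w, x, y, z)
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

def quat_rotate(q, p):
    # Rotate 3D point p by unit quaternion q: q * (0, p) * conj(q)
    conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    return quat_mul(quat_mul(q, np.concatenate(([0.0], p))), conj)[1:]

def compose(q_c, t_c, q_g, t_g):
    # P_cr = P_c G_r: the group transform G_r maps world to group frame,
    # the image pose P_c maps the group frame into the camera frame.
    return quat_mul(q_c, q_g), quat_rotate(q_c, t_g) + t_c
```

Composing the two rotations as a quaternion product costs a handful of multiplications, which is why the relative overhead of π_g over π shrinks as group size grows.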
In addition, the performance benefit depends on the problem size, as a reduction in the number of cameras affects the cubic computational complexity of direct methods more than the linear complexity of indirect methods (Sec. 2).

5. Experiments

We run experiments on a large variety of datasets to evaluate both the proposed components and the overall system compared to state-of-the-art incremental (Bundler [53], VisualSFM [62]) and global SfM systems (DISCO [14], Theia [55]¹). The 17 datasets contain a total of 144,953 unordered Internet photos, distributed over a large area and with highly varying camera density. In addition, Quad [14] has ground-truth camera locations. Throughout all experiments, we use RootSIFT features and match each image against its 100 nearest neighbors using a vocabulary tree trained on unrelated datasets. To ensure comparability between the different methods, correspondence search is not included in the timings; all timings are measured on a 2.7GHz machine with 256GB RAM.

¹Results for Theia kindly provided by the authors [55].

Figure 5. Next best view results for Quad: Shared number of registered images and reconstruction error during incremental SfM.

Figure 6. Triangulation statistics for the Dubrovnik dataset. Left: Outlier ratio distribution of feature tracks. Right: Average number of samples required to triangulate an N-view point.

Figure 7. Average relative runtimes using standard global BA and exhaustive, recursive triangulation (1), and grouped BA and RANSAC-based, recursive triangulation (2). The runtime for Initialization and Next Best View Selection (all strategies) is smaller than 0.1%.
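Fig. 6 (right) reports the average number of samples needed to triangulate an N-view point. For a RANSAC estimator with confidence η and inlier ratio ε, the number of random draws needed so that at least one minimal sample (two views for triangulation) is outlier-free follows the standard relation K = log(1 − η) / log(1 − ε^s). A minimal sketch, with function name and default values of our own choosing:

```python
import math

def ransac_num_samples(inlier_ratio, confidence=0.999, sample_size=2):
    # K = log(1 - eta) / log(1 - eps^s): number of random minimal samples
    # such that at least one is all-inlier with probability `confidence`.
    p_good = inlier_ratio ** sample_size
    if p_good >= 1.0:
        return 1
    if p_good <= 0.0:
        return math.inf
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))
```

For example, at η = 0.999 an inlier ratio of 0.5 needs only about 25 two-view samples, while the count grows rapidly as the inlier ratio shrinks; this same relation underlies the minimum inlier ratio quoted for the 10K-iteration cap on the exhaustive baseline in the triangulation experiment.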
Next Best View Selection. A synthetic experiment (Fig. 4) evaluates how well the score S reflects the number and distribution of points. We use L = 6 pyramid levels and generate Gaussian-distributed image points with spread σ and location μ. A larger σ and a more central μ correspond to a more uniform distribution and correctly produce a higher score. Similarly, the score is dominated by the number of points when there are few and otherwise by their distribution in the image. Another experiment (Fig. 5) compares our method (Pyramid) to existing strategies in terms of the reconstruction error. The other methods are Number [53], which maximizes the number of triangulated points, and Ratio, which maximizes the ratio of visible over potentially visible points. After each image registration, we measure the number of registered images shared between the strategies (intersection over union) and the reconstruction error as the median distance to the ground-truth camera locations. While all strategies converge to the same set of registered images, our method produces the most accurate reconstruction by choosing a better registration order for the images.

Robust and Efficient Triangulation. An experiment on the Dubrovnik dataset (Fig. 6 and Table 2) evaluates our method on 2.9M feature tracks composed from 47M verified matches. We compare against Bundler and an exhaustive strategy that samples all pairwise combinations in a track. We set α = 2°, t = 8px, and ε₀ = 0.03. To avoid combinatorial explosion, we limit the exhaustive approach to 10K iterations, i.e. ε_min ≈ 0.02 with η = 0.999. The diverse inlier ratio distribution (as determined with the recursive, exhaustive approach) evidences the need for a robust triangulation method. Our proposed recursive approaches recover significantly longer tracks and overall more track elements than their non-recursive counterparts. Note that the higher number of points for the recursive RANSAC-based methods corresponds to the slightly reduced track lengths. The RANSAC-based approach yields just marginally inferior tracks but is much faster (10-40x). By varying η, it is easy to balance speed against completeness.

Table 1. Reconstruction results for state-of-the-art SfM systems on large-scale unordered Internet photo collections (# Images, # Registered, # Points with Avg. Track Length, Time [s], and Avg. Reproj. Error [px] for Ours, Theia, Bundler, and VSFM on Rome [14], Quad [14], Dubrovnik [36], Alamo [61], Ellis Island [61], Gendarmenmarkt [61], Madrid Metropolis [61], Montreal Notre Dame [61], NYC Library [61], Piazza del Popolo [61], Piccadilly [61], Roman Forum [61], Tower of London [61], Trafalgar [61], Union Square [61], Vienna Cathedral [61], and Yorkminster [61]).

Table 2. Triangulation results using η1 = 0.99 and η2 = 0.5.

                          #Points   #Elements  Avg. Track Length  #Samples
    Bundler               713,824   5.58 M     7.824              136.34 M
    Non-recursive
      Exhaustive          861,591   7.64 M     8.877              120.44 M
      RANSAC, η1          861,591   7.54 M     8.760              3.89 M
      RANSAC, η2          860,422   7.46 M     8.682              3.03 M
    Recursive
      Exhaustive          894,294   8.05 M     9.003              145.22 M
      RANSAC, η1          902,844   8.03 M     8.888              12.69 M
      RANSAC, η2          906,501   7.98 M     8.795              7.82 M

Figure 8. Number of cameras in standard and grouped BA using r = 0.02, S = 10, and varying scene overlap V.

Redundant View Mining. We evaluate redundant view mining on an unordered collection of densely distributed images. Fig. 8 shows the growth rate of the parameterized cameras in global BA using a fixed number of BA iterations. Depending on the enforced scene overlap V, we can reduce the time for solving the reduced camera system by a significant factor.
The effective speedup of the total runtime is 5% (V = 0.6), 14% (V = 0.3), and 32% (V = 0.1), while the average reprojection error degrades from 0.26px (standard BA) to 0.27px, 0.28px, and 0.29px, respectively. The reconstruction quality is comparable for all choices of V > 0.3 and increasingly degrades for smaller V. Using V = 0.4, the runtime of the entire pipeline for Colosseum reduces by 36% yet results in an equivalent reconstruction.

System. Table 1 and Fig. 1 demonstrate the performance of the overall system and thereby also evaluate the performance of the individual proposed components of the system. For each dataset, we report the largest reconstructed component. Theia is the fastest method, while our method achieves slightly worse timings than VisualSFM and is more than 50 times faster than Bundler. Fig. 7 shows the relative timings of the individual modules. For all datasets, we significantly outperform any other method in terms of completeness, especially for the larger models. Importantly, the increased track lengths result in higher redundancy in BA. In addition, we achieve the best pose accuracy for the Quad dataset: DISCO 1.16m, Bundler 1.01m, VisualSFM 0.89m, and Ours 0.85m. Fig. 9 shows a result of Bundler compared to our method. We encourage readers to view the supplementary material for additional visual comparisons of the results, demonstrating the superior robustness, completeness, and accuracy of our method.

Figure 9. Reconstruction of Gendarmenmarkt [61] for Bundler (left) and our method (right).

6. Conclusion

This paper proposes an SfM algorithm that overcomes key challenges to make a further step towards a general-purpose SfM system. The proposed components of the algorithm improve the state of the art in terms of completeness, robustness, accuracy, and efficiency. Comprehensive experiments on challenging large-scale datasets demonstrate the performance of the individual components and the overall system.
The entire algorithm is released to the public as an open-source implementation.

Acknowledgements. We thank J. Heinly and T. Price for proofreading. We also thank C. Sweeney for producing the Theia experiments. Supported in part by the NSF No. IIS-1349074, No. CNS-1405847, and the MITRE Corp.