XGBoost介绍.pdf

发布时间：2022-05-31 发布人：admin 分类：说明书资料大小：0.92M 资料格式：pdf 举报版权申诉

houqiking-10352170-4744300845147738443.pdf-第1页.png

第1页 / 共13页

houqiking-10352170-4744300845147738443.pdf-第2页.png

第2页 / 共13页

houqiking-10352170-4744300845147738443.pdf-第3页.png

第3页 / 共13页

houqiking-10352170-4744300845147738443.pdf-第4页.png

第4页 / 共13页

houqiking-10352170-4744300845147738443.pdf-第5页.png

第5页 / 共13页

houqiking-10352170-4744300845147738443.pdf-第6页.png

第6页 / 共13页

houqiking-10352170-4744300845147738443.pdf-第7页.png

第7页 / 共13页

houqiking-10352170-4744300845147738443.pdf-第8页.png

第8页 / 共13页

1 Introduction

2 Tree Boosting in a NutShell

2.1 Regularized Learning Objective

2.2 Gradient Tree Boosting

2.3 Shrinkage and Column Subsampling

3 Split Finding Algorithms

3.1 Basic Exact Greedy Algorithm

3.2 Approximate Algorithm

3.3 Weighted Quantile Sketch

3.4 Sparsity-aware Split Finding

4 System Design

4.1 Column Block for Parallel Learning

4.2 Cache-aware Access

4.3 Blocks for Out-of-core Computation

5 Related Works

6 End to End Evaluations

6.1 System Implementation

6.2 Dataset and Setup

6.3 Classification

6.4 Learning to Rank

6.5 Out-of-core Experiment

6.6 Distributed Experiment

7 Conclusion

8 References

A Weighted Quantile Sketch

A.1 Formalization and Definitions

A.2 Construction of Initial Summary

A.3 Merge Operation

A.4 Prune Operation

XGBoost: A Scalable Tree Boosting System Tianqi Chen University of Washington tqchen@cs.washington.edu Carlos Guestrin University of Washington guestrin@cs.washington.edu 6 1 0 2 y a M 3 2 ] G L . s c [ 2 v 4 5 7 2 0 . 3 0 6 1 : v i X r a ABSTRACT Tree boosting is a highly eﬀective and widely used machine learning method. In this paper, we describe a scalable end- to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quan- tile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compres- sion and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems. CCS Concepts •Methodologies → Machine learning; •Information systems → Data mining; Keywords Large-scale Machine Learning 1. INTRODUCTION Machine learning and data-driven approaches are becom- ing very important in many areas. Smart spam classiﬁers protect our email by learning from massive amounts of spam data and user feedback; advertising systems learn to match the right ads with the right context; fraud detection systems protect banks from malicious attackers; anomaly event de- tection systems help experimental physicists to ﬁnd events that lead to new physics. There are two important factors that drive these successful applications: usage of eﬀective (statistical) models that capture the complex data depen- dencies and scalable learning systems that learn the model of interest from large datasets. Among the machine learning methods used in practice, gradient tree boosting [10]1 is one technique that shines 1Gradient tree boosting is also known as gradient boosting machine (GBM) or gradient boosted regression tree (GBRT) Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for proﬁt or commercial advantage and that copies bear this notice and the full cita- tion on the ﬁrst page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- publish, to post on servers or to redistribute to lists, requires prior speciﬁc permission and/or a fee. Request permissions from permissions@acm.org. KDD ’16, August 13-17, 2016, San Francisco, CA, USA c 2016 ACM. ISBN 978-1-4503-4232-2/16/08. . . $15.00 DOI: http://dx.doi.org/10.1145/2939672.2939785 in many applications. Tree boosting has been shown to give state-of-the-art results on many standard classiﬁcation benchmarks [16]. LambdaMART [5], a variant of tree boost- ing for ranking, achieves state-of-the-art result for ranking problems. Besides being used as a stand-alone predictor, it is also incorporated into real-world production pipelines for ad click through rate prediction [15]. Finally, it is the de- facto choice of ensemble method and is used in challenges such as the Netﬂix prize [3]. In this paper, we describe XGBoost, a scalable machine learning system for tree boosting. The system is available as an open source package2. The impact of the system has been widely recognized in a number of machine learning and data mining challenges. Take the challenges hosted by the ma- chine learning competition site Kaggle for example. Among the 29 challenge winning solutions 3 published at Kaggle’s blog during 2015, 17 solutions used XGBoost. Among these solutions, eight solely used XGBoost to train the model, while most others combined XGBoost with neural nets in en- sembles. For comparison, the second most popular method, deep neural nets, was used in 11 solutions. The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10. Moreover, the winning teams reported that ensemble meth- ods outperform a well-conﬁgured XGBoost by only a small amount [1]. These results demonstrate that our system gives state-of- the-art results on a wide range of problems. Examples of the problems in these winning solutions include: store sales prediction; high energy physics event classiﬁcation; web text classiﬁcation; customer behavior prediction; motion detec- tion; ad click through rate prediction; malware classiﬁcation; product categorization; hazard risk prediction; massive on- line course dropout rate prediction. While domain depen- dent data analysis and feature engineering play an important role in these solutions, the fact that XGBoost is the consen- sus choice of learner shows the impact and importance of our system and tree boosting. The most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings. The scalability of XGBoost is due to several important systems and algorithmic optimizations. These innovations include: a novel tree learning algorithm is for handling sparse data; a theoretically justiﬁed weighted 2https://github.com/dmlc/xgboost 3Solutions come from of top-3 teams of each competitions.

2.1 Regularized Learning Objective For a given data set with n examples and m features D = {(xi, yi)} (|D| = n, xi ∈ Rm, yi ∈ R), a tree ensem- ble model (shown in Fig. 1) uses K additive functions to predict the output. ˆyi = φ(xi) = fk(xi), fk ∈ F , (1) K k=1 Figure 1: Tree Ensemble Model. The ﬁnal predic- tion for a given example is the sum of predictions from each tree. quantile sketch procedure enables handling instance weights in approximate tree learning. Parallel and distributed com- puting makes learning faster which enables quicker model ex- ploration. More importantly, XGBoost exploits out-of-core computation and enables data scientists to process hundred millions of examples on a desktop. Finally, it is even more exciting to combine these techniques to make an end-to-end system that scales to even larger data with the least amount of cluster resources. The major contributions of this paper is listed as follows: • We design and build a highly scalable end-to-end tree boosting system. • We propose a theoretically justiﬁed weighted quantile sketch for eﬃcient proposal calculation. • We introduce a novel sparsity-aware algorithm for par- allel tree learning. • We propose an eﬀective cache-aware block structure for out-of-core tree learning. While there are some existing works on parallel tree boost- ing [22, 23, 19], the directions such as out-of-core compu- tation, cache-aware and sparsity-aware learning have not been explored. More importantly, an end-to-end system that combines all of these aspects gives a novel solution for real-world use-cases. This enables data scientists as well as researchers to build powerful variants of tree boosting al- gorithms [7, 8]. Besides these major contributions, we also make additional improvements in proposing a regularized learning objective, which we will include for completeness. The remainder of the paper is organized as follows. We will ﬁrst review tree boosting and introduce a regularized objective in Sec. 2. We then describe the split ﬁnding meth- ods in Sec. 3 as well as the system design in Sec. 4, including experimental results when relevant to provide quantitative support for each optimization we describe. Related work is discussed in Sec. 5. Detailed end-to-end evaluations are included in Sec. 6. Finally we conclude the paper in Sec. 7. 2. TREE BOOSTING IN A NUTSHELL We review gradient tree boosting algorithms in this sec- tion. The derivation follows from the same idea in existing literatures in gradient boosting. Specicially the second order method is originated from Friedman et al. [12]. We make mi- nor improvements in the reguralized objective, which were found helpful in practice. where F = {f (x) = wq(x)}(q : Rm → T, w ∈ RT ) is the space of regression trees (also known as CART). Here q rep- resents the structure of each tree that maps an example to the corresponding leaf index. T is the number of leaves in the tree. Each fk corresponds to an independent tree structure q and leaf weights w. Unlike decision trees, each regression tree contains a continuous score on each of the leaf, we use wi to represent score on i-th leaf. For a given example, we will use the decision rules in the trees (given by q) to classify it into the leaves and calculate the ﬁnal prediction by sum- ming up the score in the corresponding leaves (given by w). To learn the set of functions used in the model, we minimize the following regularized objective. L(φ) = l(ˆyi, yi) + Ω(fk) i where Ω(f ) = γT + k λw2 1 2 (2) Here l is a diﬀerentiable convex loss function that measures the diﬀerence between the prediction ˆyi and the target yi. The second term Ω penalizes the complexity of the model (i.e., the regression tree functions). The additional regular- ization term helps to smooth the ﬁnal learnt weights to avoid over-ﬁtting. Intuitively, the regularized objective will tend to select a model employing simple and predictive functions. A similar regularization technique has been used in Regu- larized greedy forest (RGF) [25] model. Our objective and the corresponding learning algorithm is simpler than RGF and easier to parallelize. When the regularization parame- ter is set to zero, the objective falls back to the traditional gradient tree boosting. 2.2 Gradient Tree Boosting The tree ensemble model in Eq. (2) includes functions as parameters and cannot be optimized using traditional opti- mization methods in Euclidean space. Instead, the model is trained in an additive manner. Formally, let ˆy(t) be the prediction of the i-th instance at the t-th iteration, we will need to add ft to minimize the following objective. i L(t) = l(yi, ˆyi (t−1) + ft(xi)) + Ω(ft) n i=1 L(t) n i=1 This means we greedily add the ft that most improves our model according to Eq. (2). Second-order approximation can be used to quickly optimize the objective in the general setting [12]. [l(yi, ˆy(t−1)) + gift(xi) + 1 2 hif 2 t (xi)] + Ω(ft) ˆy(t−1) l(yi, ˆy(t−1)) where gi = ∂ˆy(t−1) l(yi, ˆy(t−1)) and hi = ∂2 are ﬁrst and second order gradient statistics on the loss func- tion. We can remove the constant terms to obtain the fol-

˜L(t) = [gift(xi) + 1 2 hif 2 t (xi)] + Ω(ft) (3) Algorithm 2: Approximate Algorithm for Split Finding Figure 2: Structure Score Calculation. We only need to sum up the gradient and second order gra- dient statistics on each leaf, then apply the scoring formula to get the quality score. lowing simpliﬁed objective at step t. n i=1 Deﬁne Ij = {i|q(xi) = j} as the instance set of leaf j. We can rewrite Eq (3) by expanding Ω as follows n T i=1 j=1 i∈Ij [( ˜L(t) = = T 1 2 λ j=1 (4) [gift(xi) + hif 2 t (xi)] + γT + w2 j gi)wj + 1 2 ( hi + λ)w2 j ] + γT Algorithm 1: Exact Greedy Algorithm for Split Finding Input: I, instance set of current node Input: d, feature dimension gain ← 0 G ← i∈I gi, H ← i∈I hi for k = 1 to m do GL ← 0, HL ← 0 for j in sorted(I, by xjk) do GL ← GL + gj, HL ← HL + hj GR ← G − GL, HR ← H − HL score ← max(score, G2 HL+λ + G2 L HR+λ − G2 H+λ ) R end end Output: Split with max score 1 2 i∈Ij ( T i∈Ij j=1 For a ﬁxed structure q(x), we can compute the optimal weight w∗ j of leaf j by j = − ∗ w i∈Ij gi hi + λ , (5) and calculate the corresponding optimal value by ˜L(t)(q) = − 1 2 i∈Ij gi)2 hi + λ i∈Ij + γT. (6) ( ( − ( Eq (6) can be used as a scoring function to measure the quality of a tree structure q. This score is like the impurity score for evaluating decision trees, except that it is derived for a wider range of objective functions. Fig. 2 illustrates how this score can be calculated. Normally it is impossible to enumerate all the possible tree structures q. A greedy algorithm that starts from a single leaf and iteratively adds branches to the tree is used instead. Assume that IL and IR are the instance sets of left and right nodes after the split. Lettting I = IL ∪ IR, then the loss reduction after the split is given by Lsplit = 1 2 i∈IL gi)2 hi + λ i∈IL + i∈IR gi)2 hi + λ i∈IR i∈I gi)2 i∈I hi + λ −γ (7) This formula is usually used in practice for evaluating the split candidates. 2.3 Shrinkage and Column Subsampling Besides the regularized objective mentioned in Sec. 2.1, two additional techniques are used to further prevent over- ﬁtting. The ﬁrst technique is shrinkage introduced by Fried- man [11]. Shrinkage scales newly added weights by a factor η after each step of tree boosting. Similar to a learning rate in tochastic optimization, shrinkage reduces the inﬂuence of for k = 1 to m do Propose Sk = {sk1, sk2,··· skl} by percentiles on feature k. Proposal can be done per tree (global), or per split(local). end for k = 1 to m do Gkv ←= Hkv ←= j∈{j|sk,v≥xjk>sk,v−1} gj j∈{j|sk,v≥xjk>sk,v−1} hj end Follow same step as in previous section to ﬁnd max score only among proposed splits. each individual tree and leaves space for future trees to im- prove the model. The second technique is column (feature) subsampling. This technique is used in RandomForest [4, 13], It is implemented in a commercial software TreeNet 4 for gradient boosting, but is not implemented in existing opensource packages. According to user feedback, using col- umn sub-sampling prevents over-ﬁtting even more so than the traditional row sub-sampling (which is also supported). The usage of column sub-samples also speeds up computa- tions of the parallel algorithm described later. 3. SPLIT FINDING ALGORITHMS 3.1 Basic Exact Greedy Algorithm One of the key problems in tree learning is to ﬁnd the best split as indicated by Eq (7). In order to do so, a split ﬁnding algorithm enumerates over all the possible splits on all the features. We call this the exact greedy algorithm. Most existing single machine tree boosting implementations, such as scikit-learn [20], R’s gbm [21] as well as the single machine version of XGBoost support the exact greedy algo- rithm. The exact greedy algorithm is shown in Alg. 1. It is computationally demanding to enumerate all the possible splits for continuous features. In order to do so eﬃciently, the algorithm must ﬁrst sort the data according to feature values and visit the data in sorted order to accumulate the gradient statistics for the structure score in Eq (7). 3.2 Approximate Algorithm 4https://www.salford-systems.com/products/treenet

Figure 3: Comparison of test AUC convergence on Higgs 10M dataset. The eps parameter corresponds to the accuracy of the approximate sketch. This roughly translates to 1 / eps buckets in the proposal. We ﬁnd that local proposals require fewer buckets, because it reﬁne split candidates. The exact greedy algorithm is very powerful since it enu- merates over all possible splitting points greedily. However, it is impossible to eﬃciently do so when the data does not ﬁt entirely into memory. Same problem also arises in the dis- tributed setting. To support eﬀective gradient tree boosting in these two settings, an approximate algorithm is needed. We summarize an approximate framework, which resem- bles the ideas proposed in past literatures [17, 2, 22], in Alg. 2. To summarize, the algorithm ﬁrst proposes can- didate splitting points according to percentiles of feature distribution (a speciﬁc criteria will be given in Sec. 3.3). The algorithm then maps the continuous features into buck- ets split by these candidate points, aggregates the statistics and ﬁnds the best solution among proposals based on the aggregated statistics. There are two variants of the algorithm, depending on when the proposal is given. The global variant proposes all the candidate splits during the initial phase of tree construc- tion, and uses the same proposals for split ﬁnding at all lev- els. The local variant re-proposes after each split. The global method requires less proposal steps than the local method. However, usually more candidate points are needed for the global proposal because candidates are not reﬁned after each split. The local proposal reﬁnes the candidates after splits, and can potentially be more appropriate for deeper trees. A comparison of diﬀerent algorithms on a Higgs boson dataset is given by Fig. 3. We ﬁnd that the local proposal indeed requires fewer candidates. The global proposal can be as accurate as the local one given enough candidates. Most existing approximate algorithms for distributed tree learning also follow this framework. Notably, it is also possi- ble to directly construct approximate histograms of gradient statistics [22]. It is also possible to use other variants of bin- ning strategies instead of quantile [17]. Quantile strategy beneﬁt from being distributable and recomputable, which we will detail in next subsection. From Fig. 3, we also ﬁnd that the quantile strategy can get the same accuracy as exact greedy given reasonable approximation level. Our system eﬃciently supports exact greedy for the single machine setting, as well as approximate algorithm with both local and global proposal methods for all settings. Users can freely choose between the methods according to their needs. Figure 4: Tree structure with default directions. An example will be classiﬁed into the default direction when the feature needed for the split is missing. 3.3 Weighted Quantile Sketch One important step in the approximate algorithm is to propose candidate split points. Usually percentiles of a fea- ture are used to make candidates distribute evenly on the data. Formally, let multi-set Dk = {(x1k, h1), (x2k, h2)··· (xnk, hn)} represent the k-th feature values and second order gradient statistics of each training instances. We can deﬁne a rank functions rk : R → [0, +∞) as 1 rk(z) = (x,h)∈Dk h (x,h)∈Dk,x

Figure 6: Block structure for parallel learning. Each column in a block is sorted by the corresponding feature value. A linear scan over one column in the block is suﬃcient to enumerate all the split points. Algorithm 3: Sparsity-aware Split Finding Input: I, instance set of current node Input: Ik = {i ∈ I|xik = missing} Input: d, feature dimension Also applies to the approximate setting, only collect statistics of non-missing entries into buckets gain ← 0 G ← i∈I , gi,H ← i∈I hi for k = 1 to m do // enumerate missing value goto right GL ← 0, HL ← 0 for j in sorted(Ik, ascent order by xjk) do GL ← GL + gj, HL ← HL + hj GR ← G − GL, HR ← H − HL score ← max(score, G2 HL+λ + G2 L HR+λ − G2 H+λ ) R end // enumerate missing value goto left GR ← 0, HR ← 0 for j in sorted(Ik, descent order by xjk) do GR ← GR + gj, HR ← HR + hj GL ← G − GR, HL ← H − HR score ← max(score, G2 HL+λ + G2 L HR+λ − G2 H+λ ) R end end Output: Split and default directions with max gain classiﬁed into the default direction. There are two choices of default direction in each branch. The optimal default di- rections are learnt from the data. The algorithm is shown in Alg. 3. The key improvement is to only visit the non-missing entries Ik. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values. The same algorithm can also be applied when the non-presence corresponds to a user speciﬁed value by limiting the enumeration only to consistent solutions. To the best of our knowledge, most existing tree learn- ing algorithms are either only optimized for dense data, or need speciﬁc procedures to handle limited cases such as cat- egorical encoding. XGBoost handles all sparsity patterns in a uniﬁed way. More importantly, our method exploits the sparsity to make computation complexity linear to number of non-missing entries in the input. Fig. 5 shows the com- parison of sparsity aware and a naive implementation on an Allstate-10K dataset (description of dataset given in Sec. 6). We ﬁnd that the sparsity aware algorithm runs 50 times faster than the naive version. This conﬁrms the importance of the sparsity aware algorithm. Figure 5: Impact of the sparsity aware algorithm on Allstate-10K. The dataset is sparse mainly due to one-hot encoding. The sparsity aware algorithm is more than 50 times faster than the naive version that does not take sparsity into consideration. 4. SYSTEM DESIGN 4.1 Column Block for Parallel Learning The most time consuming part of tree learning is to get the data into sorted order. In order to reduce the cost of sorting, we propose to store the data in in-memory units, which we called block. Data in each block is stored in the compressed column (CSC) format, with each column sorted by the corresponding feature value. This input data layout only needs to be computed once before training, and can be reused in later iterations. In the exact greedy algorithm, we store the entire dataset in a single block and run the split search algorithm by lin- early scanning over the pre-sorted entries. We do the split ﬁnding of all leaves collectively, so one scan over the block will collect the statistics of the split candidates in all leaf branches. Fig. 6 shows how we transform a dataset into the format and ﬁnd the optimal split using the block structure. The block structure also helps when using the approxi- mate algorithms. Multiple blocks can be used in this case, with each block corresponding to subset of rows in the dataset. Diﬀerent blocks can be distributed across machines, or stored on disk in the out-of-core setting. Using the sorted struc- ture, the quantile ﬁnding step becomes a linear scan over the sorted columns. This is especially valuable for local pro- posal algorithms, where candidates are generated frequently at each branch. The binary search in histogram aggregation also becomes a linear time merge style algorithm. Collecting statistics for each column can be parallelized, giving us a parallel algorithm for split ﬁnding. Importantly, the column block structure also supports column subsam- pling, as it is easy to select a subset of columns in a block. 124816Number of Threads0.031250.06250.1250.250.512481632Time per Tree(sec)Sparsity aware algorithmBasic algorithm

(a) Allstate 10M (b) Higgs 10M (c) Allstate 1M (d) Higgs 1M Figure 7: Impact of cache-aware prefetching in exact greedy algorithm. We ﬁnd that the cache-miss eﬀect impacts the performance on the large datasets (10 million instances). Using cache aware prefetching improves the performance by factor of two when the dataset is large. Figure 8: that can cause stall due to cache miss. Short range data dependency pattern Time Complexity Analysis Let d be the maximum depth of the tree and K be total number of trees. For the ex- act greedy algorithm, the time complexity of original spase aware algorithm is O(Kdx0 log n). Here we use x0 to denote number of non-missing entries in the training data. On the other hand, tree boosting on the block structure only cost O(Kdx0 + x0 log n). Here O(x0 log n) is the one time preprocessing cost that can be amortized. This analysis shows that the block structure helps to save an additional log n factor, which is signiﬁcant when n is large. For the approximate algorithm, the time complexity of original al- gorithm with binary search is O(Kdx0 log q). Here q is the number of proposal candidates in the dataset. While q is usually between 32 and 100, the log factor still introduces overhead. Using the block structure, we can reduce the time to O(Kdx0 +x0 log B), where B is the maximum num- ber of rows in each block. Again we can save the additional log q factor in computation. 4.2 Cache-aware Access While the proposed block structure helps optimize the computation complexity of split ﬁnding, the new algorithm requires indirect fetches of gradient statistics by row index, since these values are accessed in order of feature. This is a non-continuous memory access. A naive implementation of split enumeration introduces immediate read/write de- pendency between the accumulation and the non-continuous memory fetch operation (see Fig. 8). This slows down split ﬁnding when the gradient statistics do not ﬁt into CPU cache and cache miss occur. For the exact greedy algorithm, we can alleviate the prob- lem by a cache-aware prefetching algorithm. Speciﬁcally, we allocate an internal buﬀer in each thread, fetch the gra- dient statistics into it, and then perform accumulation in a mini-batch manner. This prefetching changes the direct read/write dependency to a longer dependency and helps to reduce the runtime overhead when number of rows in the is large. Figure 7 gives the comparison of cache-aware vs. (a) Allstate 10M (b) Higgs 10M Figure 9: The impact of block size in the approxi- mate algorithm. We ﬁnd that overly small blocks re- sults in ineﬃcient parallelization, while overly large blocks also slows down training due to cache misses. non cache-aware algorithm on the the Higgs and the All- state dataset. We ﬁnd that cache-aware implementation of the exact greedy algorithm runs twice as fast as the naive version when the dataset is large. For approximate algorithms, we solve the problem by choos- ing a correct block size. We deﬁne the block size to be max- imum number of examples in contained in a block, as this reﬂects the cache storage cost of gradient statistics. Choos- ing an overly small block size results in small workload for each thread and leads to ineﬃcient parallelization. On the other hand, overly large blocks result in cache misses, as the gradient statistics do not ﬁt into the CPU cache. A good choice of block size balances these two factors. We compared various choices of block size on two data sets. The results 124816Number of Threads8163264128Time per Tree(sec)Basic algorithmCache-aware algorithm124816Number of Threads8163264128256Time per Tree(sec)Basic algorithmCache-aware algorithm124816Number of Threads0.250.51248Time per Tree(sec)Basic algorithmCache-aware algorithm124816Number of Threads0.250.51248Time per Tree(sec)Basic algorithmCache-aware algorithm124816Number of Threads48163264128Time per Tree(sec)block size=2^12block size=2^16block size=2^20block size=2^24124816Number of Threads48163264128256512Time per Tree(sec)block size=2^12block size=2^16block size=2^20block size=2^24

Table 1: Comparison of major tree boosting systems. System XGBoost pGBRT Spark MLLib H2O scikit-learn R GBM exact greedy yes no no no yes yes approximate global yes no yes yes no no approximate local yes yes no no no no out-of-core yes no no no no no sparsity aware yes no partially partially no partially parallel yes yes yes yes no no are given in Fig. 9. This result validates our discussion and shows that choosing 216 examples per block balances the cache property and parallelization. 4.3 Blocks for Out-of-core Computation One goal of our system is to fully utilize a machine’s re- sources to achieve scalable learning. Besides processors and memory, it is important to utilize disk space to handle data that does not ﬁt into main memory. To enable out-of-core computation, we divide the data into multiple blocks and store each block on disk. During computation, it is impor- tant to use an independent thread to pre-fetch the block into a main memory buﬀer, so computation can happen in con- currence with disk reading. However, this does not entirely solve the problem since the disk reading takes most of the computation time. It is important to reduce the overhead and increase the throughput of disk IO. We mainly use two techniques to improve the out-of-core computation. Block Compression The ﬁrst technique we use is block compression. The block is compressed by columns, and de- compressed on the ﬂy by an independent thread when load- ing into main memory. This helps to trade some of the computation in decompression with the disk reading cost. We use a general purpose compression algorithm for com- pressing the features values. For the row index, we substract the row index by the begining index of the block and use a 16bit integer to store each oﬀset. This requires 216 examples per block, which is conﬁrmed to be a good setting. In most of the dataset we tested, we achieve roughly a 26% to 29% compression ratio. Block Sharding The second technique is to shard the data onto multiple disks in an alternative manner. A pre-fetcher thread is assigned to each disk and fetches the data into an in-memory buﬀer. The training thread then alternatively reads the data from each buﬀer. This helps to increase the throughput of disk reading when multiple disks are available. 5. RELATED WORKS Our system implements gradient boosting [10], which per- forms additive optimization in functional space. Gradient tree boosting has been successfully used in classiﬁcation [12], learning to rank [5], structured prediction [8] as well as other ﬁelds. XGBoost incorporates a regularized model to prevent overﬁtting. This this resembles previous work on regularized greedy forest [25], but simpliﬁes the objective and algorithm for parallelization. Column sampling is a simple but eﬀective technique borrowed from RandomForest [4]. While sparsity- aware learning is essential in other types of models such as linear models [9], few works on tree learning have considered this topic in a principled way. The algorithm proposed in this paper is the ﬁrst uniﬁed approach to handle all kinds of sparsity patterns. There are several existing works on parallelizing tree learn- ing [22, 19]. Most of these algorithms fall into the ap- proximate framework described in this paper. Notably, it is also possible to partition data by columns [23] and ap- ply the exact greedy algorithm. This is also supported in our framework, and the techniques such as cache-aware pre- fecthing can be used to beneﬁt this type of algorithm. While most existing works focus on the algorithmic aspect of par- allelization, our work improves in two unexplored system di- rections: out-of-core computation and cache-aware learning. This gives us insights on how the system and the algorithm can be jointly optimized and provides an end-to-end system that can handle large scale problems with very limited com- puting resources. We also summarize the comparison be- tween our system and existing opensource implementations in Table 1. Quantile summary (without weights) is a classical prob- lem in the database community [14, 24]. However, the ap- proximate tree boosting algorithm reveals a more general problem – ﬁnding quantiles on weighted data. To the best of our knowledge, the weighted quantile sketch proposed in this paper is the ﬁrst method to solve this problem. The weighted quantile summary is also not speciﬁc to the tree learning and can beneﬁt other applications in data science and machine learning in the future. 6. END TO END EVALUATIONS 6.1 System Implementation We implemented XGBoost as an open source package5. The package is portable and reusable. It supports various weighted classiﬁcation and rank objective functions, as well as user deﬁned objective function. It is available in popular languages such as python, R, Julia and integrates naturally with language native data science pipelines such as scikit- learn. The distributed version is built on top of the rabit library6 for allreduce. The portability of XGBoost makes it available in many ecosystems, instead of only being tied to a speciﬁc platform. The distributed XGBoost runs natively on Hadoop, MPI Sun Grid engine. Recently, we also enable distributed XGBoost on jvm bigdata stacks such as Flink and Spark. The distributed version has also been integrated into cloud platform Tianchi7 of Alibaba. We believe that there will be more integrations in the future. 6.2 Dataset and Setup 5https://github.com/dmlc/xgboost 6https://github.com/dmlc/rabit 7https://tianchi.aliyun.com

Table 2: Dataset used in the Experiments. n m Dataset 10 M 4227 Allstate Higgs Boson 10 M 28 Yahoo LTRC 473K 700 Criteo 67 1.7 B Task Insurance claim classiﬁcation Event classiﬁcation Learning to Rank Click through rate prediction We used four datasets in our experiments. A summary of these datasets is given in Table 2. In some of the experi- ments, we use a randomly selected subset of the data either due to slow baselines or to demonstrate the performance of the algorithm with varying dataset size. We use a suﬃx to denote the size in these cases. For example Allstate-10K means a subset of the Allstate dataset with 10K instances. The ﬁrst dataset we use is the Allstate insurance claim dataset8. The task is to predict the likelihood and cost of an insurance claim given diﬀerent risk factors. In the exper- iment, we simpliﬁed the task to only predict the likelihood of an insurance claim. This dataset is used to evaluate the impact of sparsity-aware algorithm in Sec. 3.4. Most of the sparse features in this data come from one-hot encoding. We randomly select 10M instances as training set and use the rest as evaluation set. The second dataset is the Higgs boson dataset9 from high energy physics. The data was produced using Monte Carlo simulations of physics events. It contains 21 kinematic prop- erties measured by the particle detectors in the accelerator. It also contains seven additional derived physics quantities of the particles. The task is to classify whether an event corresponds to the Higgs boson. We randomly select 10M instances as training set and use the rest as evaluation set. The third dataset is the Yahoo! learning to rank challenge dataset [6], which is one of the most commonly used bench- marks in learning to rank algorithms. The dataset contains 20K web search queries, with each query corresponding to a list of around 22 documents. The task is to rank the docu- ments according to relevance of the query. We use the oﬃcial train test split in our experiment. The last dataset is the criteo terabyte click log dataset10. We use this dataset to evaluate the scaling property of the system in the out-of-core and the distributed settings. The data contains 13 integer features and 26 ID features of user, item and advertiser information. Since a tree based model is better at handling continuous features, we preprocess the data by calculating the statistics of average CTR and count of ID features on the ﬁrst ten days, replacing the ID fea- tures by the corresponding count statistics during the next ten days for training. The training set after preprocessing contains 1.7 billion instances with 67 features (13 integer, 26 average CTR statistics and 26 counts). The entire dataset is more than one terabyte in LibSVM format. We use the ﬁrst three datasets for the single machine par- allel setting, and the last dataset for the distributed and out-of-core settings. All the single machine experiments are conducted on a Dell PowerEdge R420 with two eight-core Intel Xeon (E5-2470) (2.3GHz) and 64GB of memory. If not speciﬁed, all the experiments are run using all the avail- 8https://www.kaggle.com/c/ClaimPredictionChallenge 9https://archive.ics.uci.edu/ml/datasets/HIGGS 10http://labs.criteo.com/downloads/download-terabyte- click-logs/ Table 3: Comparison of Exact Greedy Methods with 500 trees on Higgs-1M data. Method XGBoost XGBoost (colsample=0.5) scikit-learn R.gbm Time per Tree (sec) Test AUC 0.6841 0.6401 28.51 1.032 0.8304 0.8245 0.8302 0.6224 Figure 10: Comparison between XGBoost and pG- BRT on Yahoo LTRC dataset. Table 4: Comparison of Learning to Rank with 500 trees on Yahoo! LTRC Dataset Time per Tree (sec) NDCG@10 Method XGBoost XGBoost (colsample=0.5) pGBRT [22] 0.826 0.506 2.576 0.7892 0.7913 0.7915 able cores in the machine. The machine settings of the dis- tributed and the out-of-core experiments will be described in the corresponding section. In all the experiments, we boost trees with a common setting of maximum depth equals 8, shrinkage equals 0.1 and no column subsampling unless ex- plicitly speciﬁed. We can ﬁnd similar results when we use other settings of maximum depth. 6.3 Classiﬁcation In this section, we evaluate the performance of XGBoost on a single machine using the exact greedy algorithm on Higgs-1M data, by comparing it against two other commonly used exact greedy tree boosting implementations. Since scikit-learn only handles non-sparse input, we choose the dense Higgs dataset for a fair comparison. We use the 1M subset to make scikit-learn ﬁnish running in reasonable time. Among the methods in comparison, R’s GBM uses a greedy approach that only expands one branch of a tree, which makes it faster but can result in lower accuracy, while both scikit-learn and XGBoost learn a full tree. The results are shown in Table 3. Both XGBoost and scikit-learn give better performance than R’s GBM, while XGBoost runs more than 10x faster than scikit-learn. In this experiment, we also ﬁnd column subsamples gives slightly worse performance than using all the features. This could due to the fact that there are few important features in this dataset and we can beneﬁt from greedily select from all the features. 6.4 Learning to Rank We next evaluate the performance of XGBoost on the 124816Number of Threads0.512481632Time per Tree(sec)XGBoostpGBRT

分享到：

赞收藏

资料库

XGBoost介绍.pdf

相关推荐

人工智能

热门标签

最新资料