Introduction to Boosted Trees  
Tianqi Chen 
Oct. 22 2014 
Outline 
• Review of key concepts of supervised learning 
 
• Regression Tree and Ensemble (What are we Learning) 
 
• Gradient Boosting (How do we Learn) 
 
• Summary  
Elements in Supervised Learning  
• Notations: x_i ∈ R^d denotes the i-th training example 
• Model: how to make the prediction ŷ_i given x_i 
 Linear model: ŷ_i = Σ_j w_j x_ij (includes linear/logistic regression) 
 The prediction score ŷ_i can have different interpretations 
depending on the task 
 Linear regression: ŷ_i is the predicted score 
 Logistic regression: 1/(1 + exp(−ŷ_i)) is the predicted probability 
of the instance being positive 
 Others… for example, in ranking ŷ_i can be the rank score 
• Parameters: the things we need to learn from data 
 Linear model: Θ = {w_j | j = 1, …, d} 
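As a quick sketch of the model element, the linear score and the logistic link can be written in a few lines of Python (the weights and feature values below are illustrative, not learned from data):

```python
import numpy as np

def linear_score(w, x):
    """Linear model: y_hat = sum_j w_j * x_j."""
    return float(np.dot(w, x))

def sigmoid(score):
    """Logistic link: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-score))

w = np.array([0.5, -0.25])   # parameters Theta (illustrative values)
x = np.array([2.0, 4.0])     # one example x_i

score = linear_score(w, x)   # linear regression: score is the prediction
prob = sigmoid(score)        # logistic regression: P(instance is positive)
print(score)  # 0.0
print(prob)   # 0.5
```

The same raw score feeds both tasks; only the interpretation (and, later, the loss) changes.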
Elements continued: Objective Function 
• An objective function that appears everywhere: 
 Obj(Θ) = L(Θ) + Ω(Θ) 
 Training loss L(Θ) measures how well the model fits the 
training data 
 Regularization Ω(Θ) measures the complexity of the model 
• Loss on training data: L(Θ) = Σ_i l(y_i, ŷ_i) 
 Square loss: l(y_i, ŷ_i) = (y_i − ŷ_i)² 
 Logistic loss: l(y_i, ŷ_i) = y_i ln(1 + e^(−ŷ_i)) + (1 − y_i) ln(1 + e^(ŷ_i)) 
• Regularization: how complicated is the model? 
 L2 norm: Ω(w) = λ‖w‖² 
 L1 norm (lasso): Ω(w) = λ‖w‖₁ 
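The loss and regularization terms above translate directly into code; a minimal sketch (the values of w, y, ŷ, and λ are hypothetical):

```python
import numpy as np

def square_loss(y, y_hat):
    """Square loss: (y - y_hat)^2."""
    return (y - y_hat) ** 2

def logistic_loss(y, y_hat):
    """Logistic loss; y in {0, 1}, y_hat is the raw score (not a probability)."""
    return y * np.log1p(np.exp(-y_hat)) + (1 - y) * np.log1p(np.exp(y_hat))

def l2_penalty(w, lam):
    """L2 regularization: lam * ||w||^2."""
    return lam * np.sum(w ** 2)

def l1_penalty(w, lam):
    """L1 (lasso) regularization: lam * ||w||_1."""
    return lam * np.sum(np.abs(w))

w = np.array([1.0, -2.0])
obj = square_loss(3.0, 2.5) + l2_penalty(w, lam=0.1)  # Obj = L + Omega
print(obj)  # 0.25 + 0.5 = 0.75
```

Note `np.log1p` keeps the logistic loss numerically stable when the score is near zero.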
 
 
Putting known knowledge into context 
• Ridge regression: 
 Linear model, square loss, L2 regularization 
• Lasso: 
 Linear model, square loss, L1 regularization 
• Logistic regression:  
 Linear model, logistic loss, L2 regularization 
• The conceptual separation between model, parameters, and 
objective also gives you engineering benefits. 
 Think of how you can implement SGD once and use it for both 
ridge regression and logistic regression 
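A minimal illustration of that engineering benefit: one generic SGD loop that takes the per-example loss gradient as a plug-in, so ridge regression and logistic regression share the same training code (the toy data and hyperparameters below are illustrative, not from the slides):

```python
import numpy as np

def sgd(grad_fn, X, y, lam, lr=0.1, epochs=100, seed=0):
    """Generic SGD: only the per-example loss gradient differs by model;
    the L2 regularization gradient 2*lam*w is shared."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            w -= lr * (grad_fn(w, X[i], y[i]) + 2 * lam * w)
    return w

def ridge_grad(w, x, y):
    """Gradient of square loss: d/dw (y - w.x)^2 = -2 (y - w.x) x."""
    return -2 * (y - w @ x) * x

def logistic_grad(w, x, y):
    """Gradient of logistic loss, y in {0, 1}: (sigmoid(w.x) - y) x."""
    return (1.0 / (1.0 + np.exp(-(w @ x))) - y) * x

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w_ridge = sgd(ridge_grad, X, np.array([1.0, 2.0, 3.0]), lam=0.01)
w_logit = sgd(logistic_grad, X, np.array([0.0, 1.0, 1.0]), lam=0.01)
```

Swapping the gradient function is the only change between the two models; the loop, regularization, and schedule are reused as-is.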
 
 
Objective and Bias Variance Trade-off 
 Obj(Θ) = L(Θ) + Ω(Θ): the training loss measures how well the 
model fits the training data; the regularization term measures 
the complexity of the model 
• Why do we want two components in the objective? 
• Optimizing the training loss encourages predictive models 
 Fitting the training data well at least gets you close to the 
training distribution, which is hopefully close to the 
underlying distribution 
• Optimizing the regularization term encourages simple models 
 Simpler models tend to have smaller variance in future 
predictions, making predictions stable 
Outline 
 
• Review of key concepts of supervised learning 
 
• Regression Tree and Ensemble (What are we Learning) 
 
• Gradient Boosting (How do we Learn) 
 
• Summary  
 
Regression Tree (CART) 
• Regression tree (also known as classification and regression 
tree, CART): 
 Decision rules are the same as in a decision tree 
 Contains one score in each leaf 
 Example: does the person like computer games? 
(inputs: age, gender, occupation, …) 
 The tree first splits on age < 15, then on "is male?"; each 
leaf holds a prediction score: +2 (age < 15 and male), 
+0.1 (age < 15 and female), −1 (age ≥ 15) 
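The example tree can be written directly as code, with one score per leaf rather than a class label (the function name and argument types are illustrative):

```python
def predict_likes_games(age, is_male):
    """The example regression tree: split on age < 15, then on gender.
    Each leaf holds a real-valued prediction score, not a class label."""
    if age < 15:
        return 2.0 if is_male else 0.1
    return -1.0

print(predict_likes_games(age=10, is_male=True))   # 2.0
print(predict_likes_games(age=10, is_male=False))  # 0.1
print(predict_likes_games(age=40, is_male=True))   # -1.0
```

Because leaves carry scores instead of labels, the outputs of several such trees can later be summed, which is what makes tree ensembles and gradient boosting possible.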