Open Journal of Statistics, 2017, 7, 859-875
http://www.scirp.org/journal/ojs
ISSN Online: 2161-7198
ISSN Print: 2161-718X
Using Boosted Regression Trees and Remotely
Sensed Data to Drive Decision-Making
Brigitte Colin, Samuel Clifford, Paul Wu, Samuel Rathmanner, Kerrie Mengersen
School of Mathematical Sciences, Queensland University of Technology, Brisbane, Australia
How to cite this paper: Colin, B., Clifford,
S., Wu, P., Rathmanner, S. and Mengersen,
K. (2017) Using Boosted Regression Trees
and Remotely Sensed Data to Drive Deci-
sion-Making. Open Journal of Statistics, 7,
859-875.
https://doi.org/10.4236/ojs.2017.75061
Received: September 27, 2017
Accepted: October 28, 2017
Published: October 31, 2017
Copyright © 2017 by authors and
Scientific Research Publishing Inc.
This work is licensed under the Creative
Commons Attribution International
License (CC BY 4.0).
http://creativecommons.org/licenses/by/4.0/
Open Access
Abstract
Challenges in Big Data analysis arise due to the way the data are recorded,
maintained, processed and stored. We demonstrate that a hierarchical, multivariate,
statistical machine learning algorithm, namely the Boosted Regression
Tree (BRT), can address Big Data challenges to drive decision making. The
challenge of this study is a lack of interoperability since the data, a collection of
GIS shapefiles, remotely sensed imagery, and aggregated and interpolated
spatio-temporal information, are stored in monolithic hardware components.
For the modelling process, it was necessary to create one common input file.
By merging the data sources together, a structured but noisy input file, show-
ing inconsistencies and redundancies, was created. Here, it is shown that BRT
can process different data granularities, heterogeneous data and missingness.
In particular, BRT has the advantage of dealing with missing data by default
by allowing a split on whether or not a value is missing as well as what the
value is. Most importantly, the BRT offers a wide range of possibilities regarding
the interpretation of results, and variable selection is performed automatically
by considering how frequently a variable is used to define a split in
the tree. A comparison with two similar regression models (Random Forests
and Least Absolute Shrinkage and Selection Operator, LASSO) shows that
BRT outperforms these in this instance. BRT can also be a starting point for
sophisticated hierarchical modelling in real world scenarios. For example, a
single or ensemble approach of BRT could be tested with existing models in
order to improve results for a wide range of data-driven decisions and appli-
cations.
Keywords
Boosted Regression Trees, Remotely Sensed Data, Big Data Modelling Approach,
Missing Data
1. Background
Data are typically stored in various ways and various formats, mostly in mono-
lithic software architectures which do not allow for interoperability. Analysis of
data across multiple data sources is thus difficult, since within each source the
functionality for input and output, maintenance, data processing, error handling
and user interface is interwoven, while the sources themselves act as
architecturally separate components. In order to create a basis for analysing the data
considered here, it was required to extract the datasets from their original data-
bases and combine them to form a common input file for the modelling process.
It was therefore inevitable that this resulted in a data file structure which showed
missing data, inconsistencies, duplicates and redundancies.
A case study is presented here to examine land use data sourced from a Geographic
Information System (GIS), direct observations from an agricultural company, and
remotely sensed data. The data were extracted from a relational database, Excel
spreadsheets, remotely sensed imagery stored as raster data, vector data from the
GIS, data observed and measured directly in real time, and interpolated data.
By combining these data sources to form one common basis for
our analysis, issues of data volume, variety and veracity were encountered. Big
Data research clearly deals with issues beyond volume and belongs not only to
the ongoing digital revolution, but to the scientific revolution as well. The question
posed of Big Data, and illustrated in the case study presented here, is whether
new knowledge can be extracted from various data sources that have not been
analysed in combination before, and can thus assist in better and more confident
decision making.
2. Introduction
There is rapidly growing interest in the use of digital data to improve
decision making in a range of areas such as human systems, urban environ-
ments, agriculture and national security. For example, decisions in the agricul-
tural domain may require information based on vegetation or land use change,
estimation of crops or biomass, distribution of native or exotic species, livestock
or weed assessment and so on. One source of digital data that has generated in-
tense interest over the past decades is remotely sensed imagery. These data are
available from a wide range of sources, ranging from satellites to drones, and
have been used for a very wide range of environmental applications [1]-[8].
The availability and resolution of these data, combined with improved com-
puter storage and data management facilities, have greatly increased the oppor-
tunity for mathematicians and statisticians to utilise this information in their
models and analyses. The challenge in linking remotely sensed data to decision-
making is that there are multiple steps in the process. Here, we focus on an ex-
emplar real-world problem in the livestock industry: deciding on the allocation
of animals to different paddocks and potentially different grazing properties
based on the predicted availability of grass over the year. This problem arose in
the context of collaboration between statisticians at the Queensland University
of Technology and a large livestock organisation in Australia. The specific aim of
the project was to develop an ensemble of models to predict the carrying capaci-
ty, that is, the number of animals that can be sustained on a paddock. In order to
achieve this goal we utilised remote sensing data and supporting information
about climate and paddock characteristics. Further, it was important to present
the results in a form that is useful for the agricultural decision makers.
Difficult or challenging decisions demand thorough consideration, and even
then they involve uncertainty, complexity and different levels of risk. Making the
right decisions at the right time can lead to success, increased profit or
minimisation of risk. It is thus important that careful thought is put into
each decision. Figure 1 shows the workflow following a Big Data approach
for our case study. Here, we use structured but heterogeneous data
sources that exhibit characteristics such as missing data, noise and redundancies.
All the data sources were used to create a BRT model via an ensemble approach.
The resulting model and its output serve as a foundation for better decision
making. The steps involved in the process are depicted in Figure 1. Due to
commercial confidentiality concerns, the final results of the modelling workflow
are not presented here.
In this article we focus on one component of the ensemble modelling ap-
proach employed in the project, namely the use of BRT to estimate so-called
animal equivalents per paddock. Since calves, cows and bulls of different ages
consume different amounts of grass, these animals are standardised to a refer-
ence animal which can then be used as a common response variable in the anal-
ysis. An interesting conundrum is that one of the major inputs into such a model
is the amount of grass, or more generally the biomass, in a paddock. This can
potentially be estimated directly from remote sensing, but is confounded by the
fact that animals are on the paddock eating the very thing that is being measured
by the sensor. Moreover, the decision maker may be interested in the biomass
estimates themselves, either directly via the remotely sensed measurements or
indirectly via the animal equivalents based on animal weight and metabolic
formula.
BRT is a popular statistical and machine learning approach that has not yet
seen much application in the analysis of remotely sensed data. Indeed, although
it was first defined two decades ago, BRT has only recently been extended to
deal with the types of features that are characteristic of remotely sensed data, in
particular its spatial and temporal dynamics. Most of the activity around the use
of BRT for agricultural and environmental applications does not appear in the
mainstream mathematical and statistical literature.
Figure 1. Modelling process for case study.
2.1. Case Study
The study area is located in the Northern Territory, Australia. The main climate
zone is identified as grassland with hot dry summers and mild winters [9]. It is a
heterogeneous region with complex topography and varied land cover and
grassland types. Identification, differentiation and quantitative estimation of biomass
are of primary interest in this case study. A range of data from different sources
was required for this problem. In this section, we describe the information derived
from Landsat imagery and comment briefly on other data. The reflectance
recorded by the Landsat sensor is stored as an 8-bit value, resulting in a scale of
256 different grey values ranging from black (0, maximum absorption) to white
(255, maximum reflection). The electronically recorded data appear as an array of
numbers in digital format. In addition to the 8-bit quantisation, Landsat offers several
spectral bands in the visible and infrared regions of the electromagnetic spectrum,
and each individual pixel shows different values across different bands. This means
that each pixel is multi-dimensional, holding one value per band, and is therefore
represented differently in each spectral band. Raster data are becoming increasingly
common and increasingly large in volume, although it is possible to reduce file size
with compression functions.
There is a strong advantage in using remotely sensed Landsat imagery and
applied spectroscopy for these types of analyses because the data are freely
available, the imagery covers a wide geographical range, and it avoids expensive,
extensive and often impractical in-situ measurement. However, the trade-off is
in resolution: in-situ measurements provide highly localised accuracy whereas a
pixel in a Landsat image covers an area of 30 × 30 meters. It is noted that other
satellites are now able to provide higher resolution, but these are not yet freely
available for the areas of interest in this case study.
Estimation of biomass using satellite data is of ongoing global interest. Grass
biomass estimation is challenging since the phenological growing cycle of natu-
rally existing grass is a dynamic process influenced by many complex parame-
ters, including grass type, soil, climate, topography and land use. With the spec-
tral information of remotely sensed imagery it is possible to detect green vegeta-
tion, which is driven by the photosynthetic biochemical process of grass bio-
mass. However, since raster imagery is only a two dimensional representation of
the land cover it is difficult to derive the quantity of the vertical grass biomass
directly.
Fractional cover [10] data are often available as derived products. For example,
Geoscience Australia (GA) provides an Australian Reflectance Grid 25
(ARG25) product, which gives a 25 meter scale fractional cover representation of
the underlying vegetation across Australia, and TERN AusCover provides a 30 meter
resolution product based on Landsat 5 and 7 covering the period 2000-2011. Fractional
cover unmixing algorithms use the spectral reflectance of a Landsat scene for a
pixel to break it into three fractions represented as percentage values. These are
photosynthetic vegetation (includes leaves and grass), non-photosynthetic vege-
tation (includes branches, dry grass, and dead leaf litter) and bare surface cover
(bare soil or rocks) [11].
In addition to fractional cover, Vegetation Indices (VIs) are commonly used to
extract meaningful information from the imagery through image analysis techniques.
To calculate VIs it is common to apply arithmetic operations to the existing spectral
bands of the imagery in order to create additional artificial channels.
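The band arithmetic behind a vegetation index can be sketched with the Normalised Difference Vegetation Index (NDVI), a widely used VI computed from the near-infrared and red bands. NDVI is chosen here purely for illustration; the function names and the 2 × 2 pixel values are hypothetical, and the sketch works directly on 8-bit digital numbers although in practice the index is usually computed from calibrated reflectance.

```python
def ndvi(nir, red):
    """NDVI for one pixel: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    if total == 0:
        return None  # index is undefined when both bands are zero
    return (nir - red) / total


def ndvi_band(nir_band, red_band):
    """Apply NDVI pixel by pixel to two equally sized bands (lists of rows)."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Hypothetical 2 x 2 excerpt of two spectral bands (values 0-255).
nir_band = [[200, 120], [60, 0]]
red_band = [[50, 120], [180, 0]]

# The result is a new artificial channel with values between -1 and 1.
vi = ndvi_band(nir_band, red_band)
```

High positive values indicate green vegetation, values near zero indicate little contrast between the bands, and the `None` entry marks a pixel for which the index cannot be computed.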
Other related data were also available to support the analyses. For example, SILO
(Scientific Information for Land Owners) is a database of historical climate
records for Australia. SILO provides daily datasets for a range of climate va-
riables and in formats suitable for a variety of applications. In addition, SILO
datasets are constructed from observational records provided by the Bureau of
Meteorology (BOM). As another example, the AussieGRASS spatial framework
includes inputs of key climate variables (rainfall, evaporation, temperature, va-
pour pressure and solar radiation), soil and pasture types, tree and shrub cover,
domestic livestock and other herbivore numbers. The derived results of Aussie-
GRASS data are spatially interpolated to construct gridded datasets on a regular
grid (approximately 5 × 5 km) across Australia [12] [13].
2.2. Data-Related Challenges
The analysis of relationships in ecological data sets is not trivial [14]. In addition
to the complexity of the processes being modelled, there is the challenge of deal-
ing with data dimensionality since it is often necessary to combine various data
sources. Moreover, the scale of spatial data needs to be considered when there
are differing granularities of spatial and temporal data. For example, SILO rainfall
data are reported on a 5 × 5 km grid, whereas a Landsat pixel covers an area
of 30 × 30 meters. The SILO data are stored in a tabular database format and the
single measurement points record the precipitation independently of each
other. In contrast, the derived VIs cover a whole Landsat scene of 185 × 185 km
and are highly correlated. All our environmental data were provided by
the Department of Science, Information Technology and Innovation (DSITI). In
addition to the environmental data we used operational data provided by a
commercial entity under a confidential agreement.
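The difference in granularity can be made concrete with a little arithmetic. The sketch below uses only the grid sizes quoted above; the helper names are hypothetical, and it assumes, as a simplification, that both grids are axis-aligned and share an origin in a projected coordinate system measured in metres.

```python
import math

LANDSAT_PIXEL_M = 30    # Landsat pixel edge length in metres
SILO_CELL_M = 5_000     # SILO grid cell edge length in metres
SCENE_M = 185_000       # Landsat scene edge length in metres

# How many Landsat pixels share one SILO rainfall value?
pixels_per_cell_edge = SILO_CELL_M / LANDSAT_PIXEL_M
pixels_per_cell = pixels_per_cell_edge ** 2

# How many pixels and SILO cells span one side of a scene?
pixels_per_scene_edge = SCENE_M / LANDSAT_PIXEL_M
cells_per_scene_edge = SCENE_M / SILO_CELL_M


def pixel_centre(col, row):
    """Centre of a Landsat pixel in metres from the shared grid origin."""
    return ((col + 0.5) * LANDSAT_PIXEL_M, (row + 0.5) * LANDSAT_PIXEL_M)


def silo_cell_index(x_m, y_m):
    """Column and row of the SILO cell that contains the point (x_m, y_m)."""
    return (math.floor(x_m / SILO_CELL_M), math.floor(y_m / SILO_CELL_M))
```

Under these assumptions roughly 28,000 Landsat pixels fall inside a single SILO cell, which is why the two sources have to be brought to a common unit, such as the paddock, before they can be combined.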
Another challenging characteristic of remotely sensed data is missing information.
There are two major considerations in dealing with this issue. The first
is how to handle the missing values. Common options are to filter them out [15]
[16], interpolate them or increase the spatial aggregation. There are advantages
and disadvantages to each of these approaches in terms of computational re-
sources, inferential capability, and precision and bias of the resultant estimates
[17]. The second consideration is whether to undertake the chosen method as
part of the pre-processing or post-processing steps.
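The three options named above (filtering, interpolation and coarser aggregation) can be compared on a small example. This is a minimal one-dimensional sketch with hypothetical function names and values; missing observations are encoded as `None`, and aggregating over consecutive blocks stands in for increasing the spatial aggregation.

```python
def filter_missing(values):
    """Option 1: drop missing observations."""
    return [v for v in values if v is not None]


def interpolate_linear(values):
    """Option 2: fill interior gaps linearly between observed neighbours.

    Gaps at the start or end of the series are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def aggregate_blocks(values, block):
    """Option 3: average the observed values within coarser blocks."""
    means = []
    for start in range(0, len(values), block):
        observed = filter_missing(values[start:start + block])
        means.append(sum(observed) / len(observed) if observed else None)
    return means


# Hypothetical series of pixel values with gaps, e.g. caused by cloud cover.
series = [0.2, None, 0.4, None, None, 1.0]

filtered = filter_missing(series)
interpolated = interpolate_linear(series)
aggregated = aggregate_blocks(series, block=3)
```

Filtering shortens the series, interpolation keeps its length but introduces modelled values, and aggregation trades resolution for completeness, which mirrors the trade-offs in inferential capability, precision and bias noted above.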
For our case study we performed a number of pre-processing steps to prepare
our data for the modelling process, namely data aggregation and data reduction
for our predictor variables, as well as calculation of the response variable. Instead
of working with single pixel values we reduced the volume of data by deriving
descriptive statistics from Landsat, MODIS and SILO data, thereby obtaining
paddock-specific means, medians, first and third quartiles, variances and
Shannon entropy. With respect to our response variable, we aggregated real-time
measurements to a monthly mean. In the next step we created a test and a
training data set by partitioning the data into 20% and 80% respectively. The
training set was used to estimate the model parameters. The test set was used for
model performance evaluation on unseen data.
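A minimal sketch of these pre-processing steps is given below. It is not the code used in the study: the function names and pixel values are hypothetical, the Shannon entropy is assumed to be computed over the empirical distribution of discrete pixel values, the quartile method and random seed are arbitrary choices, and the 209 rows simply mirror the number of observations reported for the case study.

```python
import math
import random
import statistics
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy in bits of the empirical distribution of the values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def paddock_summary(pixels):
    """Reduce all pixel values of one paddock to six descriptive statistics."""
    q1, _, q3 = statistics.quantiles(pixels, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(pixels),
        "median": statistics.median(pixels),
        "q1": q1,
        "q3": q3,
        "variance": statistics.variance(pixels),  # sample variance
        "entropy": shannon_entropy(pixels),
    }


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical pixel values for one paddock.
pixels = [10, 10, 20, 20, 30, 30, 40, 40]
summary = paddock_summary(pixels)

# Hypothetical row indices standing in for the 209 observations.
train, test = train_test_split(range(209))
```

Each paddock is thus represented by a handful of numbers per data source rather than by thousands of pixels, and the held-out 20% of rows is never used when estimating the model parameters.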
3. Boosted Regression Trees
Boosted Regression Trees (BRT), also known as Gradient Boosted Machine
(GBM) or Stochastic Gradient Boosting (SGB), are non-parametric regression
techniques that combine a regression tree with a boosting algorithm [18]. This
extension to the classical regression tree allows greater flexibility and predictive
performance in modelling the data. The implementation of these methods used
in this study can be found in the gbm R package.
A regression tree partitions the data with a hierarchy of binary splits that de-
fine regions of the covariate space in which the response variable has similar
values. These splits are defined by rules, distance metrics or information gain.
The choice of variables and the values at which the split points occur are determined
in a recursive manner at each stage of the tree construction. The segmentation
can be depicted as a tree-like structure, comprising nodes representing the
selected factors, branches acting as if-else connectors between the nodes, and
leaves representing terminal nodes containing the subsets of responses [19] [16]
[20].
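The search for a single binary split can be sketched as follows, assuming a least-squares criterion on one numeric predictor. The function names and the rainfall and biomass values are hypothetical; a full regression tree would apply the same search recursively to each resulting subset and over all predictors.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Find the threshold t for the rule x <= t that minimises the total SSE.

    Returns (threshold, total_sse); the threshold is None if no split
    improves on leaving the data unsplit.
    """
    pairs = sorted(zip(x, y))
    best = (None, sse(y))
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold separates identical predictor values
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [target for _, target in pairs[:k]]
        right = [target for _, target in pairs[k:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical paddock data: rainfall (predictor) and biomass (response).
rain = [5, 12, 20, 35, 50, 80]
biomass = [1.0, 1.2, 1.1, 3.0, 3.2, 2.9]
threshold, error = best_split(rain, biomass)
```

The selected rule separates the three low-biomass paddocks from the three high-biomass ones, so the response is nearly constant within each of the two resulting regions.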
Boosting improves the performance of a simple base-learner by reweighting
observations that were misclassified or had large residual errors in the previous
iteration. The deeper we grow the tree, the more segments we can accommodate
and thus more variance can be explained. This results in higher model complex-
ity and therefore higher risk of overfitting the model to the data.
The motivation behind Boosting is that each tree can be quite shallow (a weak
classifier) and thus fast to estimate, but by combining the predictive power of
many weak classifiers, a classifier of arbitrary accuracy and precision can be
created [21] [22] [23].
Gradient Boosting
In this section we give a brief summary of the method, following Friedman [18].
This supervised machine learning approach deals with a response variable $y$ and
a vector of predictor variables $\mathbf{x}$ that are connected via a joint probability
distribution $P(\mathbf{x}, y)$. Using a training sample $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$ of known
values of $\mathbf{x}$ and corresponding values of $y$, the goal is to find an approximation
$F(\mathbf{x})$ to a function $F^*(\mathbf{x})$ that minimises the expected value of a loss function
$\psi(y, F(\mathbf{x}))$, i.e.

$$F^*(\mathbf{x}) = \arg\min_{F(\mathbf{x})} E_{y,\mathbf{x}}\, \psi\bigl(y, F(\mathbf{x})\bigr). \tag{1}$$
Boosting approximates $F^*(\mathbf{x})$ by an “additive” expansion in the form of

$$F(\mathbf{x}) = \sum_{m=0}^{M} \beta_m h(\mathbf{x}; \mathbf{a}_m), \tag{2}$$
where the functions $h(\mathbf{x}; \mathbf{a})$ are generally simple functions of $\mathbf{x}$ with
parameters $\mathbf{a} = \{a_1, a_2, \ldots\}$. The parameters $\{\mathbf{a}_m\}_0^M$ and the expansion coefficients
$\{\beta_m\}_0^M$ are jointly fit to the training data. This is done in a forward stage-wise
manner. Gradient Boosting [18] approximately solves this fitting problem for
differentiable loss functions $\psi(y, F(\mathbf{x}))$ with a two step procedure. First, the function
$h(\mathbf{x}; \mathbf{a})$ is fit by least squares to the current “pseudo”-residuals

$$\tilde{y}_{im} = -\left[\frac{\partial \psi\bigl(y_i, F(\mathbf{x}_i)\bigr)}{\partial F(\mathbf{x}_i)}\right]_{F(\mathbf{x}) = F_{m-1}(\mathbf{x})} \tag{3}$$

which represent the residuals from the given stage of the tree building.
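As a worked instance of Equation (3), consider two common loss functions. These are illustrative assumptions only, since the loss function used in the case study is not stated here. Differentiating each loss with respect to $F$ and evaluating at $F_{m-1}$ gives:

```latex
% Squared-error loss: the pseudo-residual is the ordinary residual.
\psi(y, F) = \tfrac{1}{2}(y - F)^2
\;\Rightarrow\;
\frac{\partial \psi}{\partial F} = -(y - F)
\;\Rightarrow\;
\tilde{y}_{im} = y_i - F_{m-1}(\mathbf{x}_i)

% Absolute-error loss: the pseudo-residual is the sign of the residual.
\psi(y, F) = |y - F|
\;\Rightarrow\;
\tilde{y}_{im} = \operatorname{sign}\bigl(y_i - F_{m-1}(\mathbf{x}_i)\bigr)
```

Under squared error the pseudo-residuals therefore coincide with the usual residuals, whereas under absolute error only their direction is retained, which makes the fit less sensitive to outliers.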
Then, given $h(\mathbf{x}; \mathbf{a}_m)$, the optimal value of the coefficient $\beta_m$ is calculated
via

$$\beta_m = \arg\min_{\beta} \sum_{i=1}^{N} \psi\bigl(y_i, F_{m-1}(\mathbf{x}_i) + \beta h(\mathbf{x}_i; \mathbf{a}_m)\bigr). \tag{4}$$
Gradient Tree Boosting performs this with a base learner $h(\mathbf{x}; \mathbf{a})$ of an $L$
terminal node regression tree. At each iteration $m$, a regression tree partitions the
feature space into $L$ disjoint regions $\{R_{lm}\}_{l=1}^{L}$ and predicts a separate constant
value in each region:

$$h\bigl(\mathbf{x}; \{R_{lm}\}_1^L\bigr) = \sum_{l=1}^{L} \bar{y}_{lm}\, 1(\mathbf{x} \in R_{lm}). \tag{5}$$
The parameters of the base learner are the splitting variables and corresponding
split points that define the tree, and this defines the corresponding regions
$\{R_{lm}\}_1^L$ of the partition at each iteration. These are determined in a top-down
“best-first” approach using a least squares splitting measure [18]. Equation (4) can
be solved individually within each region $R_{lm}$ defined by the corresponding
terminal node $l$ of the $m$th tree. Because the tree in Equation (5) predicts a
constant value $\bar{y}_{lm}$ within each region $R_{lm}$, the solution to Equation (4) reduces
to a simple location estimate based on the criterion $\psi$:

$$\gamma_{lm} = \arg\min_{\gamma} \sum_{\mathbf{x}_i \in R_{lm}} \psi\bigl(y_i, F_{m-1}(\mathbf{x}_i) + \gamma\bigr). \tag{6}$$
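For intuition, the location estimate in Equation (6) has a closed form for some familiar loss functions. The two cases below are standard results given purely as illustration, not as the loss used in the study.

```latex
% Squared-error loss: the minimiser is the mean residual in the region.
\gamma_{lm} = \operatorname{mean}_{\mathbf{x}_i \in R_{lm}}
  \bigl(y_i - F_{m-1}(\mathbf{x}_i)\bigr)

% Absolute-error loss: the minimiser is the median residual in the region.
\gamma_{lm} = \operatorname{median}_{\mathbf{x}_i \in R_{lm}}
  \bigl(y_i - F_{m-1}(\mathbf{x}_i)\bigr)
```

In both cases the update for a terminal node depends only on the observations that fall into that node, which is what allows each region to be solved independently.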
Next, the current approximation $F_{m-1}(\mathbf{x})$ is individually updated in all of
the corresponding regions

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \nu \cdot \gamma_{lm}\, 1(\mathbf{x} \in R_{lm}). \tag{7}$$
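The procedure in Equations (3) to (7) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the gbm implementation used in the study: it assumes squared-error loss, a single numeric predictor and two-leaf trees (L = 2), and all function and parameter names are hypothetical. The `shrinkage` argument plays the role of ν in Equation (7), which in Friedman's formulation is a learning-rate parameter between 0 and 1, and the optional `subsample` argument draws a random subset of the training data without replacement at each iteration.

```python
import random


def fit_stump(x, residuals):
    """Fit a two-leaf regression tree (L = 2) to residuals by least squares.

    Returns (threshold, left_value, right_value). Under squared-error loss the
    leaf values are the mean residuals, i.e. the gamma_lm of Equation (6).
    """
    pairs = sorted(zip(x, residuals))
    n = len(pairs)
    mean_all = sum(r for _, r in pairs) / n
    best_sse = sum((r - mean_all) ** 2 for _, r in pairs)
    best = (float("inf"), mean_all, mean_all)  # fallback: no useful split
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between identical predictor values
        left = [r for _, r in pairs[:k]]
        right = [r for _, r in pairs[k:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if sse < best_sse:
            best_sse = sse
            best = ((pairs[k - 1][0] + pairs[k][0]) / 2, left_mean, right_mean)
    return best


def boost(x, y, n_trees=50, shrinkage=0.1, subsample=1.0, seed=0):
    """Gradient tree boosting with stumps under squared-error loss."""
    rng = random.Random(seed)
    n = len(y)
    f0 = sum(y) / n                      # F_0: best constant prediction
    fitted = [f0] * n
    stumps = []
    n_sub = max(2, round(subsample * n))
    for _ in range(n_trees):
        # Optional random subsample without replacement (stochastic variant).
        idx = rng.sample(range(n), n_sub) if n_sub < n else list(range(n))
        # Pseudo-residuals of Equation (3): y - F_{m-1}(x) for squared error.
        resid = [y[i] - fitted[i] for i in idx]
        threshold, left, right = fit_stump([x[i] for i in idx], resid)
        stumps.append((threshold, left, right))
        # Update of Equation (7), applied to every training point.
        for i in range(n):
            fitted[i] += shrinkage * (left if x[i] <= threshold else right)
    return f0, shrinkage, stumps


def predict(model, x_new):
    """Evaluate the additive expansion at a single predictor value."""
    f0, shrinkage, stumps = model
    return f0 + shrinkage * sum(
        left if x_new <= threshold else right
        for threshold, left, right in stumps
    )


# Hypothetical toy data: a step in the response at x = 10.
x = list(range(20))
y = [1.0 if xi < 10 else 3.0 for xi in x]
model = boost(x, y)
```

On this toy step function each stump removes a fixed fraction of the remaining residual, so `predict(model, 3)` and `predict(model, 15)` move steadily towards 1 and 3 as trees are added. A smaller `shrinkage` slows this convergence and typically requires more trees.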
Friedman [18] added a stochastic element to the above boosting algorithm by
proposing to draw a random subsample from the full training data set without
replacement. This subsample is then used to fit the base learner and compute the
model update for the current iteration. By adding randomness to the algorithm
the performance of gradient boosting was improved and this resulted in the
stochastic Gradient Boosting Machine (GBM) [23]. The Stochastic Gradient Boosting
algorithm is summarised as pseudo code below [15] [23]. The input training
data are defined through $\{y_i, \mathbf{x}_i\}_1^N$, and $\{\pi(i)\}_1^N$ is the random permutation of
the integers $1, \ldots, N$. The random subsample of size $\tilde{N} < N$ is given by
$\{y_{\pi(i)}, \mathbf{x}_{\pi(i)}\}_1^{\tilde{N}}$.
4. Results
The data were presented as a set $\{(\mathbf{x}_i, y_i)\}_{0 \le i < n_{\text{samples}}}$ with feature vector
$\mathbf{x}_i \in \mathbb{R}^{n_{\text{features}}}$ and the response $y_i \in \mathbb{R}$. All the data we used for our case study
were combined into a structured comma-separated values (CSV) file that consisted
of 209 observations and 141 covariates. The machine-friendly names of
our covariates are generated in the following manner. There are in total 5 different
components for creating the covariate names. The first shows whether the
calculated summary statistics are for monthly values of EOLW/D = end of last
wet/dry, or WS = wet season; these are then followed by whether it is an aggre-
gated mean, minimum red or maximum monthly values, followed by the nature
of the descriptive statistic: first quartile, median, mean, third quartile, variance
and Shannon Entropy; next comes the name of data source (e.g. rain = SILO da-
ta), and lastly the corresponding area in proximity to water (3 km, 5 km, 99 km
= whole paddock). The covariate name paha.99km/5km stores values for the
paddock area measured in hectares, together with the proximity to water, e.g. a 5 km
radius, or 99 km for the whole extent of the paddock. As described in 2.2, the data
set was partitioned by treating 80% as training data and the remaining 20% as