Using Boosted Regression Trees and Remotely Sensed Data to Drive Decision-Making
Abstract
Keywords
1. Background
2. Introduction
2.1. Case Study
2.2. Data-Related Challenges
3. Boosted Regression Trees
Gradient Boosting
4. Results
5. Discussion
Acknowledgements
References
Open Journal of Statistics, 2017, 7, 859-875
http://www.scirp.org/journal/ojs
ISSN Online: 2161-7198, ISSN Print: 2161-718X

Using Boosted Regression Trees and Remotely Sensed Data to Drive Decision-Making

Brigitte Colin, Samuel Clifford, Paul Wu, Samuel Rathmanner, Kerrie Mengersen
School of Mathematical Sciences, Queensland University of Technology, Brisbane, Australia

How to cite this paper: Colin, B., Clifford, S., Wu, P., Rathmanner, S. and Mengersen, K. (2017) Using Boosted Regression Trees and Remotely Sensed Data to Drive Decision-Making. Open Journal of Statistics, 7, 859-875. https://doi.org/10.4236/ojs.2017.75061

Received: September 27, 2017; Accepted: October 28, 2017; Published: October 31, 2017

Copyright © 2017 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). http://creativecommons.org/licenses/by/4.0/ Open Access

Abstract

Challenges in Big Data analysis arise due to the way the data are recorded, maintained, processed and stored. We demonstrate that a hierarchical, multivariate, statistical machine learning algorithm, namely the Boosted Regression Tree (BRT), can address Big Data challenges to drive decision making. The challenge of this study is a lack of interoperability, since the data, a collection of GIS shapefiles, remotely sensed imagery, and aggregated and interpolated spatio-temporal information, are stored in monolithic hardware components. For the modelling process it was necessary to create one common input file. Merging the data sources produced a structured but noisy input file containing inconsistencies and redundancies. Here, it is shown that BRT can process different data granularities, heterogeneous data and missingness. In particular, BRT has the advantage of dealing with missing data by default, by allowing a split on whether or not a value is missing as well as on what the value is. Most importantly, BRT offers a wide range of possibilities for interpreting results, and variable selection is performed automatically by considering how frequently a variable is used to define a split in the tree. A comparison with two similar regression models (Random Forests and the Least Absolute Shrinkage and Selection Operator, LASSO) shows that BRT outperforms these in this instance. BRT can also be a starting point for sophisticated hierarchical modelling in real-world scenarios. For example, a single or ensemble approach of BRT could be tested with existing models in order to improve results for a wide range of data-driven decisions and applications.

Keywords

Boosted Regression Trees, Remotely Sensed Data, Big Data Modelling Approach, Missing Data
1. Background

Data are typically stored in various ways and formats, mostly in monolithic software architectures which do not allow for interoperability. Analysis of data across multiple data sources is thus difficult, since the functionality of each individual data source with respect to input and output, maintenance, data processing, error handling and user interface is interwoven, and each source acts as an architecturally separate component. In order to create a basis for analysing the data considered here, it was necessary to extract the datasets from their original databases and combine them into a common input file for the modelling process. Inevitably, this resulted in a data file structure containing missing data, inconsistencies, duplicates and redundancies.

A case study is presented here to examine land use data sourced from a GIS, direct observations from an agricultural company, and remotely sensed data. The data were extracted from a relational database, Excel spreadsheets, remotely sensed imagery stored as raster data, vector data from a Geographic Information System (GIS), directly observed and measured data recorded in real time, and interpolated data. By combining these data sources into one common basis for our analysis, issues of data volume, variety and veracity were encountered. Big Data research clearly deals with issues beyond volume and belongs not only to the ongoing digital revolution but to the scientific revolution as well. The question posed of Big Data, and illustrated in the case study presented here, is whether new knowledge can be extracted from various data sources that have not been analysed in combination before, and can thus assist in better and more confident decision making.

2. Introduction

There is an exponential increase in interest in the use of digital data to improve decision making in a range of areas such as human systems, urban environments, agriculture and national security. For example, decisions in the agricultural domain may require information on vegetation or land use change, estimation of crops or biomass, distribution of native or exotic species, livestock or weed assessment, and so on. One source of digital data that has generated intense interest over the past decades is remotely sensed imagery. These data are available from a wide range of sources, ranging from satellites to drones, and have been used for a very wide range of environmental applications [1]-[8].

The availability and resolution of these data, combined with improved computer storage and data management facilities, have greatly increased the opportunity for mathematicians and statisticians to utilise this information in their models and analyses. The challenge in linking remotely sensed data to decision-making is that there are multiple steps in the process. Here, we focus on an exemplar real-world problem in the livestock industry: deciding on the allocation of animals to different paddocks, and potentially different grazing properties, based on the predicted availability of grass over the year. This problem arose in
the context of a collaboration between statisticians at the Queensland University of Technology and a large livestock organisation in Australia. The specific aim of the project was to develop an ensemble of models to predict the carrying capacity, that is, the number of animals that can be sustained on a paddock. To achieve this goal we utilised remote sensing data and supporting information about climate and paddock characteristics. Further, it was important to present the results in a form that is useful for agricultural decision makers.

Difficult or challenging decisions demand thorough consideration, and even then they involve uncertainty, complexity and different levels of risk. Making the right decisions at the right time can lead to success, increased profit or minimised risk. It is thus important that careful thought goes into each decision. Figure 1 demonstrates the workflow following a Big Data approach for our case study. Here, we use structured but heterogeneous data sources that exhibit characteristics such as missing data, noise and redundancies. All the data sources were used to create a BRT model via an ensemble approach. The resulting model and its output serve as a foundation for better decision making. The steps involved in the process are depicted in Figure 1. Due to commercial confidentiality concerns, the final results of the modelling workflow are not presented here.

Figure 1. Modelling process for case study.

In this article we focus on one component of the ensemble modelling approach employed in the project, namely the use of BRT to estimate so-called animal equivalents per paddock. Since calves, cows and bulls of different ages consume different amounts of grass, these animals are standardised to a reference animal which can then be used as a common response variable in the analysis. An interesting conundrum is that one of the major inputs into such a model is the amount of grass, or more generally the biomass, in a paddock. This can potentially be estimated directly from remote sensing, but is confounded by the fact that animals are on the paddock eating the very thing that is being measured by the sensor. Moreover, the decision maker may be interested in the biomass estimates themselves, either directly via the remotely sensed measurements or indirectly via the animal equivalents based on animal weight and a metabolic formula.
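The paper does not give the standardisation formula. As a hedged illustration only, the sketch below assumes the metabolic-liveweight convention that is common in the Australian grazing industry, where an animal's equivalence is taken proportional to liveweight^0.75 relative to a 450 kg reference animal; the function name, the reference weight and the example liveweights are assumptions, not values taken from the study.

```r
# Illustrative only: convert liveweights (kg) of individual animals into
# "animal equivalents" relative to a reference animal, using the common
# metabolic-weight convention (liveweight^0.75). The 450 kg reference is an
# assumption, not a value reported in the paper.
animal_equivalents <- function(liveweight_kg, reference_kg = 450) {
  liveweight_kg^0.75 / reference_kg^0.75
}

# Example: a paddock carrying two calves, three cows and one bull
weights <- c(180, 200, 420, 450, 470, 750)
sum(animal_equivalents(weights))  # total animal equivalents for the paddock
```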
A BRT is a popular statistical and machine learning approach that has not yet seen much application in the analysis of remotely sensed data. Indeed, although they were first defined two decades ago, BRT have only recently been extended to deal with the types of features that are characteristic of remotely sensed data, in particular their spatial and temporal dynamics. Most of the activity around the use of BRT for agricultural and environmental applications does not appear in the mainstream mathematical and statistical literature.

2.1. Case Study

The study area is located in the Northern Territory, Australia. The main climate zone is identified as grassland with hot dry summers and mild winters [9]. It is a heterogeneous region with complex topography, land cover and grassland types. Identification, differentiation and quantitative estimation of biomass is of primary interest in this case study. A range of data from different sources was required for this problem. In this section, we describe the information derived from Landsat imagery and comment briefly on other data.

The reflectance recorded by the Landsat sensor is stored as an 8 bit value, resulting in a scale of 256 different grey values ranging from black (0, maximum absorption) to white (255, maximum reflection). The electronically recorded data appear as an array of numbers in digital format. In addition to the 8 bit quantisation, Landsat offers several spectral bands across the electromagnetic spectrum, including the infrared, and each individual pixel shows different values across these bands. This means that each pixel is multi-dimensional and is represented by a different value in each spectral band. Raster data are becoming increasingly common and increasingly large in volume, although it is possible to reduce file size with compression functions. There is a strong advantage in using remotely sensed Landsat imagery and applied spectroscopy for these types of analyses because the data are freely available, the imagery covers a wide geographical range, and it avoids expensive, extensive and often impractical in-situ measurement. However, the trade-off is in resolution: in-situ measurements provide highly localised accuracy, whereas a pixel in a Landsat image covers an area of 30 × 30 meters. It is noted that other satellites are now able to provide higher resolution, but these are not yet freely available for the areas of interest in this case study.

Estimation of biomass using satellite data is of ongoing global interest. Grass biomass estimation is challenging since the phenological growing cycle of naturally occurring grass is a dynamic process influenced by many complex parameters, including grass type, soil, climate, topography and land use. With the spectral information of remotely sensed imagery it is possible to detect green vegetation, which is driven by the photosynthetic biochemical processes of grass biomass. However, since raster imagery is only a two-dimensional representation of the land cover, it is difficult to derive the quantity of the vertical grass biomass directly.

Fractional cover [10] data are often available as derived products; for example, Geoscience Australia (GA) provides the Australian Reflectance Grid 25 (ARG25) product, which gives a 25 meter scale fractional cover representation of underlying vegetation across Australia, and TERN AusCover provides a 30 meter resolution product derived from Landsat 5 and 7 covering the temporal extent 2000-2011. Fractional cover unmixing algorithms use the spectral reflectance of a Landsat scene to break each pixel into three fractions, represented as percentage values. These are photosynthetic vegetation (leaves and grass), non-photosynthetic vegetation (branches, dry grass and dead leaf litter) and bare surface cover (bare soil or rocks) [11].

In addition to fractional cover, Vegetation Indices (VI) are commonly used to extract meaningful information from the imagery through image analysis techniques. To calculate VIs it is common to apply arithmetic combinations of the existing spectral bands of the imagery in order to create additional artificial channels.
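As an illustration of such band arithmetic, the sketch below computes the widely used Normalised Difference Vegetation Index (NDVI) from the red and near-infrared bands. The paper does not state which indices were derived, so the choice of index and the toy reflectance values are assumptions.

```r
# Minimal sketch: a vegetation index as an arithmetic combination of bands.
# NDVI = (NIR - Red) / (NIR + Red); values near 1 indicate dense green
# vegetation, values near 0 indicate bare surfaces. The 3 x 3 reflectance
# matrices below are toy values, not Landsat data from the study.
red <- matrix(c(0.10, 0.12, 0.30,
                0.08, 0.11, 0.28,
                0.09, 0.10, 0.25), nrow = 3, byrow = TRUE)
nir <- matrix(c(0.45, 0.50, 0.32,
                0.48, 0.52, 0.30,
                0.47, 0.49, 0.27), nrow = 3, byrow = TRUE)

ndvi <- (nir - red) / (nir + red)  # one value per pixel, forming a new "channel"
round(ndvi, 2)
```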
Other related data were also available to support the analyses. For example, SILO (Scientific Information for Land Owners) is a database of historical climate records for Australia. SILO provides daily datasets for a range of climate variables in formats suitable for a variety of applications, and its datasets are constructed from observational records provided by the Bureau of Meteorology (BOM). As another example, the AussieGRASS spatial framework includes inputs of key climate variables (rainfall, evaporation, temperature, vapour pressure and solar radiation), soil and pasture types, tree and shrub cover, and domestic livestock and other herbivore numbers. The derived AussieGRASS results are spatially interpolated to construct gridded datasets on a regular grid (approximately 5 × 5 km) across Australia [12] [13].

2.2. Data-Related Challenges

The analysis of relationships in ecological data sets is not trivial [14]. In addition to the complexity of the processes being modelled, there is the challenge of dealing with data dimensionality, since it is often necessary to combine various data sources. Moreover, the scale of spatial data needs to be considered when there are differing granularities of spatial and temporal data. For example, SILO rainfall data are reported on a 5 × 5 km grid, whereas a Landsat pixel covers an area of 30 × 30 meters. The SILO data are stored in a tabular database format and the single measurement points record precipitation independently of each other. In contrast, the derived VI cover a whole Landsat scene of 185 × 185 km and are highly correlated. All our environmental data were provided by the Department of Science, Information Technology and Innovation (DSITI). In addition to the environmental data, we used operational data provided by a commercial entity under a confidentiality agreement.

Another challenging characteristic of remotely sensed data is missing information. There are two major considerations in dealing with this issue. The first is how to deal with the missing values. Common options are to filter them out [15] [16], interpolate them, or increase the spatial aggregation. There are advantages and disadvantages to each of these approaches in terms of computational resources, inferential capability, and the precision and bias of the resultant estimates [17]. The second consideration is whether to undertake the chosen method as part of the pre-processing or post-processing steps.

For our case study we performed a number of pre-processing steps to prepare our data for the modelling process, namely data aggregation and data reduction for our predictor variables, as well as calculation of the response variable. Instead of working with single pixel values, we reduced the volume of data by deriving descriptive statistics from the Landsat, MODIS and SILO data, thereby obtaining paddock-specific means, medians, first quartiles, third quartiles, variances and Shannon entropies. With respect to our response variable, we aggregated real-time measurements to a monthly mean. In the next step we created training and test data sets by partitioning the data into 80% and 20% respectively. The training set was used to estimate the model parameters; the test set was used to evaluate model performance on unseen data.
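As a hedged illustration of these pre-processing steps, the sketch below aggregates pixel values to paddock-level summary statistics (including a Shannon entropy computed on binned values) and then draws an 80/20 training/test partition. The column names, the binning choice and the simulated data are hypothetical, not the study's data.

```r
# Toy pixel-level data: one row per pixel, with a paddock identifier and a
# hypothetical spectral value (e.g. a VI). Names and values are illustrative.
set.seed(42)
pixels <- data.frame(
  paddock = rep(c("A", "B", "C"), each = 200),
  value   = c(rnorm(200, 0.4, 0.05), rnorm(200, 0.3, 0.10), rnorm(200, 0.5, 0.07))
)

# Shannon entropy of a continuous variable via binning (one of several
# reasonable conventions; the paper does not specify the binning used).
shannon_entropy <- function(x, breaks = 10) {
  p <- table(cut(x, breaks)) / length(x)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Paddock-level descriptive statistics, replacing the raw pixel values.
paddock_stats <- do.call(rbind, lapply(split(pixels$value, pixels$paddock), function(v) {
  data.frame(q1       = quantile(v, 0.25, names = FALSE),
             median   = median(v),
             mean     = mean(v),
             q3       = quantile(v, 0.75, names = FALSE),
             variance = var(v),
             entropy  = shannon_entropy(v))
}))

# 80/20 training/test partition of the aggregated records.
n     <- nrow(paddock_stats)
idx   <- sample(seq_len(n), size = round(0.8 * n))
train <- paddock_stats[idx, ]
test  <- paddock_stats[-idx, ]
```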
3. Boosted Regression Trees

Boosted Regression Trees (BRT), also known as Gradient Boosting Machines (GBM) or Stochastic Gradient Boosting (SGB), are non-parametric regression techniques that combine a regression tree with a boosting algorithm [18]. This extension of the classical regression tree allows greater flexibility and predictive performance in modelling the data. The implementation of these methods used in this study can be found in the gbm R package.

A regression tree partitions the data with a hierarchy of binary splits that define regions of the covariate space in which the response variable has similar values. These splits are defined by rules, distance metrics or information gain. The choice of variable and the value at which the split occurs are determined recursively at each stage of the tree construction. The segmentation can be depicted as a tree-like structure comprising nodes representing the selected factors, branches acting as if-else connectors between the nodes, and leaves representing terminal nodes containing the subsets of responses [19] [16] [20].

Boosting improves the performance of a simple base learner by reweighting observations that were misclassified or had large residual errors in the previous iteration. The deeper we grow a tree, the more segments it can accommodate and thus the more variance can be explained. This results in higher model complexity and therefore a higher risk of overfitting the model to the data. The motivation behind boosting is that each tree can be quite shallow (a weak classifier) and thus fast to estimate, but by combining the predictive power of many weak classifiers, a classifier of arbitrary accuracy and precision can be created [21] [22] [23].
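Before turning to the formal description, the sketch below shows how such a model can be fit with the gbm package mentioned above, on a hypothetical 80/20 split. The formula, column names and tuning values (number of trees, tree depth, shrinkage, bag fraction) are illustrative choices, not the settings used in the study.

```r
library(gbm)

# Hypothetical paddock-level data: a continuous response (e.g. animal
# equivalents) and a few aggregated covariates. Names and values are
# illustrative, not the study's 141 covariates.
set.seed(1)
n   <- 200
dat <- data.frame(
  rain_mean = rnorm(n, 50, 15),
  ndvi_med  = runif(n, 0.2, 0.6),
  area_ha   = runif(n, 1000, 9000)
)
dat$ae <- 5 + 0.05 * dat$rain_mean + 40 * dat$ndvi_med +
  0.001 * dat$area_ha + rnorm(n, 0, 2)

idx   <- sample(seq_len(n), size = round(0.8 * n))
train <- dat[idx, ]
test  <- dat[-idx, ]

# Fit a boosted regression tree for a continuous response (squared-error loss).
fit <- gbm(ae ~ rain_mean + ndvi_med + area_ha,
           data              = train,
           distribution      = "gaussian",
           n.trees           = 3000,
           interaction.depth = 3,    # depth of each weak learner
           shrinkage         = 0.01, # learning rate
           bag.fraction      = 0.5,  # stochastic subsampling, as in SGB
           cv.folds          = 5)

best <- gbm.perf(fit, method = "cv")   # number of trees chosen by cross-validation
summary(fit, n.trees = best)           # relative influence of the covariates
pred <- predict(fit, newdata = test, n.trees = best)
sqrt(mean((test$ae - pred)^2))         # test-set RMSE
```

The relative-influence summary is what allows the automatic variable selection described in the Abstract: covariates that are rarely used to define splits receive low influence.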
Gradient Boosting

In this section we give a brief summary of the method, following Friedman [18]. This supervised machine learning approach deals with a response variable $y$ and a vector of predictor variables $x$ that are connected via a joint probability distribution $P(x, y)$. Given known values of $x$ and corresponding values of $y$, the goal is to find an approximation $F^{*}(x)$ to a function $F(x)$ that minimises the expected value of a loss function $\psi(y, F(x))$, i.e.

$$F^{*}(x) = \arg\min_{F(x)} E_{y,x}\, \psi(y, F(x)). \qquad (1)$$

Boosting approximates $F^{*}(x)$ by an "additive" expansion of the form

$$F(x) = \sum_{m=0}^{M} \beta_m h(x; a_m), \qquad (2)$$

where the functions $h(x; a)$ are generally simple functions of $x$ with parameters $a = \{a_1, a_2, \ldots\}$. The parameters $\{a_m\}_0^M$ and the expansion coefficients $\{\beta_m\}_0^M$ are jointly fit to the training data $\{(x_1, y_1), \ldots, (x_N, y_N)\}$. This is done in a forward stagewise manner. Gradient Boosting [18] approximately solves differentiable loss functions $\psi(y, F(x))$ with a two-step procedure. First, the function $h(x; a)$ is fit by least squares to the current "pseudo"-residuals

$$\tilde{y}_{im} = -\left[ \frac{\partial \psi(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}, \qquad (3)$$

which represent the residuals at the given stage of the tree building. Then, given $h(x; a_m)$, the optimal value of the coefficient $\beta_m$ is calculated via

$$\beta_m = \arg\min_{\beta} \sum_{i=1}^{N} \psi\big(y_i, F_{m-1}(x_i) + \beta h(x_i; a_m)\big). \qquad (4)$$

Gradient Tree Boosting performs this with a base learner $h(x; a)$ given by an $L$ terminal node regression tree. A regression tree partitions the feature space into $L$ disjoint regions $\{R_{lm}\}_{l=1}^{L}$ and predicts a separate constant value in each at iteration $m$:

$$h\big(x; \{R_{lm}\}_{l=1}^{L}\big) = \sum_{l=1}^{L} \bar{y}_{lm}\, 1(x \in R_{lm}). \qquad (5)$$

The parameters of the base learner are the splitting variables and corresponding split points that define the tree, and these determine the corresponding regions $\{R_{lm}\}_{l=1}^{L}$ of the partition at each iteration. They are found in a top-down "best-first" manner using a least squares splitting measure [18]. Equation (4) can be solved individually within each region $R_{lm}$ defined by the corresponding terminal node $l$ of the $m$th tree. Because the tree in Equation (5) predicts a constant value $\bar{y}_{lm}$ within each region $R_{lm}$, the solution to Equation (4) reduces to a simple location estimate based on the criterion $\psi$:

$$\gamma_{lm} = \arg\min_{\gamma} \sum_{x_i \in R_{lm}} \psi\big(y_i, F_{m-1}(x_i) + \gamma\big). \qquad (6)$$

Next, the current approximation $F_{m-1}(x)$ is individually updated in each of the corresponding regions:

$$F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_{lm}\, 1(x \in R_{lm}), \qquad (7)$$

where $\nu$ is the shrinkage (learning rate) parameter.

Friedman [18] added a stochastic element to the above boosting algorithm by proposing to draw, at each iteration, a random subsample from the full training data set without replacement. This subsample is then used to fit the base learner and compute the model update for the current iteration. Adding this randomness improved the performance of gradient boosting and resulted in the stochastic Gradient Boosting Machine (GBM) [23]. The Stochastic Gradient Boosting algorithm is summarised as pseudo code in [15] [23]. The input training data are $\{(x_i, y_i)\}_{i=1}^{N}$, with $N$ samples and $n$ features; $\{\pi(i)\}_{i=1}^{N}$ denotes a random permutation of the integers $1, \ldots, N$, and the random subsample of size $\tilde{N} < N$ is then given by $\{(x_{\pi(i)}, y_{\pi(i)})\}_{i=1}^{\tilde{N}}$.
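As a hedged, minimal rendering of the algorithm above for squared-error loss (where the pseudo-residuals in Equation (3) reduce to ordinary residuals and the region estimates in Equation (6) to region means), the sketch below uses shallow rpart trees as the base learner. The simulated data, subsample fraction, shrinkage and number of iterations are illustrative, and this is not the gbm package's implementation.

```r
library(rpart)

# Toy regression data (illustrative only).
set.seed(7)
n   <- 300
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- sin(2 * pi * dat$x1) + 2 * dat$x2 + rnorm(n, 0, 0.3)

M    <- 200   # number of boosting iterations
nu   <- 0.1   # shrinkage (learning rate) in Equation (7)
frac <- 0.5   # subsample fraction drawn without replacement

F_hat <- rep(mean(dat$y), n)   # F_0: constant initial approximation
trees <- vector("list", M)

for (m in seq_len(M)) {
  # Pseudo-residuals; for squared-error loss these are ordinary residuals (Eq. 3).
  r <- dat$y - F_hat

  # Stochastic step: random subsample drawn without replacement.
  sub <- sample(seq_len(n), size = floor(frac * n))

  # Shallow regression tree fit to the residuals acts as the base learner h(x; a_m).
  trees[[m]] <- rpart(r ~ x1 + x2,
                      data    = data.frame(r = r, dat)[sub, ],
                      control = rpart.control(maxdepth = 2, cp = 0))

  # Update the approximation in every region, damped by the shrinkage nu (Eq. 7).
  F_hat <- F_hat + nu * predict(trees[[m]], newdata = dat)
}

sqrt(mean((dat$y - F_hat)^2))  # training RMSE of the boosted ensemble
```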
4. Results

The data were presented as a set $\{(x_i, y_i)\}$ with feature vectors $x_i \in \mathbb{R}^n$ and responses $y_i \in \mathbb{R}$. All the data used for our case study were combined into a structured comma-separated values (CSV) file consisting of 209 observations and 141 covariates. The machine-friendly names of our covariates were generated from five components, in the following manner. The first component indicates whether the calculated summary statistics are for monthly values of EOLW/D = end of last wet/dry, or WS = wet season; this is followed by whether it is an aggregated mean, minimum red or maximum monthly value; then comes the nature of the descriptive statistic: first quartile, median, mean, third quartile, variance or Shannon entropy; next comes the name of the data source (e.g. rain = SILO data); and lastly the corresponding area in proximity to water (3 km, 5 km, or 99 km = whole paddock). The covariate name paha.99km/5km stores the paddock area measured in hectares for the given proximity to water, e.g. a 5 km radius or 99 km for the whole extent of the paddock. As described in Section 2.2, the data set was partitioned by treating 80% as training data and the remaining 20% as test data.
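To make the naming scheme concrete, the sketch below assembles candidate covariate names from the five components just described. The abbreviations and the dot separator are assumptions for illustration only, since the paper lists the components but gives only one example name (paha.99km).

```r
# Hypothetical components of a covariate name, following the five-part scheme
# described above. Abbreviations and the "." separator are illustrative.
season      <- c("EOLW", "EOLD", "WS")      # end of last wet/dry, wet season
aggregation <- c("mean", "min", "max")      # monthly aggregation
statistic   <- c("q1", "median", "mean", "q3", "var", "entropy")
data_source <- c("rain", "ndvi", "fcover")  # e.g. rain = SILO data
distance    <- c("3km", "5km", "99km")      # proximity to water; 99km = whole paddock

grid <- expand.grid(season, aggregation, statistic, data_source, distance,
                    stringsAsFactors = FALSE)
covariate_names <- apply(grid, 1, paste, collapse = ".")

length(covariate_names)  # number of candidate names from the full crossing
head(covariate_names)    # e.g. "EOLW.mean.q1.rain.3km"
```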