Journal of Statistical Computation and Simulation

[SPECIAL ISSUE AsiaSim] A Sliding Window based Multi-stage Clustering and Probabilistic Forecasting Approach for Large Multivariate Time Series Data

Journal: Journal of Statistical Computation and Simulation
Manuscript ID: GSCS-2016-0670
Manuscript Type: Original Paper
Areas of Interest: SPECIAL ISSUE (ASIASIM), TIME SERIES
2010 Mathematics Subject Classification: 62H30

URL: http:/mc.manuscriptcentral.com/gscs
A Sliding Window based Multi-stage Clustering and Probabilistic Forecasting Approach for Large Multivariate Time Series Data

Abstract: Time series data analysis, such as temporal pattern recognition and trend forecasting, plays an increasingly significant part in temporal data statistics and analytics. Yet challenges remain in the efficiency of pattern extraction and trend prediction for large multivariate time series. This paper proposes a multi-stage clustering approach for multivariate time series using dynamic sliding time windows. The segmented multivariate time series are clustered separately in each time window to produce first-stage clustering centers, which are then used to generate second-stage clustering results involving all time windows. The method simplifies large-scale multivariate time series mining problems through multi-stage clustering on multiple sliding time windows and thus achieves improved efficiency. Based on the clustering outcomes, a correlation rule mining method is given to discover frequent patterns in the time series and generate self-correlation rules. The paper then presents a probabilistic forecasting model that leverages the extracted rules to make short-term predictions. Finally, experiments are presented to show the efficiency and effectiveness of the proposed approach.

Keywords: time series, statistics, multi-stage clustering, dynamic sliding time window, time series forecasting model

1. Introduction

A time series is a sequence of data points in successive time order. Most commonly, time series are represented as curve or line charts.
Time series statistics and analysis aim at extracting meaningful statistical characteristics to identify underlying patterns and to predict future events, and are widely used in signal processing, mathematical finance, earthquake prediction, smart control, and transport trajectory forecasting. With the rise of the Big Data era, time series mining and forecasting play an increasingly significant role in data statistics and analytics [1]. Time series statistics and analysis mainly concern two goals: discovering the nature of the phenomenon represented by the temporal data, and predicting future values by building forecasting models. Time series pattern identification plays an essential part in achieving both goals. Most time series mining methods involve similarity measurement, time series clustering, and association rule analysis. Time series pattern mining methods usually vary with the specific temporal data and its characteristics. Romani [2] proposed a time series mining method for identifying association rules between meteorological data and remote sensing data; the method extracts the characteristic patterns of all kinds of data and generates association rules by training. Li [3] introduced an algorithm for mining feature patterns, which can identify the
co-evolutional sequences, with or without missing values, by exploiting smoothness and correlation. Yao [4] developed a mode-dependent stock mining method for pattern discovery in stock data and trend prediction. In this method, a time search model finds frequent patterns in stock data by converting the continuous time series into a discrete space, and the model makes short-term forecasts of stock price trends using those patterns. Novak [5] applied soft computing methods such as fuzzy logic and fuzzy transformation to time series data mining, used them to extract a variety of information, and finally integrated the results through natural language. Regarding research on time series clustering, Xi [6] presented a clustering method based on singular value decomposition (SVD) to address the problem of special structure in time series clustering; it leverages SVD to obtain memory features of time series and clusters those features by canopy and K-means. Tamura [7] proposed a two-stage clustering method: in the first stage, the one-pass k-median++ method obtains the initial clustering centers without iterating, and clustering is carried out in the second stage. Xu [8] reduced the dimensionality of the feature space for ICU data and segmented the time series by time segmentation technology; the method was applied to mortality prediction. Wang [9] divided large-scale time series into subsequences, each represented as a group of coarse-granularity fuzzy information; as a result, the original problem can be converted from the numeric level to the granular level, achieving more effective clustering.
Regarding time series forecasting models and methods, Ching-Hsue Cheng [10] built a time series model based on indicator selection through multivariate adaptive regression splines and stepwise regression; a forecasting model was then established by SVR and optimized by GA. Ching-Hsue Cheng [11] also proposed a rough set rule induction method used to predict stock prices. Winita Sulandari [12] presented a hybrid time series model based on moving averages and weighted fuzzy logic, used to enhance the accuracy of time series trend forecasting. Bindu Garg [13] established a predicting model based on fuzzy time series; the model focuses on the influence of trend and seasonal components and is used to forecast trends in a fuzzy environment. Zhibo Zhu [14] introduced a stock trend analysis method based on feature simplification, which first uses the local least squares method to arrange the features and then selects among them; the method can be used for trend prediction of both micro stock data and the macro stock market.

Most of the above methods emphasize extracting time sequence features and transforming continuous time series data into a combination of extracted features. This often leads to data loss during computation, resulting in lower effectiveness in describing the nature of the data as well as lower forecasting efficiency, especially for large multivariate time series. In addition, although clustering is widely used, pattern computing efficiency still faces challenges for large multivariate time series. This paper proposes a sliding window based multi-stage clustering approach for multivariate time series using dynamic sliding time windows (DSTW).
The segmented multivariate time series are clustered separately in each time window to produce first-stage clustering centers, which are then used to generate second-stage clustering results involving all time windows. The method simplifies large-scale multivariate time series mining problems by multi-stage clustering on multiple sliding
time windows, thus achieving improved efficiency. Based on the clustering outcomes, a correlation rule mining method is given to discover frequent patterns in the time series and generate association rules. The paper then presents a probabilistic forecasting model that leverages the extracted rules to make short-term predictions. Finally, experiments are presented to show the efficiency and effectiveness of the proposed approach.

The rest of the paper is organized as follows. Section 2 presents the multi-stage clustering method using DSTW. Section 3 discusses the forecasting model based on rules obtained by clustering. Section 4 gives experiments that verify the proposed approach. Section 5 concludes the work.

2. Multi-Stage Clustering Method Using DSTW For Multivariate Time Series

In this section, we present the multi-stage clustering method based on DSTW for multivariate time series data.

2.1. Fuzzy Centers Selection Method For Multivariate Time Series

We improve the clustering algorithm based on K-means [15] to adapt it to multivariate time series. The steps of the K-means algorithm are as follows.

(1) Select k elements from the data source as initial centers:

$\mu = \{\mu_1, \mu_2, \mu_3, \ldots, \mu_k\}$    (1)

(2) Calculate the distance between each element and the centers, then assign each element to the cluster with the minimum distance [16]:

$c^{(i)} := \arg\min_j \| x^{(i)} - \mu_j \|^2$    (2)

(3) Recalculate the mean value of each cluster.
$\mu_j := \dfrac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$    (3) [17]

(4) Calculate the standard measuring function; the algorithm terminates once certain conditions are met, such as convergence to one point; otherwise the process returns to step (2).

As the clustering objects are multivariate time series, typically represented by a set of curves rather than points, we improve the algorithm by incorporating features of the multiple temporal curves. The improvements mainly include two aspects.

(1) Initial centers selection

Initial center selection has a crucial impact on the results of the algorithm. Random selection of initial centers is a common way, but it usually leads to unsatisfactory clustering results. In addition, improper selection of initial centers typically results in superfluous iterations costing too much computing time. One optimization strategy is to run the algorithm multiple times, choosing a different group of random centers at each run, and then select the clustering with the minimum
quadratic sum of variance. This strategy is simple, and its results depend on the number of clusters as well as on the data set. Another strategy is to select each successive initial center as the element farthest from the initial centers already selected. This guarantees that the selected initial centers are scattered, but outliers may be chosen. We integrate the two optimization strategies above and make a further improvement. Figure 1 shows the improved process of initial center selection. The first center is selected randomly from the source data set and added to a special collection used for storing centers. Then, in the process of successive center selection, once an element is selected randomly, the distance between the element and each existing center is calculated. The element is confirmed as an initial center and added to the collection if the distance meets a predefined threshold; otherwise, successive center selection continues until the collection is filled. The above process is run multiple times, and the clustering with the minimum quadratic sum of variance is selected.

Figure 1. Initial centers selection method

(2) Reselection of centers

Since common mathematical methods such as average calculation are not suitable for time
series data elements in clustering, we present a center reselection method for time series. Figure 2 shows the method, which assumes that the initial centers are selected appropriately with small computing error in the clustering process, so that the elements in a cluster are relatively close to the current center. The method traverses twice the clusters that need center reselection, computing the distance between each element and the others. Once over 50% of those distances are within the scope of a preset threshold, the element is chosen as the new center. If no element meets the threshold conditions, the initial center of the cluster is regarded as an improper choice, and a new element is selected randomly as the new center.

Figure 2. Selection method of new center

2.2. First-stage Clustering Within Time Windows

Multivariate time series should be divided into segments before performing clustering in time windows. Based on sliding time windows [18,19], dynamic sliding time windows enable dynamic variation of the time windows by configuring the parameters during the segmentation process. Figure 3 shows the principle of DSTW. The key variables are the size and the sliding step length of the time windows used for data partitioning.

The size configuration of the time windows has an essential impact on computational efficiency. Because we use the dynamic time warping (DTW) distance [20, 21] to compute the distances between multivariate time series, it is not necessary to make segments of equal length; the overall trend of a segment is more important. Instead, the size of the time windows can be set to a range [22].
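The DTW comparison of segments can be sketched as a minimal dynamic program, assuming a Euclidean distance between individual time points (each point a vector, covering the multivariate case); the function and variable names are illustrative and not taken from the paper.

```python
import math

def dtw_distance(a, b):
    """DTW distance between two sequences of equal-dimension vectors,
    which may have different lengths."""
    n, m = len(a), len(b)
    # cost[i][j] = DTW distance between the prefixes a[:i] and b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean point distance
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# Segments of unequal length can still be compared on overall trend:
s1 = [(0.0,), (1.0,), (2.0,), (3.0,)]
s2 = [(0.0,), (1.0,), (1.0,), (2.0,), (3.0,)]
print(dtw_distance(s1, s2))  # 0.0 — identical trend despite unequal length
```

This is why segments produced by windows of slightly different sizes remain directly comparable.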
One important point is that too large a difference between segment sizes should be avoided, because the comparison would no longer be meaningful. In our method, the fluctuation range is set to 1~1.3 times the initial window size. In addition, the sliding step also has a critical influence on the clustering results: too small a step leads to redundant repetitive computation because the cross-window time series data sets overlap.
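The segmentation rule above can be sketched as follows, assuming the window size is drawn uniformly between the base size and 1.3 times the base size; the function name and parameters are illustrative, not from the paper.

```python
import random

def segment_dstw(series, base_size, step, max_scale=1.3, seed=0):
    """Split a series into windows whose sizes vary between base_size and
    max_scale * base_size, with window starts sliding by `step`."""
    rng = random.Random(seed)  # fixed seed for reproducible segmentation
    segments, start = [], 0
    while start < len(series):
        size = rng.randint(base_size, int(max_scale * base_size))
        segments.append(series[start:start + size])  # truncated at the end
        start += step
    return segments

series = list(range(100))
segs = segment_dstw(series, base_size=10, step=8)
print(len(segs), [len(s) for s in segs[:3]])
```

With step smaller than the window size, adjacent windows overlap; choosing the step trades coverage of cross-window patterns against the redundant computation noted above.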
Figure 3. Dynamic sliding time window

Once the parameters are configured, the first-stage inner-window clustering is run. As shown in Figure 4, the inner-window data of the segmented multivariate time series are first extracted and stored in memory. Then, for each collection of multivariate time series data, all segmented time series within the same time window are clustered. As a result, the method generates a number of independent inner-window clusters that cover all time periods.

Figure 4. Subsequence clustering in the same period

2.3. Second-stage Clustering Based on Centers

In the second stage, the small segmented clusters from the first stage are aggregated into larger clusters. First, the method gathers the clustering centers of the first-stage results as the original data set of the second stage. Second-stage clustering is then run on this data set to generate clusters of the centers. Further, the corresponding elements of each new cluster center that appear in the first-stage results are copied into the corresponding second-stage clusters to obtain the final clustering results.
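The two-stage scheme can be sketched compactly: cluster the segments inside each window, then cluster the collected first-stage centers, and finally reassign each segment to the second-stage cluster of its center. A plain k-means on one-dimensional features stands in for the paper's improved K-means with DTW distances; all names are illustrative.

```python
def kmeans(points, k, iters=20):
    """Minimal deterministic 1-D k-means; returns (centers, label per point)."""
    centers = sorted(points)[::max(1, len(points) // k)][:k]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(p - centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, labels

def multistage_cluster(windows, k1, k2):
    # Stage 1: cluster segments separately within each time window.
    stage1 = [kmeans(w, k1) for w in windows]
    # Stage 2: cluster the collected first-stage centers across all windows.
    all_centers = [c for centers, _ in stage1 for c in centers]
    _, center_labels = kmeans(all_centers, k2)
    # Copy each original segment into the second-stage cluster of its center.
    final = [[] for _ in range(k2)]
    idx = 0
    for (centers, labels), window in zip(stage1, windows):
        for p, l in zip(window, labels):
            final[center_labels[idx + l]].append(p)
        idx += len(centers)
    return final

windows = [[1.0, 1.1, 9.0, 9.2], [0.9, 1.2, 9.1, 8.8]]
clusters = multistage_cluster(windows, k1=2, k2=2)
print([sorted(c) for c in clusters])
```

Each window is clustered independently, so the expensive first stage parallelizes over windows; the second stage only touches the k1-per-window centers, which is the source of the claimed efficiency gain.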
Figure 5. Multi-stage clustering method based on DSTW

Figure 5 shows the overall process of the multi-stage clustering method. The method takes data from the Data Source Pool, which contains subsequences of all time periods. These data are organized by period, and the first-stage clustering is executed on the resulting data sets by invoking the improved K-means algorithm from the Algorithm Pool. The results are temporarily saved in the Intermediate Result Storage Area, which serves the second-stage clustering. Center clustering is carried out on the centers extracted from the first-stage clusters to generate the center clusters. Finally, the first-stage clustering results in the Intermediate Result Storage Area are integrated into the corresponding clusters according to the centers, and the final clusters are stored in the Cluster Result Storage Area.

3. Forecasting Model Based on Rules from Multi-stage Clustering

3.1. Rule Base Establishment Based on Clustering Results

Self-correlation association rule mining for multivariate time series data is a non-standard process based on rule extraction from the clustering results. First, the method obtains the first result set (the clustering results after pruning). Then the antecedents and consequents are drawn from the source data of the first result set to constitute a new data set, which is clustered to obtain the second result set. Finally, rules deriving the latter from the former are found after computing the confidence of each rule.
The rules obtained in this part are of the form:

$A \rightarrow B$    (4)

$\mathit{confidence} = (A, B)_{count} / A_{count}$    (5)

where A is the antecedent, B is the consequent, and (A, B)count is the count of the simultaneous occurrences of A and B.
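Equations (4) and (5) amount to counting co-occurrences of antecedent and consequent cluster labels over the observed pairs. A minimal sketch, with illustrative names and toy label data not taken from the paper:

```python
from collections import Counter

def rule_confidence(antecedents, consequents):
    """confidence(A -> B) = (A, B)count / Acount, computed over paired
    cluster labels of antecedent and consequent segments."""
    pair_counts = Counter(zip(antecedents, consequents))  # (A, B)count
    ante_counts = Counter(antecedents)                    # Acount
    return {(a, b): n / ante_counts[a] for (a, b), n in pair_counts.items()}

# Cluster labels of successive segment pairs (illustrative):
ante = ["C1", "C1", "C1", "C2"]
cons = ["C3", "C3", "C4", "C3"]
print(rule_confidence(ante, cons))
# e.g. C1 -> C3 has confidence 2/3: C1 occurs 3 times, twice followed by C3
```

Rules whose confidence exceeds a chosen threshold would then be kept in the rule base for the probabilistic forecasting model.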