A framework for mining evolving trends in Web data streams using dynamic learning and retrospective validation
Introduction
Contributions of this paper
Organization of this paper
Tecno-streams (tracking evolving clusters in noisy streams)
Cloning in the dynamic immune system
Learning new data and relation to emerging trend detection
Tecno-streams: tracking evolving clusters in noisy data streams with a scalable immune system learning model
Example of learning an evolving synopsis from a noisy data stream
Mining evolving user profiles from noisy Web clickstream data
Similarity measures used in the learning phase of single-pass mining of clusters in Web data
Validation metrics for single-pass mining of evolving Web data streams
Validation methodology for single-pass mining of evolving Web data streams
Single-pass mining of evolving topics in text data
Simulation results on the 20 newsgroups data
Simulation results with single-pass mining of user profiles from real Web clickstream data
Conclusions
Acknowledgement
References
Computer Networks 50 (2006) 1488–1512
www.elsevier.com/locate/comnet

A framework for mining evolving trends in Web data streams using dynamic learning and retrospective validation

Olfa Nasraoui a,*, Carlos Rojas a, Cesar Cardona b,1

a Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY 40292, United States
b Magnify Inc., Chicago, United States

Available online 27 December 2005
doi:10.1016/j.comnet.2005.10.021
© 2005 Elsevier B.V. All rights reserved.

Abstract

The expanding and dynamic nature of the Web poses enormous challenges to most data mining techniques that try to extract patterns from Web data, such as Web usage and Web content. While scalable data mining methods are expected to cope with the size challenge, coping with evolving trends in noisy data in a continuous fashion, without any unnecessary stoppages and reconfigurations, is still an open challenge. This dynamic and single-pass setting can be cast within the framework of mining evolving data streams. The harsh restrictions imposed by the "you only get to see it once" constraint on stream data call for different computational models, which may furthermore bring some interesting surprises when it comes to the behavior of some well-known similarity measures during clustering, and even validation. In this paper, we study the effect of similarity measures on the mining process and on the interpretation of the mined patterns in the harsh single-pass requirement scenario. We propose a simple similarity measure that has the advantage of explicitly coupling the precision and coverage criteria to the early learning stages. Even though the cosine similarity and its close relatives, such as the Jaccard measure, have been prevalent in the majority of Web data clustering approaches, they may fail to explicitly seek profiles that achieve high coverage and high precision simultaneously. We also formulate a validation strategy and adapt several metrics rooted in information retrieval to the challenging task of validating a learned stream synopsis in dynamic environments. Our experiments confirm that the performance of the MinPC similarity is generally better than that of the cosine similarity, and that this outperformance can be expected to be more pronounced for data sets that are more challenging in terms of the amount of noise and/or overlap, and in terms of the level of change in the underlying profiles/topics (known sub-categories of the input data) as the input stream unravels. In our simulations, we study the task of mining and tracking trends and profiles in evolving text and Web usage data streams in a single pass, and under different trend sequencing scenarios.

Keywords: Mining evolving data streams; Web clickstreams; Web mining; Text mining; User profiles

* Corresponding author. E-mail addresses: olfa.nasraoui@louisville.edu (O. Nasraoui), c.rojas@louisville.edu (C. Rojas), ccardona@magnify.com (C. Cardona).
1 This research was done while C. Cardona was at the University of Memphis.

"You cannot step twice into the same stream. For as you are stepping in, other waters are ever flowing on to you." — HERACLITUS, c. 535–475 BC, quoted by Plato.

1. Introduction
The Web has been a relentless generator of data that comes in a variety of forms, ranging from Web content data that forms the substance of most Web documents, to the daily trails left by visitors as they surf through a Website, also known as Web usage data. Hidden in this data often lurk interesting knowledge or patterns, such as Web user access trends or profiles, that can be used to achieve various objectives, including supporting customer relationship management and personalization of the user's experience on a Website. Recently, data mining techniques have been applied to extract usage patterns from Web log data [3,6,18–21,24–26,29,30]. Most of these efforts have proposed using various data mining or machine learning techniques to model and understand Web user activity. In [29], clustering was used to segment user sessions into clusters or profiles that can later form the basis for personalization. In [21], the notion of an adaptive Website was proposed, where the user's access pattern can be used to automatically synthesize index pages. The work in [6] is based on using association rule discovery as the basis for modeling Web user activity, while the approach proposed in [3] used Markov Random Fields to model Web navigation patterns for the purpose of prediction. The work in [30] proposed building data cubes from Web log data, and later applying online analytical processing (OLAP) and data mining on the cube model. The work in [25] presents a complete Web Usage Mining (WUM) system that extracts patterns from Web log data with a variety of data mining techniques. New relational clustering techniques with robustness to noise were used in [20,19] to discover user profiles that can overlap, while a density-based evolutionary clustering technique is proposed in [18] to discover multi-resolution and robust user profiles. The K-Means algorithm was used in [24] to segment user sequences into different clusters. An extensive survey of different approaches to Web usage mining can be found in [26]. It is interesting to note that an incremental way to update a Web usage mining model was proposed in [3]. In this approach, the user navigation records are modeled by a hypertext probabilistic grammar (HPG) whose higher-probability generated strings correspond to the users' preferred trails. The model had the advantages of being self-contained (i.e., it has all the statistics needed to mine all the data accumulated), as well as compact (the model was in the form of a tree whose size depends on the number of items instead of the number of users, which enhances scalability). The HPG model was incremental, in the sense that when more log data became available, it could be incorporated into the model without the need to rebuild the grammar from scratch.

Unfortunately, with the exception of [3] (which provided a scalable way to model Web user navigation, but did not explicitly address the change/evolvability aspect of this data), all the aforementioned methods assume that the entire pre-processed Web session data can reside in main memory. This can be a disadvantage for systems with limited main memory in the case of huge Web session data, since the I/O operations would have to be extensive to shuffle chunks of data in and out, thus compromising scalability. Today's Websites are a source of an exploding amount of clickstream data that can put the scalability of any data mining technique into question.
Moreover, the Web access patterns on a Website are very dynamic in nature, due not only to the dynamics of Website content and structure, but also to changes in the users' interests, and thus their navigation patterns. The access patterns can be observed to change depending on the time of day, the day of the week, and according to seasonal patterns or other external events. As an alternative to locking the Web access patterns in a frozen state that depends on when the Web log data was collected, an intelligent Web usage mining system should be able to continuously learn under such conditions without ungraceful stoppages, reconfigurations, or restarting from scratch. For all these reasons, Web usage data should be considered a reflection of a dynamic environment, which therefore requires dynamic learning of the user access patterns.

This dynamic setting can be cast within the framework of mining evolving data streams. Data streams are massive data sets that arrive with a throughput so high that the data can only be analyzed sequentially and in a single pass. The discovery of useful patterns from data streams is referred to as stream data mining. In particular, a recent explosion of applications generating and analyzing data streams has added new, unprecedented challenges for clustering algorithms if they are to be able to track changing clusters in streams using only the new data points, because storing past data is not even an option [1,2,5,10]. Because most data streams unleash data points or measurements in a non-arbitrary order, they are inherently attached to a temporal aspect, meaning that the patterns that could be discovered from them follow dynamic trends, and hence they are different from traditional static data sets that are very large. Such data streams are referred to as evolving data streams.
For these reasons, even techniques that are scalable for huge data sets may not be the answer for mining evolving data streams, because these techniques always strive to work on the entire data set without making any distinction between new data and old data, and hence cannot be expected to handle the notion of emerging and obsolete patterns. The harsh restrictions imposed by the "you only get to see it once" constraint on stream data call for different computational models and different validation methodologies from the ones used in traditional data mining.

In [16], we proposed a new immune-system-inspired approach for clustering noisy multi-dimensional data streams, called TECNO-STREAMS (tracking evolving clusters in noisy streams), that has the advantages of scalability, robustness, and automatic scale estimation. TECNO-STREAMS is a scalable clustering methodology that gleans inspiration from the natural immune system to continuously learn and adapt to new incoming patterns by detecting an unknown number of clusters in evolving noisy data in a single pass. Data is presented in a stream, and is processed sequentially as it arrives, in a single pass over the data stream. A stream synopsis is learned in a continuous fashion. The stream synopsis consists of a set of synopsis nodes, or cluster representatives with additional properties such as spatial scale and age, that offer a summary of the data stream that is concise, and yet accurate. Because the data stream is a dynamic source of data, the stream synopsis itself is dynamic, and will change to reflect the status of the current data stream. The stream synopsis is constrained so that its size does not exceed a maximal limit that is predefined depending on the application, and preference is given to newer parts of the data stream in occupying the synopsis nodes that represent them. Obsolete parts of the summary that correspond to older parts of the data stream are gradually purged from the synopsis and delegated to secondary storage.

1.1. Contributions of this paper

Within the context of mining evolving Web clickstreams, we apply the mechanics of TECNO-STREAMS to continuously discover an evolving profile synopsis, consisting of synopsis nodes. Each synopsis node is an entity summarizing a basic Web usage trend, also referred to as a profile, that is characterized by the following descriptors: a typical representative user session summary (a bag of URL indices); a spatial scale, or dispersion in the usage pattern around this representative (this is the amount of variance in the distance between compatible data stream input sessions and this node; it is also a measure of error or variance that reflects the accuracy of the synopsis node as a representative of the input data stream); and an age (the time since the profile's birth).
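To make these descriptors concrete, the following is a minimal Python sketch of such a node. It is our own illustration rather than the paper's data structure, and the field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class SynopsisNode:
    """One candidate profile in the stream synopsis (illustrative only)."""
    urls: set        # representative session summary: a bag of URL indices
    scale: float     # sigma^2: dispersion of compatible sessions around the representative
    age: int = 0     # time since the profile's birth (refreshed when re-activated)
```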
In this paper, we study the task of tracking emerging topics/clusters in noisy and evolving text data sets (text mining), and of mining evolving user profiles from Web clickstream data (Web usage mining), in a single pass and under different trend sequencing scenarios. A trend sequencing scenario corresponds to a specific way to order the Web sessions or the text documents as they are presented as input to the stream mining algorithm. If the user sessions or the text documents are known to belong to certain profiles or classes/document categories, henceforth also called trends, then one particular sequencing scenario may be obtained by presenting the sessions or documents according to a particular sequence of the trends, such as first the sessions/documents from the first trend, then the sessions/documents from the second trend, and so on. Reversing the order of the trends will naturally result in a different sequence. Alternatively, the sessions/documents can be presented in the same order as they were received in the original data stream. In this case, we refer to the sequencing scenario as "regular order", "chronological order", or "natural chronological order".

We propose a validation methodology and several validation metrics that are rooted in information retrieval, and that are useful to assess the quality of the stream synopsis as a summary of an input data stream from the points of view of precision, coverage (or recall), and adaptability to the evolution of the stream. The coverage or recall of a synopsis node measures the proportion of items matching the input data in its vicinity. High coverage means that the node covers most of the items in the input. On the other hand, high precision means that the synopsis node covers only the correct items in the input data, and not any additional superfluous items due to noise or other artifacts. Coverage and precision generally work in contradicting ways. For example, the best coverage is obtained when a single synopsis node contains all the items in the data, but then its precision will be very low. And vice versa: a node that consists of only one of the correct items will have 100% precision, but its coverage will suffer.
Our validation procedure is useful within the framework of mining evolving Web data streams. Unlike existing frameworks, which study mostly static data sets, our adopted validation strategy is based on taking into account both the content and the changing nature (hence the temporal aspect) of the stream data mining task, so that the synopsis discovered by data mining is evaluated from the perspectives of precision and coverage throughout the entire temporal span of the experiment, and not just at one specific time, such as the end of the stream. This philosophy of validation gives rise to interesting retrospective validation measures that are rooted in some classical information retrieval metrics, but that evaluate the learned synopsis along two different dimensions: (i) the temporal dimension, which corresponds to the order in which the input data stream arrives, and (ii) the ground-truth content categories from which the input data is known to originate. In the case of Web user sessions/clickstreams, the categories correspond to user trends, profiles, or classes; while in the case of text documents, the categories correspond to known document class labels. Since the emphasis is on learning an accurate synopsis of an arbitrary and unlabeled dynamic data stream, these categories are only used during the final validation phase, and not in the actual learning/data mining.

Our previous discussion emphasized the importance of both precision and coverage for assessing the quality of the learned synopsis. This is not surprising, since these measures have always been the main validation metrics used in information retrieval. However, this also led us to ask the following question: if these metrics are so critical in assessing quality, then why not use them to guide the search for a better synopsis? Since each synopsis node is evaluated based on the density of similar data inputs in its vicinity, this led us to adapt the similarity measure to take into account the precision and coverage metrics of a candidate synopsis node. For all these reasons, we investigate a simple similarity measure that has the advantage of explicitly coupling the precision and coverage criteria to the early learning stages, hence requiring that the learned profiles be simultaneously precise and complete, with no compromises. A diagram showing the different components of the stream mining framework is shown in Fig. 1.

1.2. Organization of this paper

The rest of this paper is organized as follows. We start by reviewing the TECNO-STREAMS algorithm in Section 2. In Section 3, we describe how we can use TECNO-STREAMS to track evolving clusters in Web usage data. Then, we present our validation methodology and metrics for the evolving stream mining framework. In Section 4, we apply our validation strategy to the task of mining real evolving Web clickstream data and tracking evolving topic trends in textual stream data, while studying the effect of the choice of different similarity measures. Finally, in Section 5, we summarize our findings and present our conclusions.

[Figure] Fig. 1. Mining and validation of evolving data streams.
2. Tecno-streams (tracking evolving clusters in noisy streams)

In this section, we present the main features of TECNO-STREAMS that are relevant to this paper, leaving most of the details to [16]. The immune system (lymphocyte elements) can behave as an alternative biological model of intelligent machines, in contrast to the conventional model of the neural system (neurons). In particular, the artificial immune network (AIN) model is based on Jerne's Immune Network theory [11,27]. The system consists of a network of artificial B Cell lymphocytes, $X_B$, that summarize the learned model, hence playing the role of synopsis nodes. In addition to the B Cells, the immune network consists of stimulating and suppressing links between these cells. Learning takes as input a set of input data (external antigenic agents), $X_a$, and tries to learn an optimal immune network consisting of linked B Cells based on cloning operations. Each B Cell represents a learned pattern that could be matched to a data point or another B Cell in the network. A link between two B Cells gets stronger if they are more similar. This results in co-stimulation between similar cells. Because an excess of co-stimulation between similar B Cells can cause an explosive growth of the B Cell population in a small local neighborhood (by cloning), there is another phenomenon, known as co-suppression, which acts to balance the influence of close B Cells. In addition to controlling population growth and enabling memory in the learned immune network, co-stimulation and co-suppression define implicit pathways of communication between the different elements (synopsis nodes) of the immune network, which act like an adaptive and distributed set of agents that track the distribution of the input data stream. Without this collaborative or cooperative component in the learning, every synopsis node would act as an isolated element with no visibility of the neighboring cells. It has lately been recognized that a richer and higher level of intelligence can emerge from collaborative behavior between even the simplest agents. This phenomenon is frequently referred to as emergence, where complex and organized global behavior can arise from the interaction of simple local rules. Examples can be found in ant colonies, bee swarms, and bird flocks [8,9,22]. In this specific context, this kind of collaborative behavior is expected to enhance memory in a distributed manner, while affecting the dynamics of learning. These crucial characteristics may well be essential to learning and adaptation in a single-pass setting, just as they are crucial to the survival of natural organisms in dynamic environments.

Data from the input stream is matched against a B Cell, or synopsis node, based on a properly chosen similarity measure. This affects the synopsis node's stimulation level, which in turn affects both its outlook for survival and the number of clones that it produces. Because clones are similar to their spawning parent, they together form a network of co-stimulated cells that can sustain themselves even long after the disappearance of the data that initiated the cloning. However, this network of synopsis nodes will slowly wither and die if it is no longer stimulated by the data for which it has specialized, hence gradually forgetting old encounters. This forgetting is the reason why the immune system needs periodical reminders in the form of re-vaccination.
The combined recall and forgetting behavior in the face of external antigenic agents forms the fundamental principle behind the concept of emerging or dynamic memory in the immune system. This is specifically the reason why the immune system metaphor offers a very competitive model within the evolving data stream framework. In the following description, we present a more formal treatment of the intuitive concepts explained above.

We will use TECNO-STREAMS to continuously and dynamically learn evolving patterns from dynamic Web data. To summarize our approach: (1) The input data can be extracted from Web log data (a Web log is a record of all files/URLs accessed by users on a Web site), or from a collection of text documents. (2) The data is pre-processed (e.g., via cookies or IP address and a time-out window for Web logs) to produce session lists: a session list $s_t$ for user number $t$ is a list of indices of the URLs that were visited by the same user, represented as a binary vector (1 if the URL was visited during this session, and 0 otherwise). Each session will represent a data record from the input stream. In the case of text documents, a similar representation is obtained by replacing the notion of a URL by a keyword or term, and a session by a document or Web page content; hence a data object would correspond to one document, represented as a list of terms. (3) The $i$th B Cell plays the role of a synopsis node that represents the $i$th candidate profile $ps_i$ and encodes the relevant URLs (or keywords for text data), which are the attributes in this case, as well as the additional measures of scale ($\sigma^2_{i,J}$) and age ($t_{s_i}$) at any point $J$ (after $J$ inputs have been processed) in the stream sequence.
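As an illustration of step (2), a session list can be encoded as a binary vector over the full URL (or term) vocabulary. The sketch below is ours, not the paper's preprocessing code; the session format and vocabulary size are assumptions:

```python
import numpy as np

def session_to_binary(visited_urls, n_items):
    """Encode a session (indices of visited URLs, or terms of a document)
    as a binary vector: 1 if the item occurs, 0 otherwise."""
    x = np.zeros(n_items, dtype=np.uint8)
    x[list(visited_urls)] = 1
    return x

# Example: a user visited URLs 2, 5 and 7 on a site with 10 indexed URLs.
print(session_to_binary([2, 5, 7], n_items=10))  # [0 0 1 0 0 1 0 1 0 0]
```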
In a dynamic environment, the objects $x_t$ from a data stream $X$ are presented to the immune network one at a time, with the stimulation and scale measures updated incrementally with each new data object. It is more convenient to think of the data index, $t$, as monotonically increasing with time. That is, the $N_X$ data points are presented in the following chronological order: $x_1, x_2, \ldots, x_{N_X}$. Each synopsis node represents an influence zone over the input data space. However, since the data is dynamic in nature and has a temporal aspect, data that is more current will have higher influence compared to data that is less current. Quantitatively, the influence zone is defined in terms of a weight function that decreases not only with the distance from the data location to the synopsis node prototype, but also with the time since the data was presented to the immune network. It is convenient to think of time as an additional dimension that is added to the synopsis node compared to the classical static B Cell, traditionally defined in the data space only [17].

Robust weight/activation function: For the $i$th synopsis node, $ps_i$, $i = 1, \ldots, N_{P_s}$, we define the activation caused by the $t$th data point (received at sequential order or time $t$), after a total of $J$ inputs have been received from the data stream, as

$$w_{it} = w_i\left(d_{it}^2\right) = e^{-\left(\frac{d_{it}^2}{2\sigma_{i,J}^2} + \frac{J - t}{\tau}\right)}, \qquad (1)$$

where $\tau$ is a forgetting time constant that controls the time decay rate of the contribution from old data points, and hence how much emphasis is placed on the currency of the stream synopsis compared to the sequence of data points encountered so far; $d_{it}^2$ is the distance from data $x_t$ (the $t$th data point encountered by the stream synopsis) to synopsis node $ps_i$; and $\sigma_{i,J}^2$ is a scale parameter that controls the decay rate of the weights along the spatial dimensions, and hence defines the size of an influence zone around a cluster prototype. Data samples falling far from this zone are considered outliers. The weights decrease exponentially with the order of presentation of a data point, $t$, and will therefore favor more current data in the learning process.

Influence zone: The $i$th synopsis node represents a soft influence zone, $IZ_i$, that can be interpreted as a robust zone of influence:

$$IZ_i = \{x_t \in X \mid w_{it} \geq w_{\min}\}. \qquad (2)$$

Each synopsis node is allowed to have its own zone of influence, with radial size proportional to $\sigma_{i,J}^2$, that is dynamically estimated. Hence, outliers are easily detected as data points falling outside the influence zone of all synopsis nodes, or through their weak activations ($w_{it} < w_{\min}$, $\forall i$).

Stimulation/optimization criterion: The stimulation level, after $J$ data points have been presented to synopsis node $ps_i$, is defined as the density of the data population around $ps_i$ (i.e., an estimate of the spatial density at the synopsis node, as measured by the number of points in the influence zone divided by the radius of this influence zone):

$$d_{i,J} = \frac{\sum_{t=1}^{J} w_{it}}{\sigma_{i,J}^2}. \qquad (3)$$

2.1. Cloning in the dynamic immune system

The synopsis nodes are cloned in proportion to their stimulation levels relative to the average network stimulation, by creating $N_{clones_i}$ clones or duplicates of the $i$th node, where

$$N_{clones_i} = K_{clone} \cdot \frac{d_{i,J}}{\sum_{k=1}^{N_{P_s}} d_{k,J}}.$$
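The quantities in Eqs. (1)–(3) and the cloning rule translate directly into code. The following is a hedged sketch under our own naming conventions; it assumes the squared distances have already been computed:

```python
import numpy as np

def activation(d2_it, sigma2_i, J, t, tau):
    """Eq. (1): activation of synopsis node i by the t-th input, after J inputs,
    decaying with both spatial distance and elapsed stream time."""
    return np.exp(-(d2_it / (2.0 * sigma2_i) + (J - t) / tau))

def in_influence_zone(w_it, w_min):
    """Eq. (2): x_t belongs to IZ_i iff its activation reaches w_min."""
    return w_it >= w_min

def stimulation(weights_i, sigma2_i):
    """Eq. (3): density around node i -- the sum of its activations over the
    stream so far, divided by its scale."""
    return np.sum(weights_i) / sigma2_i

def n_clones(d_iJ, d_all, K_clone):
    """Section 2.1: clones allotted in proportion to the node's share of the
    total network stimulation."""
    return int(round(K_clone * d_iJ / np.sum(d_all)))
```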
When the synopsis node population size ($N_{P_s}(t)$) at any time $t$ exceeds a pre-specified maximum ($N_{P_{\max}}$), the synopsis nodes are sorted in ascending order of their stimulation levels, and the top ($N_{P_s}(t) - N_{P_{\max}}$) synopsis nodes (those with the lowest stimulation) are killed, or archived in long-term secondary storage in the case of low-stimulation cells that are mature or old.

2.2. Learning new data and relation to emerging trend detection

Somatic hyper-mutation is a powerful natural exploration mechanism in the immune system that allows it to learn how to respond to new data that has never been seen before. However, from a computational point of view, this is a very costly and inefficient operation, since its complexity is exponential in the number of features. Therefore, we model this operation in the artificial immune system model by an instant data duplication whenever a data point is encountered that fails to activate the entire stream synopsis. A new data point, $x_t$, is said to activate the $i$th synopsis node if it falls within its influence zone, $IZ_i$, essentially meaning that its activation of this synopsis node, $w_{it}$, exceeds a minimum threshold $w_{\min}$.

Potential outlier: A potential outlier is a data point that fails to activate the entire synopsis, i.e., $w_{it} < w_{\min}$, $\forall i = 1, \ldots, N_{P_s}$. The outlier is termed potential because, initially, it may either be an outlier or a new emerging pattern. It is only through the continuous learning process that lies ahead that the fate of this outlier will be decided. If it is indeed a true outlier, then it will form no mature nodes in the stream synopsis.
2.3. Tecno-streams: tracking evolving clusters in noisy data streams with a scalable immune system learning model

The following algorithm is only an intuitive list of steps for learning a dynamic synopsis from an evolving input data stream. More details can be found in [16].

TECNO-STREAMS algorithm (optional steps are enclosed in []):
INPUT: data stream $x_t$
OUTPUT: up to a maximum of $N_{P_{\max}}$ synopsis nodes $ps_i$ in the stream synopsis

Fix the maximal population size or number of synopsis nodes, $N_{P_{\max}}$;
Initialize the population using the first $N_{P_{\max}}$ input data, with $\sigma_i^2 = \sigma_{init}$;
Compress the stream synopsis into $K$ subnetworks, with centroids $C_k$, $k = 1, \ldots, K$, using 2 iterations of K-Means;
Repeat for each incoming data point $x_t$ {
    Present the data to each subnetwork centroid $C_k$, $k = 1, \ldots, K$, in the network: compute the distance and activation weight $w_{kt}$, and update the subnetwork's scale $\sigma_k^2$ incrementally;
    Determine the most activated subnetwork (the one with maximum $w_{kt}$);
    IF all synopsis nodes in the most activated subnetwork have $w_{it} < w_{\min}$ (the data does not sufficiently activate the subnetwork) THEN {
        Create by duplication a new synopsis node: $ps_{new} = x_t$ and $\sigma_{new}^2 = \sigma_{init}$;
    } ELSE {
        Repeat for each synopsis node $ps_i$ in the most activated subnetwork {
            IF $w_{it} > w_{\min}$ (i.e., the data activates synopsis node $ps_i$) THEN refresh the age ($t_{s_i} = 0$) of synopsis node $ps_i$;
            ELSE increment the age ($t_{s_i}$) of synopsis node $ps_i$;
            Compute the distance from data $x_t$ to synopsis node $ps_i$;
            Compute synopsis node $ps_i$'s stimulation level;
            Update synopsis node $ps_i$'s scale $\sigma_i^2$;
        }
    }
    Clone and mutate synopsis nodes;
    IF the synopsis size $N_{P_s}(t) > N_{P_{\max}}$ THEN {
        IF the age of the $i$th synopsis node $t_{s_i} < t_{\min}$ THEN temporarily scale the synopsis node's stimulation level to the network average stimulation;
        Sort the synopsis nodes in ascending order of their stimulation levels;
        Kill the worst excess (top $N_{P_s}(t) - N_{P_{\max}}$ according to the previous sorting) synopsis nodes;
        [or move the oldest/mature synopsis nodes to secondary (long-term) storage];
    }
    Compress the stream synopsis periodically (after every $T$ data points) into $K$ subnetworks, using 2 iterations of K-Means with the previous centroids as initial centroids;
}
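To make the control flow of this loop concrete, here is a compact, self-contained Python sketch of a simplified variant. It is not the authors' implementation: it omits subnetwork compression, cloning/mutation, and incremental scale estimation (scales stay fixed at an assumed sigma2_init), keeping only the duplicate-on-outlier, aging, stimulation, and culling steps:

```python
import numpy as np

W_MIN, TAU, SIGMA2_INIT, N_P_MAX, T_MIN = 0.2, 100.0, 0.01, 30, 10  # assumed settings

class Node:
    def __init__(self, proto):
        self.proto = np.asarray(proto, dtype=float)  # synopsis node location
        self.sigma2 = SIGMA2_INIT                    # scale (fixed in this sketch)
        self.age = 0                                 # time since last activation
        self.w_sum = 0.0                             # time-decayed sum of activations

def process_stream(stream):
    nodes = []
    for x_t in stream:
        # Time decay: old contributions to each node's stimulation fade (Eq. (1)).
        for n in nodes:
            n.w_sum *= np.exp(-1.0 / TAU)
        # Spatial activation of every node by the current input.
        w = [np.exp(-np.sum((n.proto - x_t) ** 2) / (2 * n.sigma2)) for n in nodes]
        if not nodes or max(w) < W_MIN:
            # Potential outlier or emerging trend: duplicate the data into a new node.
            nodes.append(Node(x_t))
        else:
            for n, w_it in zip(nodes, w):
                n.age = 0 if w_it > W_MIN else n.age + 1
                n.w_sum += w_it
        # Cull excess nodes: keep the most stimulated (Eq. (3)), shielding young
        # nodes by temporarily rating them at the network-average stimulation.
        if len(nodes) > N_P_MAX:
            avg = np.mean([n.w_sum / n.sigma2 for n in nodes])
            rating = lambda n: avg if n.age < T_MIN else n.w_sum / n.sigma2
            nodes = sorted(nodes, key=rating, reverse=True)[:N_P_MAX]
        yield len(nodes)

# Example: track the synopsis size over a random 2-D stream.
sizes = list(process_stream(np.random.rand(500, 2)))
```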
2.4. Example of learning an evolving synopsis from a noisy data stream

To illustrate how the synopsis is formed dynamically as the input data stream arrives, we show the results on a noisy 2-D data set with three clusters (or trends) and 1400 data points, because the results can be inspected visually and easily. The implementation parameters were $\tau = 100$, $w_{\min} = 0.2$, $N_{P_{\max}} = 30$, and compression into $K = 10$ subnetworks after every $T = 40$ inputs have been processed. The evolution of the synopsis, limited to a maximum size of 30 nodes, for three noisy clusters, when the input data is presented in the order of the clusters or trends, is shown in Fig. 2, where the synopsis nodes are superimposed on the original data seen so far, at different snapshots corresponding to distinct milestones within the data stream.

[Figure] Fig. 2. Single-pass results on a noisy data stream presented one input at a time in the same order as the trends/clusters: location of synopsis nodes and estimated scales for a noisy stream with three clusters after processing (a) 250 samples, (b) 500 samples, and (c) all 1134 samples.

The three milestones approximate the instants following the presentation of the inputs from the first cluster, then the second cluster, and finally the third cluster. Each pair of crossing vertical and horizontal lines is centered on a node location, and their lengths represent the diameter of this node's influence zone, $IZ_i$ ($3\sigma_{s_i}$). This trend sequencing scenario is the most difficult (worst) case for single-pass learning, as it truly tests the ability of the system to memorize the old patterns, adapt to new patterns, and still work within the constraints of a small synopsis. We emphasize that the final result, after all inputs have been processed, is obtained in a single pass, and yields a small synopsis of only 30 nodes that evolves together with the original input data stream to form a dynamic summary of its distribution. Note how the noise is ignored after a sufficient number of inputs, except in the last snapshot, which shows the synopsis capturing a few noise points that have just been received as input; they are expected to be removed, like all the previous noise points, if the stream were continued.

3. Mining evolving user profiles from noisy Web clickstream data

3.1. Similarity measures used in the learning phase of single-pass mining of clusters in Web data

For many data mining applications, such as clustering text documents and other high-dimensional data sets, the Euclidean distance measure is not appropriate. This is due mainly to the high dimensionality of the problem, and to the fact that two documents may not be considered similar if keywords are missing in both documents.
More appropriate for this application is a distance based on the cosine similarity measure between a data item $x_t$ and a learned synopsis node profile $ps_i$, which, in the simplest case, can both be defined as binary vectors of length $N_I$, the total number of items (URLs or keywords) [12]:

$$S_{\cos}(i,t) = \frac{\sum_{k=1}^{N_I} x_{t,k} \, ps_{i,k}}{\sqrt{\sum_{k=1}^{N_I} ps_{i,k}} \sqrt{\sum_{k=1}^{N_I} x_{t,k}}}. \qquad (4)$$

It is easy to show that the cosine similarity is related to the well-known information retrieval measures of precision and coverage as follows:

$$S_{\cos}(i,t) = \sqrt{Prec^L_{i,t} \, Covg^L_{i,t}}, \qquad (5)$$

where the precision in the learning phase, $Prec^L_{i,t}$, describes the accuracy of the learned synopsis node profile $ps_i$ in representing the data $x_t$, i.e., the ratio of the number of matching items (URLs or terms) between the learned profile and the data (session or document) to the number of items in the learned profile:

$$Prec^L_{i,t} = \frac{\sum_{k=1}^{N_I} x_{t,k} \, ps_{i,k}}{\sum_{k=1}^{N_I} ps_{i,k}}, \qquad (6)$$

while the coverage (also known as recall) in the learning phase, $Covg^L_{i,t}$, describes the completeness of the learned synopsis node profile $ps_i$ in representing the stream data point $x_t$, i.e., the ratio of the number of matching items (URLs or terms) between the learned profile and the data (session or document) to the number of items in the data:

$$Covg^L_{i,t} = \frac{\sum_{k=1}^{N_I} x_{t,k} \, ps_{i,k}}{\sum_{k=1}^{N_I} x_{t,k}}. \qquad (7)$$

In light of (5), we can see that using the cosine similarity as a basis for the distance used to compute the density around each synopsis node, given by the
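For binary vectors, Eqs. (4), (6) and (7) reduce to simple counting, and the identity in Eq. (5) (the cosine is the geometric mean of precision and coverage) can be checked numerically. A short sketch of our own, with made-up vectors:

```python
import numpy as np

def prec_covg_cos(x_t, ps_i):
    """Eqs. (4), (6), (7) for binary item vectors: x_t is the session/document,
    ps_i is the learned profile."""
    match = np.sum(x_t * ps_i)   # number of items shared by data and profile
    prec = match / np.sum(ps_i)  # Eq. (6): fraction of profile items found in the data
    covg = match / np.sum(x_t)   # Eq. (7): fraction of data items covered by the profile
    cos = match / (np.sqrt(np.sum(ps_i)) * np.sqrt(np.sum(x_t)))  # Eq. (4)
    return prec, covg, cos

x_t  = np.array([1, 1, 1, 0, 0, 1])  # session containing items 0, 1, 2, 5
ps_i = np.array([1, 1, 0, 0, 0, 0])  # profile containing items 0, 1
prec, covg, cos = prec_covg_cos(x_t, ps_i)
print(prec, covg, cos)               # 1.0 0.5 0.7071...
assert np.isclose(cos, np.sqrt(prec * covg))  # verifies Eq. (5)
```

Note that this example profile is perfectly precise (everything it contains appears in the session) yet covers only half of the session, which is exactly the tension between precision and coverage discussed in Section 1.1.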