Artificial Intelligence 174 (2010) 215–243
Contents lists available at ScienceDirect
Artificial Intelligence
www.elsevier.com/locate/artint
Hidden semi-Markov models
Shun-Zheng Yu
Department of Electronics and Communication Engineering, Sun Yat-Sen University, Guangzhou 510275, PR China
Article history:
Received 14 April 2009
Available online 17 November 2009
Keywords:
Hidden Markov model (HMM)
Hidden semi-Markov model (HSMM)
Explicit duration HMM
Variable duration HMM
Forward–backward (FB) algorithm
Viterbi algorithm
As an extension to the popular hidden Markov model (HMM), a hidden semi-Markov
model (HSMM) allows the underlying stochastic process to be a semi-Markov chain.
Each state has variable duration and a number of observations being produced while in
the state. This makes it suitable for use in a wider range of applications. Its forward–backward algorithms can be used to estimate/update the model parameters, determine the predicted, filtered and smoothed probabilities, evaluate how well an observation sequence fits the model, and find the best state sequence of the underlying stochastic process. Since the HSMM was initially introduced in 1980 for machine recognition of
speech, it has been applied in thirty scientific and engineering areas, such as speech
recognition/synthesis, human activity recognition/prediction, handwriting recognition,
functional MRI brain mapping, and network anomaly detection. There are about three
hundred papers published in the literature. An overview of HSMMs is presented in this
paper, including modelling, inference, estimation, implementation and applications. It first
provides a unified description of various HSMMs and discusses the general issues behind
them. The boundary conditions of HSMM are extended. Then the conventional models,
including the explicit duration, variable transition, and residential time of HSMM, are
discussed. Various duration distributions and observation models are presented. Finally,
the paper draws an outline of the applications.
© 2009 Elsevier B.V. All rights reserved.
1. Introduction (History)
A hidden Markov model (HMM) is defined as a doubly stochastic process. The underlying stochastic process is a discrete-
time finite-state homogeneous Markov chain. The state sequence is not observable and so is called hidden. It influences
another stochastic process that produces a sequence of observations. An excellent tutorial of HMMs can be found in Rabiner
[150], a theoretic overview of HMMs can be found in Ephraim and Merhav [57] and a discussion on learning and inference
in HMMs from the perspective of Bayesian networks is presented in Ghahramani [66]. HMMs are an important class of models that have been successful in many application areas. However, due to the non-zero probability of self-transition of a non-absorbing state, the state duration of an HMM is implicitly geometrically distributed. This limits the applicability of the HMM in some contexts.
As an extension of the HMM, a hidden semi-Markov model (HSMM) is traditionally defined by allowing the underlying
process to be a semi-Markov chain. Each state has a variable duration, which is associated with the number of observations
produced while in the state. The HSMM is also called “explicit duration HMM” [60,150], “variable-duration HMM” [107,
155,150], “HMM with explicit duration” [124], “hidden semi-Markov model” [126], generalized HMM [94], segmental HMM
[157] and segment model [135,136] in the literature, depending on their assumptions and their application areas.
E-mail address: syu@mail.sysu.edu.cn.
0004-3702/$ – see front matter © 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.artint.2009.11.011
The first approach to the hidden semi-Markov model was proposed by Ferguson [60], and is partially included in the survey paper by Rabiner [150]. This approach is called the explicit duration HMM, in contrast to the implicit duration of
the HMM. It assumes that the state duration is generally distributed depending on the current state of the underlying
semi-Markov process. It also assumes the “conditional independence” of outputs. Levinson [107] replaced the probability
mass functions of duration with continuous probability density functions to form a continuously variable duration HMM. As
Ferguson [60] pointed out, an HSMM can be realized in the HMM framework in which both the state and its sojourn time
since entering the state are taken as a complex HMM state. This idea was exploited in 1991 by a 2-vector HMM [93] and a
duration-dependent state transition model [179]. Since then, similar approaches have been proposed in many applications. They go by different names, such as the inhomogeneous HMM [151], the non-stationary HMM [164], and, more recently, triplet Markov chains [144]. These approaches, however, share a common problem of computational complexity in some applications.
A more efficient algorithm was proposed in 2003 by Yu and Kobayashi [199], in which the forward–backward variables are
defined using the notion of a state together with its remaining sojourn (or residual life) time. This makes the algorithm
practical in many applications.
The HSMM has been successfully applied in many areas. The most successful application is in speech recognition. The
first application of HSMM in this area was made by Ferguson [60]. Since then, there have been more than one hundred such
papers published in the literature. It is the application of HSMM in speech recognition that enriches the theory of HSMM
and develops many algorithms for HSMM.
Since the beginning of the 1990s, the HSMM has been applied in many other areas such as electrocardiography (ECG) [174], printed text recognition [4], handwritten word recognition [95], recognition of human genes in DNA [94], language
identification [118], ground target tracking [88], document image comparison and classification at the spatial layout level
[81], etc.
In recent years, from 2000 to the present, the HSMM has received more and more attention across a wide range of application areas such as change-point/end-point detection for semi-conductor manufacturing [64], protein structure prediction [162],
mobility tracking in cellular networks [197], analysis of branching and flowering patterns in plants [69], rain events time se-
ries model [159], brain functional MRI sequence analysis [58], satellite propagation channel modelling [112], Internet traffic
modelling [198], event recognition in videos [79], speech synthesis [204,125], image segmentation [98], semantic learning
for a mobile robot [167], anomaly detection for network security [201], symbolic plan recognition [54], terrain modelling
[185], adaptive cumulative sum test for change detection in non-invasive mean blood pressure trend [193], equipment
prognosis [14], financial time series modelling [22], remote sensing [147], classification of music [113], and prediction of
particulate matter in the air [52], etc.
The rest of the paper is organized as follows: Section 2 is the major part of this paper; it defines a unified HSMM and addresses important issues related to inference, estimation and implementation. Section 3 then presents three conventional HSMMs that are widely applied in practice. Section 4 discusses specific modelling issues regarding duration distributions, observation distributions, variants of HSMMs, and the relationship to the conventional HMM. Finally, Section 5 highlights major applications of HSMMs, and Section 6 concludes the paper.
2. Hidden semi-Markov model
This section provides a unified description of HSMMs. A general HSMM is defined without specific assumptions on the
state transitions, duration distributions and observation distributions. Then the important issues related to inference, esti-
mation and implementation of the HSMM are discussed. A general expression of the explicit-duration HMMs and segment
HMMs can be found in Murphy [126], and a unified view of the segment HMMs can be found in Ostendorf et al. [136].
Detailed review for the conventional HMM can be found in the tutorial by Rabiner [150], the overview by Ephraim and
Merhav [57], the Bayesian networks-based discussion by Ghahramani [66], and the book by Cappe et al. [29].
2.1. General model
A hidden semi-Markov model (HSMM) is an extension of the HMM obtained by allowing the underlying process to be a semi-Markov chain with a variable duration or sojourn time for each state. Therefore, in addition to the notation defined for the HMM, the duration d of a given state is explicitly defined for the HSMM. The state duration is a random variable and takes an integer value in the set D = {1, 2, . . . , D}. The important difference between the HMM and the HSMM is that one observation per state is assumed in the HMM, while in the HSMM each state can emit a sequence of observations. The number of observations produced while in state i is determined by the length of time spent in state i, i.e., the duration d. Now we provide a unified description of HSMMs.
Assume a discrete-time Markov chain with the set of (hidden) states S = {1, . . . , M}. The state sequence is denoted by S1:T ≜ S1, . . . , ST, where St ∈ S is the state at time t. A realization of S1:T is denoted as s1:T. For simplicity of notation in the following sections, we denote:

• St1:t2 = i – state i that the system stays in during the period from t1 to t2, i.e., St1 = i, St1+1 = i, . . . , St2 = i. Note that the previous state St1−1 and the next state St2+1 may or may not be i.
Fig. 1. General HSMM.
• S[t1:t2] = i – state i which starts at time t1 and ends at t2 with duration d = t2 − t1 + 1. This implies that the previous state St1−1 and the next state St2+1 must not be i.
• S[t1:t2 = i – state i that starts at time t1 and lasts till t2, with S[t1 = i, St1+1 = i, . . . , St2 = i, where S[t1 = i means that at t1 the system switched from some other state to i, i.e., the previous state St1−1 must not be i. The next state St2+1 may or may not be i.
• St1:t2] = i – state i that lasts from t1 to t2 and ends at t2, with St1 = i, St1+1 = i, . . . , St2] = i, where St2] = i means that at time t2 the state will end and transit to some other state, i.e., the next state St2+1 must not be i. The previous state St1−1 may or may not be i.

Based on these definitions, S[t] = i means state i starting and ending at t with duration 1, S[t = i means state i starting at t, St] = i means state i ending at t, and St = i means the state at t being state i.
Denote the observation sequence by O1:T ≜ O1, . . . , OT, where Ot ∈ V is the observable at time t and V = {v1, v2, . . . , vK} is the set of observable values. For observation sequence o1:T, the underlying state sequence is S1:d1] = i1, S[d1+1:d1+d2] = i2, . . . , S[d1+···+dn−1+1:d1+···+dn = in, and the state transitions are (im, dm) → (im+1, dm+1), for m = 1, . . . , n − 1, where Σ_{m=1}^{n} dm = T, i1, . . . , in ∈ S, and d1, . . . , dn ∈ D. Note that the first state i1 does not necessarily start at time 1 in association with the first observation o1, and the last state in does not necessarily end at time T in association with the last observation oT. Detailed discussion of the censoring issues can be found in Section 2.2.1. Define the state transition probability from (i, d′) to (j, d) for i ≠ j by

a(i,d′)(j,d) ≜ P[S[t+1:t+d] = j | S[t−d′+1:t] = i],

subject to Σ_{j∈S\{i}} Σ_{d∈D} a(i,d′)(j,d) = 1, with zero self-transition probabilities a(i,d′)(i,d) = 0, where i, j ∈ S and d, d′ ∈ D. From the definition we can see that the previous state i started at t − d′ + 1 and ended at t, with duration d′. Then it transits to state j having duration d, according to the state transition probability a(i,d′)(j,d). State j will start at t + 1 and end at t + d. This means both the state and the duration are dependent on both the previous state and its duration. While in state j, there will be d observations ot+1:t+d emitted. Denote this emission probability by

bj,d(ot+1:t+d) ≜ P[ot+1:t+d | S[t+1:t+d] = j],

which is assumed to be independent of time t. Let the initial distribution of the state be

πj,d ≜ P[S[t−d+1:t] = j],   t ≤ 0, d ∈ D.

It represents the probability of the initial state and its duration before time t = 1 or before the first observation o1 is obtained. Then the set of the model parameters for the HSMM is defined by

λ ≜ {a(i,d′)(j,d), bj,d(vk1:kd), πi,d},

where i, j ∈ S, d, d′ ∈ D, and vk1:kd represents vk1 . . . vkd ∈ V × ··· × V. This general HSMM is shown in Fig. 1.
The general HSMM reduces to specific models of the HSMM depending on the assumptions made. For instance:
• If the state duration is assumed to be independent of the previous state, then the state transition probability can be further specified as a(i,d′)(j,d) = a(i,d′)j · pj(d), where

a(i,d′)j ≜ P[S[t+1 = j | S[t−d′+1:t] = i]    (1)

is the transition probability from state i, which has stayed for duration d′, to state j, which will start at t + 1, and

pj(d) ≜ P[St+1:t+d] = j | S[t+1 = j]    (2)

is the probability of duration d that state j takes. This is the model proposed by Marhasev et al. [119].

• If a state transition is assumed to be independent of the duration of the previous state, then the state transition probability becomes a(i,d′)(j,d) = ai(j,d), where

ai(j,d) ≜ P[S[t+1:t+d] = j | St] = i]    (3)

is the transition probability that state i ended at t and transits to state j having duration d. This is the residential time HMM (see Section 3.3 for details). In this model, a state transition for i ≠ j is (i, 1) → (j, τ) and a self-transition is assumed to be (i, τ) → (i, τ − 1) for τ > 1, where τ represents the residential time of the state.

• If a self-transition is allowed and is assumed to be independent of the previous state, then the state transition probability becomes

a(i,d′)(j,d) = a(i,d′)j · [∏_{τ=1}^{d−1} ajj(τ)] · (1 − ajj(d)),

where ajj(d) ≜ P[St+d+1 = j | S[t−d′+1:t] = i, S[t+1:t+d = j] = P[St+d+1 = j | S[t+1:t+d = j] is the self-transition probability when state j has stayed for d time units, and 1 − ajj(d) = P[St+d] = j | S[t+1:t+d = j] is the probability that state j ends with duration d. This is the variable transition HMM (see Section 3.2 for details). In this model, a state transition is either (i, d) → (j, 1) for i ≠ j or (i, d) → (i, d + 1) for a self-transition.

• If a transition to the current state is independent of the duration of the previous state, and the duration is only conditioned on the current state, then a(i,d′)(j,d) = aij · pj(d), where aij ≜ P[S[t+1 = j | St] = i] is the transition probability from state i to j, with the self-transition probability aii = 0. This is the explicit duration HMM (see Section 3.1 for details).
Besides, the state duration distributions pj(d) can be parametric or non-parametric. A detailed discussion of various duration distributions can be found in Section 4.1. Similarly, the observation distributions bj,d(vk1:kd) can be parametric or non-parametric, discrete or continuous, and dependent on or independent of the state durations. They can also be mixtures of distributions. A detailed discussion of various observation distributions can be found in Section 4.2.
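To make the last factorization concrete, the following toy sketch (Python/NumPy; the two-state model and all numbers are hypothetical) assembles the explicit duration HMM transition tensor a(i,d′)(j,d) = aij pj(d) and checks the normalization constraint required of the general model:

```python
import numpy as np

# Hypothetical toy model: M = 2 hidden states, maximum duration D = 3.
M, D = 2, 3

# Explicit duration HMM assumption: a_(i,d')(j,d) = a_ij * p_j(d),
# with zero self-transition probabilities a_ii = 0.
a = np.array([[0.0, 1.0],
              [1.0, 0.0]])          # a_ij; each row sums to 1 over j != i
p = np.array([[0.2, 0.5, 0.3],      # p_j(d) for d = 1..D, one row per state
              [0.6, 0.3, 0.1]])

# Assemble the general transition tensor a_(i,d')(j,d); under this
# assumption it does not depend on the previous duration d'.
A = np.zeros((M, D, M, D))
for i in range(M):
    for dp in range(D):             # dp indexes the previous duration d'
        for j in range(M):
            if j != i:
                A[i, dp, j, :] = a[i, j] * p[j, :]

# Normalization required of the general model:
# the sum over j in S\{i} and d in D of a_(i,d')(j,d) equals 1.
for i in range(M):
    for dp in range(D):
        assert np.isclose(A[i, dp].sum(), 1.0)
```

The same tensor shape accommodates the other special cases above; only the way its entries are filled changes.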
2.2. Inference
In this subsection we discuss the issues related to inference, including the forward–backward algorithm, calculation of
probabilities and expectations, maximum a posteriori (MAP) estimate of states, maximum likelihood estimate (MLE) of state
sequence, and constrained estimate of states.
2.2.1. The forward–backward algorithm
We define the forward variables for the HSMM by

αt(j, d) ≜ P[S[t−d+1:t] = j, o1:t | λ]

and the backward variables by

βt(j, d) ≜ P[ot+1:T | S[t−d+1:t] = j, λ].

Similar to deriving the formulas for the HMM (see e.g., Rabiner [150], Ephraim and Merhav [57]), it is easy to obtain the forward–backward algorithm for a general HSMM:

αt(j, d) = Σ_{i∈S\{j}} Σ_{d′∈D} αt−d(i, d′) · a(i,d′)(j,d) · bj,d(ot−d+1:t),    (4)

for t > 0, d ∈ D, j ∈ S, and

βt(j, d) = Σ_{i∈S\{j}} Σ_{d′∈D} a(j,d)(i,d′) · bi,d′(ot+1:t+d′) · βt+d′(i, d′),    (5)

for t < T.
The initial conditions can generally be given under two different assumptions:

• The general assumption: assumes that the first state begins at or before observation o1 and the last state ends at or after observation oT. In this case, we can assume that the process starts at −∞ and terminates at +∞. The observations outside the sampling period [1, T] can take any possible values, i.e., bj,d(·) = 1 for any j ∈ S, d ∈ D. Therefore, in the forward formula (4), bj,d(ot−d+1:t) is replaced with the marginal distribution bj,d(o1:t) if t − d + 1 ≤ 1 and t ≥ 1, and in the backward formula (5), bi,d′(ot+1:t+d′) is replaced with bi,d′(ot+1:T) if t + 1 ≤ T and t + d′ ≥ T. We then have the initial conditions for the forward recursion formula given by (4) as follows:

ατ(j, d) = P[S[τ−d+1:τ] = j | λ] = πj,d,   τ ≤ 0, d ∈ D,

where {πj,d} can be the equilibrium distribution of the underlying semi-Markov process. Because, for t + d′ ≥ T,

P[S[t+1:t+d′] = i, ot+1:T | S[t−d+1:t] = j, λ] = a(j,d)(i,d′) · bi,d′(ot+1:T),

from the backward recursion formula (5) we can see that βt+d′(i, d′) = 1, for t + d′ ≥ T. Therefore, the initial conditions for the backward recursion formula given by (5) are as follows:

βτ(i, d) = 1,   τ ≥ T, d ∈ D.

If the model assumes that the first state begins at t = 1 and the last state ends at or after observation oT, it is a right-censored HSMM introduced by Guedon [70]. Because this is desirable for many applications, it is taken as a basis for an R package for analyzing HSMMs [23].
• The simplifying assumption: assumes that the first state begins at time 1 and the last state ends at time T. This is the most popular assumption one can find in the literature. In this case, the initial conditions for the forward recursion formula given by (4) are:

α0(j, d) = πj,d,   d ∈ D,
ατ(j, d) = 0,   τ < 0, d ∈ D,

and the initial conditions for the backward recursion formula given by (5) are:

βT(i, d) = 1,   d ∈ D,
βτ(i, d) = 0,   τ > T, d ∈ D.

Note that the initial distribution of states can be assumed as π′j,d ≜ P[S[1:d] = j | λ], which obviously equals Σ_{i,d′} πi,d′ · a(i,d′)(j,d). Therefore, the initial conditions for the forward recursion formula can also be αd(j, d) = π′j,d · bj,d(o1:d), for d ∈ D.
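Under the simplifying assumption, the forward recursion (4) can be sketched as follows. This is a toy implementation (hypothetical numbers throughout); for concreteness it takes the observations within a segment to be conditionally independent, so the segment emission bj,d(ot−d+1:t) becomes a product of per-symbol probabilities:

```python
import numpy as np

def hsmm_forward(A, B, pi, obs):
    """Forward recursion (4) under the simplifying boundary assumption
    (first state starts at t = 1, last state ends at t = T).

    A   -- (M, D, M, D) transition tensor a_(i,d')(j,d), zero self-transitions
    B   -- (M, K) per-symbol emissions b_j(v_k); in-segment observations are
           taken as conditionally independent (a common simplification)
    pi  -- (M, D) initial distribution pi_{j,d}
    obs -- observation symbols o_1..o_T as integer indices
    """
    M, D = pi.shape
    T = len(obs)
    alpha = np.zeros((T + 1, M, D))   # alpha[t, j, d-1] = alpha_t(j, d)
    alpha[0] = pi                     # alpha_0(j, d) = pi_{j,d}; zero for tau < 0
    for t in range(1, T + 1):
        for j in range(M):
            for d in range(1, min(D, t) + 1):
                # segment emission b_{j,d}(o_{t-d+1:t})
                b_seg = np.prod(B[j, obs[t - d:t]])
                # sum over the previous state i != j and its duration d'
                s = sum(alpha[t - d, i, dp] * A[i, dp, j, d - 1]
                        for i in range(M) if i != j
                        for dp in range(D))
                alpha[t, j, d - 1] = s * b_seg
    return alpha

# Toy model (hypothetical numbers): M = 2 states, D = 2, K = 2 symbols,
# explicit duration structure a_(i,d')(j,d) = a_ij * p_j(d).
a = np.array([[0.0, 1.0], [1.0, 0.0]])
p = np.array([[0.4, 0.6], [0.7, 0.3]])
A = np.einsum('ij,jd->ijd', a, p)[:, None, :, :] * np.ones((2, 2, 1, 1))
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.full((2, 2), 0.25)
obs = np.array([0, 1, 1])
alpha = hsmm_forward(A, B, pi, obs)
L = alpha[len(obs)].sum()             # P[o_{1:T} | lambda]
```

Note that the entry αd(j, d) computed by this loop automatically equals π′j,d · bj,d(o1:d), matching the alternative initial condition above.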
2.2.2. Probabilities and expectations
After the forward variables {αt(j, d)} and the backward variables {βt(j, d)} are determined, all other probabilities of interest can be computed. For instance, the filtered probability that state j started at t − d + 1 and ends at t, with duration d, given the partial observed sequence o1:t, can be determined by

P[S[t−d+1:t] = j | o1:t, λ] = αt(j, d) / Σ_{j,d} αt(j, d),

and the predicted probability that state j will start at t + 1 and end at t + d, with duration d, given the partial observed sequence o1:t, by

P[S[t+1:t+d] = j | o1:t, λ] = Σ_{i≠j,d′} αt(i, d′) a(i,d′)(j,d) / Σ_{i,d′} αt(i, d′).

These readily yield the filtered probability of state j ending at t, P[St] = j | o1:t, λ] = Σ_d P[S[t−d+1:t] = j | o1:t, λ], and the predicted probability of state j starting at t + 1, P[S[t+1 = j | o1:t, λ] = Σ_d P[S[t+1:t+d] = j | o1:t, λ].

The posterior probabilities P[St = j | o1:T, λ], P[St = i, St+1 = j | o1:T, λ] and P[S[t−d+1:t] = j | o1:T, λ] for a given entire observation sequence o1:T can be determined by the following equations:

ηt(j, d) ≜ P[S[t−d+1:t] = j, o1:T | λ] = αt(j, d)βt(j, d),    (6)

ξt(i, d′; j, d) ≜ P[S[t−d′+1:t] = i, S[t+1:t+d] = j, o1:T | λ] = αt(i, d′) a(i,d′)(j,d) bj,d(ot+1:t+d) βt+d(j, d),    (7)

ξt(i, j) ≜ P[St] = i, S[t+1 = j, o1:T | λ] = Σ_{d′∈D} Σ_{d∈D} ξt(i, d′; j, d),

γt(j) ≜ P[St = j, o1:T | λ] = Σ_{τ≥t} Σ_{d=τ−t+1}^{D} ητ(j, d),    (8)
and

P[o1:T | λ] = Σ_{j∈S} P[St = j, o1:T | λ] = Σ_{j∈S} γt(j),

for d, d′ ∈ D, j ∈ S, i ∈ S\{j} and t = 1, . . . , T, where ηt(j, d)/P[o1:T | λ] represents the probability of being in state j having duration d by time t, given the model and the observation sequence; ξt(i, d′; j, d)/P[o1:T | λ] the probability of a transition at time t from state i, occurred with duration d′, to state j having duration d, given the model and the observation sequence; ξt(i, j)/P[o1:T | λ] the probability of a transition at time t from state i to state j, given the model and the observation sequence; γt(j)/P[o1:T | λ] the probability of state j at time t, given the model and the observation sequence; and P[o1:T | λ] the probability that the observed sequence o1:T is generated by the model λ. Obviously, the conditional factor P[o1:T | λ] is common to all the posterior probabilities and is eliminated when the posterior probabilities are used in parameter estimation. Therefore, it is often omitted for simplicity in the literature. Similarly, in the rest of this paper, we sometimes will not explicitly mention this conditional factor in calculating the posterior probabilities by ηt(j, d), ξt(i, d′; j, d), ξt(i, j), and γt(j).
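A minimal sketch of these posterior quantities for a toy explicit duration HSMM (hypothetical numbers; in-segment observations taken as conditionally independent, simplifying boundary assumption). It verifies numerically that Σj γt(j) equals P[o1:T | λ] at every t, as stated above:

```python
import numpy as np

M, D, T = 2, 2, 4
a = np.array([[0.0, 1.0], [1.0, 0.0]])          # a_ij
p = np.array([[0.4, 0.6], [0.7, 0.3]])          # p_j(d)
B = np.array([[0.9, 0.1], [0.2, 0.8]])          # b_j(v_k)
pi = np.full((M, D), 0.25)                      # pi_{j,d}
obs = np.array([0, 1, 1, 0])

def seg_emit(j, t0, t1):                        # b_{j,d}(o_{t0+1:t1}), 0-based
    return np.prod(B[j, obs[t0:t1]])

# forward (4): alpha[t, j, d-1] = alpha_t(j, d)
alpha = np.zeros((T + 1, M, D)); alpha[0] = pi
for t in range(1, T + 1):
    for j in range(M):
        for d in range(1, min(D, t) + 1):
            s = sum(alpha[t - d, i, dp] * a[i, j] * p[j, d - 1]
                    for i in range(M) if i != j for dp in range(D))
            alpha[t, j, d - 1] = s * seg_emit(j, t - d, t)

# backward (5): beta[t, j, d-1] = beta_t(j, d); beta_T = 1
beta = np.zeros((T + 1, M, D)); beta[T] = 1.0
for t in range(T - 1, -1, -1):
    for j in range(M):
        for d in range(D):
            beta[t, j, d] = sum(
                a[j, i] * p[i, dp - 1] * seg_emit(i, t, t + dp)
                * beta[t + dp, i, dp - 1]
                for i in range(M) if i != j
                for dp in range(1, min(D, T - t) + 1))

eta = alpha * beta                               # eta_t(j, d), Eq. (6)

# gamma_t(j) via Eq. (8): sum over segments of state j covering time t
gamma = np.zeros((T + 1, M))
for t in range(1, T + 1):
    for j in range(M):
        gamma[t, j] = sum(eta[tau, j, d - 1]
                          for tau in range(t, T + 1)
                          for d in range(tau - t + 1, D + 1))

L = alpha[T].sum()                               # P[o_{1:T} | lambda]
# consistency check: sum_j gamma_t(j) = P[o_{1:T} | lambda] for every t
assert np.allclose(gamma[1:].sum(axis=1), L)
```

Dividing η, ξ or γ by L then yields the normalized posterior probabilities discussed above.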
Considering the following identities,

P[St:t+1 = j, o1:T | λ] = P[St = j, o1:T | λ] − P[St] = j, o1:T | λ],
P[St:t+1 = j, o1:T | λ] = P[St+1 = j, o1:T | λ] − P[S[t+1 = j, o1:T | λ],

we have a recursive formula for calculating γt(j):

γt(j) = γt+1(j) + P[St] = j, o1:T | λ] − P[S[t+1 = j, o1:T | λ] = γt+1(j) + Σ_{i∈S\{j}} [ξt(j, i) − ξt(i, j)].    (9)
Denote P[o1:T | λ] by L in the following expressions. Then, using the forward and backward variables, one can compute various expectations [60]:

(a) The expected number of times state i ends before t: (1/L) Σ_{t′≤t−1} Σ_{j∈S\{i}} ξt′(i, j); the expected number of times state i starts at t or before: (1/L) Σ_{t′≤t} Σ_{j∈S\{i}} ξt′(j, i).
(b) Expected total duration spent in state i: (1/L) Σ_t γt(i).
(c) Expected number of times that state i occurred with observation ot = vk: (1/L) Σ_t γt(i) I(ot = vk), where the indicator function I(x) = 1 if x is true and zero otherwise.
(d) Estimated average observable value of state i: Σ_t γt(i) ot / Σ_t γt(i).
(e) Probability that state i was the first state: (1/L) γ1(i).
(f) Expected total number of times state i commenced: (1/L) Σ_t Σ_{j∈S\{i}} ξt(j, i), or terminated: (1/L) Σ_t Σ_{j∈S\{i}} ξt(i, j). For the simplifying assumption for the boundary conditions described in the last subsection, we have Σ_{t=0}^{T−1} Σ_{j∈S\{i}} ξt(j, i) = Σ_{t=1}^{T} Σ_{j∈S\{i}} ξt(i, j).
(g) Estimated average duration of state i: Σ_t Σ_d ηt(i, d) · d / Σ_t Σ_d ηt(i, d).
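Item (g), for instance, is a one-liner once the ηt(i, d) array is available from a forward–backward pass; the sketch below wraps it in a function and exercises it on a made-up η (purely illustrative numbers):

```python
import numpy as np

def average_duration(eta):
    """Item (g): dbar_i = sum_t sum_d eta_t(i,d)*d / sum_t sum_d eta_t(i,d).

    eta is any array with eta[t, i, d-1] = eta_t(i, d), e.g. produced by
    the forward-backward algorithm."""
    T1, M, D = eta.shape
    d = np.arange(1, D + 1)                    # duration values 1..D
    num = (eta * d).sum(axis=(0, 2))           # sum_t sum_d eta_t(i,d) * d
    den = eta.sum(axis=(0, 2))                 # sum_t sum_d eta_t(i,d)
    return num / den

# Hypothetical eta values for 2 states, D = 3, purely for illustration.
eta = np.zeros((3, 2, 3))
eta[1, 0, 0] = 0.2      # weight on duration 1 for state 0
eta[2, 0, 2] = 0.2      # equal weight on duration 3 -> average 2.0
eta[1, 1, 1] = 0.5      # all weight on duration 2 for state 1
dbar = average_duration(eta)
```

The other expectations (a)–(f) are analogous sums over γ and ξ.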
2.2.3. MAP and MLE estimate of states

The maximum a posteriori (MAP) estimate of state St given a specific observation sequence o1:T can be obtained [60] by maximizing γt(j) given by (8), i.e.,

ŝt = arg max_{i∈S} γt(i).

If we choose ηt(i, d) of (6), instead of γt(i), as the MAP criterion, we obtain the joint MAP estimate of the state that ends at time t and the duration of this state, when a specific sequence o1:T is observed:

(ŝt, d̂t) = arg max_{(i,d)} ηt(i, d).    (10)
Viterbi algorithms are the most popular dynamic programming algorithms for the maximum likelihood estimate (MLE) of the state sequence of HMMs. Similar algorithms exist for the HSMM [115,151,35,26]. Define the forward variable for the extended Viterbi algorithm by

δt(j, d) ≜ max_{s1:t−d} P[s1:t−d, S[t−d+1:t] = j, o1:t | λ] = max_{i∈S\{j}, d′∈D} {δt−d(i, d′) a(i,d′)(j,d) bj,d(ot−d+1:t)},    (11)

for 1 ≤ t ≤ T, j ∈ S, d ∈ D, where δt(j, d) represents the maximum likelihood that the partial state sequence ends at t in state j of duration d. Record the previous state that δt(j, d) selects by Ψ(t, j, d) ≜ (t − d, i∗, d′∗), where i∗ is the previous state survived, d′∗ its duration, and (t − d) its ending time. Ψ(t, j, d) is determined by letting

(i∗, d′∗) = arg max_{i∈S\{j}, d′∈D} {δt−d(i, d′) a(i,d′)(j,d) bj,d(ot−d+1:t)}.
Now the maximum likelihood state sequence can be determined by finding the last state that maximizes the likelihood. For the general assumption of the boundary conditions (Section 2.2.1), the last ML state is

(t1, j∗1, d∗1) = arg max_{t≥T, i∈S, d≥t−T+1, d∈D} δt(i, d),

or, for the simplifying assumption of the boundary conditions, t1 = T and

(j∗1, d∗1) = arg max_{i∈S, d∈D} δT(i, d).

Trace back the state sequence by letting

(t2, j∗2, d∗2) = Ψ(t1, j∗1, d∗1),
. . .
(tn, j∗n, d∗n) = Ψ(tn−1, j∗n−1, d∗n−1),

until the first state S1 is determined, where S1 = j∗n and (j∗n, d∗n), . . . , (j∗1, d∗1) is the maximum likelihood state sequence.
If the state duration density function is log-convex parametric, which is fulfilled by the commonly used parametric functions, Bonafonte et al. [17] empirically showed that the computational complexity can be reduced to about 3.2 times that of the conventional HMM. If the model is a left–right HSMM, or the particular state sequence i1, . . . , in is given, then only the optimal segmentation of state durations needs to be determined. This is accomplished by simply rewriting (11) as [109,110]

δt(im, d) = max_{d′∈D} {δt−d(im−1, d′) a(im−1,d′)(im,d) bim,d(ot−d+1:t)},

for 1 ≤ m ≤ n, 1 ≤ t ≤ T, d ∈ D.
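Under the simplifying boundary assumption, the recursion (11) together with the backtracking through Ψ can be sketched as follows (toy explicit duration model; all numbers hypothetical, in-segment observations conditionally independent):

```python
import numpy as np

M, D = 2, 3
a = np.array([[0.0, 1.0], [1.0, 0.0]])          # a_ij
p = np.array([[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]])  # p_j(d)
B = np.array([[0.9, 0.1], [0.2, 0.8]])          # b_j(v_k)
pi = np.full((M, D), 1.0 / (M * D))             # pi_{j,d}
obs = np.array([0, 0, 1, 1, 0])
T = len(obs)

# delta[t, j, d-1] = delta_t(j, d); psi records the surviving predecessor
delta = np.zeros((T + 1, M, D)); delta[0] = pi
psi = {}                                   # psi[(t, j, d)] = (t - d, i*, d'*)
for t in range(1, T + 1):
    for j in range(M):
        for d in range(1, min(D, t) + 1):
            b_seg = np.prod(B[j, obs[t - d:t]])
            best, arg = 0.0, None
            for i in range(M):
                if i == j:
                    continue
                for dp in range(D):
                    v = delta[t - d, i, dp] * a[i, j] * p[j, d - 1] * b_seg
                    if v > best:
                        best, arg = v, (t - d, i, dp + 1)
            delta[t, j, d - 1] = best
            psi[(t, j, d)] = arg

# last ML state under the simplifying assumption: argmax of delta_T(i, d)
j, d = np.unravel_index(int(np.argmax(delta[T])), (M, D))
t, j, d = T, int(j), int(d) + 1
path = []                                  # decoded [(state, duration), ...]
while t > 0:
    path.append((j, d))
    t, j, d = psi[(t, j, d)]
path.reverse()

assert sum(dur for _, dur in path) == T            # durations tile [1, T]
assert all(path[m][0] != path[m + 1][0] for m in range(len(path) - 1))
```

The decoded `path` is the segment representation (j∗n, d∗n), . . . , (j∗1, d∗1) in forward order.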
2.2.4. Constrained estimate of states
As discussed in the previous subsections, the posterior probabilities P[St = j|o1:T , λ], P[St = i, St+1 = j|o1:T , λ] and
P[S[t−d+1:t] = j|o1:T , λ] for given entire observation sequence o1:T are determined by the forward–backward algorithm, and
are used for the computation of various expectations and the MAP estimation of states. These posterior probabilities can be
interpreted as the probabilities that the path taken (a random variable) passes through the constraint states for the given
observation sequence o1:T. For example, P[S[t−d+1:t] = j, o1:T | λ] = αt(j, d)βt(j, d) accounts for all the paths that pass through the constraint state j during the constraint period from t − d + 1 to t, where αt(j, d) is given by (4) and βt(j, d) by (5). This is useful for the confidence calculation in the state estimation. The confidence can be simply defined as

max_{j,d} αt(j, d)βt(j, d) / Σ_{j′,d′} αt(j′, d′)βt(j′, d′),

where arg max_{j,d} αt(j, d)βt(j, d) is used for the MAP estimate of the state as given by (10). Calculating the confidence over every t, one can find out in practice when the errors are most likely to occur in the state estimation.
Now we compute the posterior probability P[st+1:t+k | o1:T, λ] corresponding to a segment of k observations. It is expected to be useful in some applications. For example, this probability can be used to estimate the confidence of an individual field of words for information extraction [41]. It can also be used for computing some expectations, which can be used for estimating the states corresponding to the segment of k observations.

If one assumes that the first state of the subsequence st+1:t+k starts at t + 1 and the last state ends at t + k, i.e., st+1:t+k = (j1, d1) . . . (jn, dn), s.t. d1 + ··· + dn = k, it is easy to compute the posterior probability by

P[st+1:t+k, o1:T | λ] = αt+d1(j1, d1) · [∏_{m=2}^{n} a(jm−1,dm−1)(jm,dm) bjm,dm(oτm−1+1:τm)] · βt+k(jn, dn),

where τm = t + d1 + ··· + dm. When n = 1 the equation reduces to (6). If we relax the condition that the first state of the subsequence starts at t + 1, or the condition that the last state of the subsequence ends at t + k, the computation can be done by allowing the first state duration d ≥ d1 and the last state duration d ≥ dn.

A simple way of computing the posterior probability of a segment is to modify the forward–backward algorithm to conform to the constraints. Similar to the constrained forward–backward algorithm for a CRF (conditional random field) proposed by Culotta and McCallum [41], the forward–backward formulas given by (4) and (5) for the HSMM can be modified. Denote the modified (constrained) forward and backward variables by α̃ and β̃. Let α̃t′(j, d) = 0 and β̃t′(j, d) = 0 when t + 1 ≤ t′ ≤ t + k and j ≠ st′, where st′ ∈ st+1:t+k is a constraint that each path must pass through and st+1:t+k is the constraint subpath. If we further constrain that the first state of the constraint subpath starts at t + 1, we must let α̃t′(j, d) = 0 and β̃t′(j, d) = 0 when t + 1 ≤ t′ ≤ t + k and t′ − d + 1 < t + 1. Similarly, if we constrain that the last state of the constraint subpath ends at t + k, we let α̃t′(j, d) = 0 and β̃t′(j, d) = 0 when t + 1 ≤ t′ − d + 1 ≤ t + k and t′ > t + k. Using this modified forward recursion, we obtain α̃T(j, d). Let L̃ ≜ P[st+1:t+k, o1:T | λ] = Σ_{j∈S} Σ_{d∈D} α̃T(j, d) be the probability of the constrained lattice, the set of all paths that conform to the constraints st+1:t+k. Then the posterior probability is yielded by L̃/L, where L = P[o1:T | λ]. Obviously, the probabilities given by (6) to (8) can be computed using the modified forward recursion as well.
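A sketch of such a constrained forward recursion for a toy explicit duration model (hypothetical numbers). Zeroing entries can only remove probability mass, so the constrained likelihood never exceeds the unconstrained one:

```python
import numpy as np

M, D = 2, 2
a = np.array([[0.0, 1.0], [1.0, 0.0]])      # a_ij
p = np.array([[0.4, 0.6], [0.7, 0.3]])      # p_j(d)
B = np.array([[0.9, 0.1], [0.2, 0.8]])      # b_j(v_k)
pi = np.full((M, D), 0.25)                  # pi_{j,d}
obs = np.array([0, 1, 1, 0])
T = len(obs)

def forward(constraint=None):
    """constraint: dict {t': required state} over a window, or None.
    Entries whose endpoint t' lies in the window with the wrong state
    are zeroed, as in the constrained forward-backward modification."""
    alpha = np.zeros((T + 1, M, D)); alpha[0] = pi
    for t in range(1, T + 1):
        for j in range(M):
            if constraint and t in constraint and constraint[t] != j:
                continue                    # alpha_t(j, d) = 0 for all d
            for d in range(1, min(D, t) + 1):
                b_seg = np.prod(B[j, obs[t - d:t]])
                alpha[t, j, d - 1] = b_seg * sum(
                    alpha[t - d, i, dp] * a[i, j] * p[j, d - 1]
                    for i in range(M) if i != j for dp in range(D))
    return alpha

L = forward()[T].sum()                      # unconstrained P[o_{1:T} | lambda]
Lc = forward({2: 1, 3: 1})[T].sum()         # paths in state 1 at t' = 2, 3
assert 0.0 < Lc <= L + 1e-12
```

The ratio `Lc / L` then gives the posterior probability of the constrained event, per the discussion above.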
2.3. Estimation
In the preceding problems, such as the forward–backward algorithm and the MAP and ML estimation of states, we assumed that the set of model parameters λ is given. If λ is unknown, we need to learn λ from the observations o1:T: λ is initially estimated and then re-estimated so that the likelihood function L(λ) = P[o1:T | λ] increases and converges to a maximum. If the system is slowly varying (i.e., non-stationary), the model parameters λ may need to be updated adaptively. Such a training and updating process is referred to as parameter re-estimation.
2.3.1. Parameter estimate of HSMM
For the parameter estimation/re-estimation problem, there is no known analytical method to find the λ that maximizes
the likelihood function. Thus, some iterative procedure must be employed.
The model parameters λ can be re-estimated using the expectations. For instance,

1) the initial distribution π̂j,d can be updated by ηt(j, d)/Σ_{j,d} ηt(j, d) for t ≤ 0,
2) the transition probabilities â(i,d′)(j,d) by Σ_t ξt(i, d′; j, d)/Σ_{j≠i,d} Σ_t ξt(i, d′; j, d), and
3) the observation probabilities b̂j,d(vk1:kd) by Σ_t [ηt(j, d) · I(ot+1:t+d = vk1:kd)]/Σ_t ηt(j, d), where I(ot+1:t+d = vk1:kd) = 1 if ot+1 = vk1, . . . , ot+d = vkd and zero otherwise.

Besides these parameters for the general HSMM, parameters for other HSMMs can be estimated as well, such as

i) the transition probabilities âij by Σ_t ξt(i, j)/Σ_{j≠i} Σ_t ξt(i, j),
ii) the duration probabilities p̂j(d) of state j by Σ_t ηt(j, d)/Σ_d Σ_t ηt(j, d),
iii) the observation probabilities b̂j(vk) by Σ_t [γt(j) · I(ot = vk)]/Σ_t γt(j), and
iv) the initial distribution π̂j by γ0(j)/Σ_j γ0(j).

These probability mass functions or probability density functions satisfy:

Σ_{j≠i,d} â(i,d′)(j,d) = 1,   Σ_{j,d} π̂j,d = 1,   Σ_d p̂j(d) = 1,   Σ_{j≠i} âij = 1,   Σ_{vk1,...,vkd} b̂j,d(vk1:kd) = 1, and Σ_{vk} b̂j(vk) = 1.

The re-estimation procedure:
a) Assume an initial model parameter set λ0;
b) For the given model parameter set λk, use the forward–backward formulas (4) and (5) to compute the forward and backward variables {αt(j, d)} and {βt(j, d)}. Then use the forward and backward variables to compute the related probabilities ηt(j, d), ξt(i, d′; j, d), ξt(i, j) and γt(j) by (6) through (9). Finally, re-estimate the model parameters to get λ̂k+1;
c) Let λk+1 = λ̂k+1, k ← k + 1, and go back to step b);
d) Repeat b) and c) until the likelihood L(λk) = P[o1:T | λk] converges to a fixed point.
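The procedure a)–d) can be sketched end-to-end for a toy explicit duration HMM (hypothetical numbers; in-segment observations conditionally independent, simplifying boundary assumption, and the initial distribution π held fixed for brevity, which preserves the monotone-likelihood property of EM):

```python
import numpy as np

M, D, K = 2, 2, 2
obs = np.array([0, 0, 1, 1, 0, 1])
T = len(obs)
pi = np.full((M, D), 0.25)                  # pi_{j,d}, kept fixed here

def forward_backward(a, p, B):
    seg = lambda j, t0, t1: np.prod(B[j, obs[t0:t1]])
    alpha = np.zeros((T + 1, M, D)); alpha[0] = pi
    for t in range(1, T + 1):
        for j in range(M):
            for d in range(1, min(D, t) + 1):
                alpha[t, j, d - 1] = seg(j, t - d, t) * sum(
                    alpha[t - d, i, dp] * a[i, j] * p[j, d - 1]
                    for i in range(M) if i != j for dp in range(D))
    beta = np.zeros((T + 1, M, D)); beta[T] = 1.0
    for t in range(T - 1, -1, -1):
        for j in range(M):
            for d in range(D):
                beta[t, j, d] = sum(
                    a[j, i] * p[i, dp - 1] * seg(i, t, t + dp)
                    * beta[t + dp, i, dp - 1]
                    for i in range(M) if i != j
                    for dp in range(1, min(D, T - t) + 1))
    return alpha, beta

def reestimate(a, p, B):
    """One pass of steps b): E-step via forward-backward, then M-step."""
    alpha, beta = forward_backward(a, p, B)
    seg = lambda j, t0, t1: np.prod(B[j, obs[t0:t1]])
    eta = alpha * beta                       # eta_t(j, d)
    # xi_t(i, j): state i ends at t, state j starts at t + 1 (t = 0..T-1)
    xi = np.zeros((T, M, M))
    for t in range(T):
        for i in range(M):
            for j in range(M):
                if i != j:
                    xi[t, i, j] = alpha[t, i].sum() * a[i, j] * sum(
                        p[j, d - 1] * seg(j, t, t + d) * beta[t + d, j, d - 1]
                        for d in range(1, min(D, T - t) + 1))
    # gamma_t(j) by Eq. (8)
    gamma = np.zeros((T + 1, M))
    for t in range(1, T + 1):
        gamma[t] = [sum(eta[tau, j, d - 1] for tau in range(t, T + 1)
                        for d in range(tau - t + 1, D + 1)) for j in range(M)]
    # M-step: normalized expected counts (items 2), ii), iii) above)
    a_new = xi.sum(axis=0); a_new /= a_new.sum(axis=1, keepdims=True)
    p_new = eta[1:].sum(axis=0); p_new /= p_new.sum(axis=1, keepdims=True)
    B_new = np.zeros((M, K))
    for k in range(K):
        B_new[:, k] = gamma[1:][obs == k].sum(axis=0)
    B_new /= B_new.sum(axis=1, keepdims=True)
    return a_new, p_new, B_new, alpha[T].sum()   # likelihood L(lambda_k)

# steps a)-d): initial guess, then iterate until (here, for 5 passes)
a = np.array([[0.0, 1.0], [1.0, 0.0]])
p = np.array([[0.5, 0.5], [0.5, 0.5]])
B = np.array([[0.6, 0.4], [0.3, 0.7]])
likelihoods = []
for _ in range(5):
    a, p, B, L = reestimate(a, p, B)
    likelihoods.append(L)
# EM property: the likelihood is non-decreasing over iterations
assert all(l2 >= l1 * (1 - 1e-9) for l1, l2 in zip(likelihoods, likelihoods[1:]))
```

In practice, the loop terminates when the likelihood improvement falls below a tolerance rather than after a fixed number of passes.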
2.3.2. Order estimate of HSMM
In the re-estimation algorithms discussed above, the number of hidden states, M, the maximum length of state duration, D, the number of observable values, K, and the length of the observation sequence, T, are usually assumed known in the context of applications. However, the learning issues that arise when the order of an HSMM is unknown are sometimes particularly important in practice. A detailed discussion on the order estimation of HMMs can be found in Ephraim and Merhav [57, Section VIII], and the issues of overfitting and model selection in Ghahramani [66, Section 7]. However, the order estimation of HSMMs is somewhat different from that of HMMs, because HSMMs have variable durations. Therefore, for an HSMM we must estimate both the number of states, M, and the maximum length of state durations, D.
In fact, some special HSMMs can be described by a dynamic Bayesian network (DBN) using a directed graphical model. For simplicity, one usually assumes the observations are conditionally independent, i.e.,

bj,d(ot+1:t+d) = P[ot+1:t+d | S[t+1:t+d] = j] = ∏_{τ=t+1}^{t+d} bj(oτ),    (12)

where bj(vk) ≜ P[ot = vk | St = j]. To easily identify when a segment of states starts, one usually further assumes that a state transition is independent of the previous state duration. This is just the assumption made for the explicit duration HMM and