视频会议系统（英文文献）.pdf

发布时间：2022-06-09 发布人：admin 分类：说明书资料大小：2.94M 资料格式：pdf 举报版权申诉

tianchengshu-2308207-4744302543003027471.pdf-第1页.png

第1页 / 共13页

tianchengshu-2308207-4744302543003027471.pdf-第2页.png

第2页 / 共13页

tianchengshu-2308207-4744302543003027471.pdf-第3页.png

第3页 / 共13页

tianchengshu-2308207-4744302543003027471.pdf-第4页.png

第4页 / 共13页

tianchengshu-2308207-4744302543003027471.pdf-第5页.png

第5页 / 共13页

tianchengshu-2308207-4744302543003027471.pdf-第6页.png

第6页 / 共13页

tianchengshu-2308207-4744302543003027471.pdf-第7页.png

第7页 / 共13页

tianchengshu-2308207-4744302543003027471.pdf-第8页.png

第8页 / 共13页

文本预览

EURASIP Journal on Applied Signal Processing 2002:6, 622–634 c 2002 Hindawi Publishing Corporation A Standard-Compliant Virtual Meeting System with Active Video Object Tracking Chia-Wen Lin Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi 621, Taiwan Email: cwlin@cs.ccu.edu.tw Yao-Jen Chang Department of Electrical Engineering, National Tsing Hua University, Hsinchu 300, Taiwan Email: kc@benz.ee.nthu.edu.tw Chih-Ming Wang Department of Electrical Engineering, National Tsing Hua University, Hsinchu 300, Taiwan Email: neil@benz.ee.nthu.edu.tw Yung-Chang Chen Department of Electrical Engineering, National Tsing Hua University, Hsinchu 300, Taiwan Email: ycchen@ee.nthu.edu.tw Ming-Ting Sun Information Processing Laboratory, University of Washington, Seattle, WA 98195, USA Email: sun@ee.washington.edu Received 2 October 2001 and in revised form 24 March 2002 This paper presents an H.323 standard compliant virtual video conferencing system. The proposed system not only serves as a multipoint control unit (MCU) for multipoint connection but also provides a gateway function between the H.323 LAN (local- area network) and the H.324 WAN (wide-area network) users. The proposed virtual video conferencing system provides user- friendly object compositing and manipulation features including 2D video object scaling, repositioning, rotation, and dynamic bit-allocation in a 3D virtual environment. A reliable, and accurate scheme based on background image mosaics is proposed for real-time extracting and tracking foreground video objects from the video captured with an active camera. Chroma-key insertion is used to facilitate video objects extraction and manipulation. We have implemented a prototype of the virtual conference system with an integrated graphical user interface to demonstrate the feasibility of the proposed methods. Keywords and phrases: video conference, virtual meeting, immersive video conference, networked multimedia, multipoint con- trol unit, MCU, multimedia over IP, object segmentation, object tracking. INTRODUCTION 1. With the rapid growth of multimedia signal processing and communication, virtual meeting technologies are becoming possible. A virtual meeting environment provides the re- mote collaborators advanced human-to-computer or even human-to-human interfaces so that scientists, engineers, and businessmen can work and conduct business with each other as if they were working face-to-face in the same environment. The key technologies of virtual meeting can also be used in many applications such as telepresence, remote collabora- tion, distance learning, electronic commerce, entertainment, Internet gaming, and so forth. A virtual conferencing prototype, personal presence sys- tem (PPS), which provides user presentation control (e.g., scaling, repositioning, etc.) was ﬁrstly proposed in [1, 2]. Due to the large computational demand, a powerful dedi- cated hardware is required for supporting the virtual meet- ing functionality, thereby leading to a high implementa- tion cost. Recently, several works have been done with a fo- cus on the development of avatar-based virtual conferenc- ing systems. For example, a virtual chat room application, V- Chat [3], has been developed to provide a 3D environment with 2D cartoon-like characters, which is capable of send- ing text messages and performing some predeﬁned actions.

A Standard-Compliant Virtual Meeting System with Active Video Object Tracking 623 H.323 terminal H.323 terminal H.323 terminal VCMCU H.323 terminal functions Conversion functions (MC/MP) H.324/l terminal functions H.324/l terminal H.324/l terminal ISDN ... ISDN MP: multipoint processor MC: multipoint controller LAN Figure 2: The conceptual model of the proposed VCMCU. the ultimate physical layer. H.323 includes many mandatory or optional component standards and protocols such as au- dio codec (G.711/G.722/G.723.1/G.728/G.729), video codec (H.261/263), data conferencing protocol (T.120), call signal- ing, media packet formatting and synchronization protocol (H.225.0), and system control protocol (H.245) which de- ﬁnes a message syntax and a set of protocols to exchange multimedia messages. In this paper, we present a low-cost virtual meeting sys- tem which fully conforms to the H.323 standard. We pro- pose a virtual conference multipoint control unit (VCMCU) which adopts the H.263 [11] video coding standard. The proposed virtual meeting system involves 2D natural video objects and 3D synthetic environment. We propose a real- time, reliable, and accurate scheme for extracting and track- ing foreground video objects from the video captured with an active camera (i.e., a camera with a certain range of pan, tilt, and zoom-in/out control). After object segmentation, we use chroma-key-based schemes so that the developed tech- niques can be adopted in H.263 compatible systems with- out sending the video object shape information. The concept can also be easily extended to MPEG-4 based systems where the shape information is already included in the coded video data. The rest of this paper is organized as follows. In Section 2, we discuss the architecture of the proposed virtual video con- ferencing system. Section 3 describes an active video object extraction and tracking scheme based on background mo- saicking for real-time segmentation. Section 4 presents the video object compositing and manipulation scheme in the proposed system. Finally, conclusions are given in Section 5. 2. PROPOSED SYSTEM ARCHITECTURE The proposed VCMCU is a PC-based prototype which per- forms the multimedia communications and protocol trans- lations among multiple LAN and WAN terminals. The con- ceptual model of the proposed VCMCU is shown in Figure 2. The LAN users can establish a multipoint video conference session with the WAN (ISDN) users via the VCMCU. Typ- ically a gateway is used for point-to-point communication. When there are more than two endpoints to hold a con- ference, an MCU is required to control and manage the Figure 1: A virtual meeting with 2D video objects manipulated against a 3D virtual environment. While the 2D avatar of V-Chat provides acceptable represen- tation, several research works have been conducted on seek- ing for 3D avatar solutions such as the virtual space tele- conferencing system proposed by Ohya et al. [4], the vir- tual life network (VLNet) with life-like virtual humans pro- posed by C¸ apin et al. [5], and the networked intelligent col- laborative environment (NetICE) with an immersive visual and aural environment and speech-driven avatars proposed by Leung et al. [6]. Although the synthetic avatar-based solu- tions usually consume a small bandwidth, they may not pro- vide satisfactory viewing quality due to the immature tech- nologies to date. Hence, an eﬃcient alternative is to adopt natural video instead of synthesizing facial expressions on a 3D avatar. In the FreeWalk system developed by Nakanishi et al. [7], the avatar is represented by a pyramid of 3D polygons with user’s video attached on one ﬂat rectangular face. A 3D environ- ment is also provided, where users can freely navigate to perform casual meetings. The InterSpace system proposed in [8, 9] also provides similar functionalities with avatars containing a 3D body with a synthesized computer moni- tor showing user’s video as the head of the 3D body. The use of natural video indeed increases the viewing quality and is more suitable for real-time communications. How- ever, the direct use of the natural video containing the back- ground from real world of diﬀerent users would degrade the sense of immersion. It is more preferable to have the background removed. Figure 1 illustrates a scenario of vir- tual video conferencing which involves 2D natural video ob- jects (from the remote conferees) manipulated in a synthetic 3D virtual environment. In order to make this possible, sev- eral key technologies need to be incorporated. For exam- ple, we need to be able to segment the conferees involved and compose them together into a single scene in real-time so that these video objects appear interacting in the same environment. To facilitate audio and video communications with ex- isting coding standards, we adopt ITU-T H.323 multime- dia communication standard [10] as the basic system archi- tecture. ITU-T H.323 is the most widely adopted standard to provide audiovisual communications over LANs. H.323 can be used in any packet-switched network, regardless of

624 EURASIP Journal on Applied Signal Processing H.323 client Server (VCMCU) H.324/l client Vtalk323 Multipoint controller Vtalk324/l AV codecs G.711, G.723.1, H.263 T.120 datacom System control H.245, H.225.0 System control H.245, H.225.0, H.223 Video proc. Audio proc. T.120 data server Session control Session control System control H.245, H.223 T.120 data AV codecs G.723.1, H.263 LAN ISDN Figure 3: The proposed VCMCU system architecture. multipoint session. However, in practical applications, using two separate devices to perform the gateway and the confer- ence control functions is expensive and undesirable. Figure 3 depicts the system architecture of the VCMCU. The VCMCU not only serves as an MCU but also plays the role of a gate- way to interwork between the H.323 (for LAN) and H.324 [12] (for WAN) client terminals. The client side is basically an H.323 or H.324 multimedia terminal, which generates an H.323- or H.324-compliant bit-stream with integrated au- dio, video, and data content. The VCMCU server receives and terminates bit-streams from the H.323 and H.324 client ter- minals as well as performs the multipoint controller (MC) and multipoint processor (MP) functions [10]. The MC and MP are used for protocol conversion and bandwidth adap- tation, respectively. In fact, the VCMCU server also includes the complete functions of the H.323/H.324 client terminals. The VCMCU supports two operation modes. One is the base-line mode, and the other is the advanced virtual meet- ing mode. Figure 4 shows a snapshot of the baseline-mode user-interface of the VCMCU. It supports the split-screen display format so that four subwindows can be seen simul- taneously in a continuous presence fashion. To balance the asymmetrical bandwidth requirement of one incoming and four outgoing video streams, a dynamic bit-allocation video transcoding scheme, described in [13, 14], is used in the VCMCU to allocate appropriate bit-rates to the subwindows by taking into account the joint spatio-temporal complexity of each subwindow. The client terminals may also send their preferences on the conferee bit-allocations to the VCMCU to favor the quality of the video objects of interest. Figure 5 shows the proposed server-client architecture for video processing used in the advanced virtual meeting mode. At the client side, the video object of each conferee is ﬁrst extracted. Then a chroma-key [15] is ﬁlled in the background. After the chroma-key insertion, the client video is encoded using an H.263 video coder, then the resultant bit-stream is sent to the VCMCU. The server extracts all Figure 4: The baseline-mode user-interface for client terminals. the client video objects according to the chroma-key infor- mation, transcodes the video objects to adapt to the avail- able network bandwidth and the user demands, and inserts the chroma-key ﬁlled background again. The server then sends back to each client the transcoded video objects with chroma-keyed background. The clients decode the received bit-stream, extract the chroma-keyed objects, then compose and render the 2D video objects in a pre-stored 3D synthetic environment. In the process described above, object extraction, track- ing, compositing, and manipulation are in general the most computationally demanding tasks. To meet the real-time re- quirement, we propose a fast object segmentation and track- ing scheme, as will be described in the following, which

A Standard-Compliant Virtual Meeting System with Active Video Object Tracking 625 Object extraction & chroma-key insertion H.263 video encoder Client 2, Client 3, Client 4 Client 1 (video capture) 3D virtual conference environment setup Object extraction Client 2 Client 3 Client 4 Network 4 signals H.263 video decoder Object extraction 4 signals 4 signals R-D optimized object-based bit- allocation 4 signals H.263 video decoder Network H.263 video encoder Video object compositing Figure 5: Video processing architecture used in the advanced virtual meeting mode. uses pre-stored background information, and a chroma- key-based object extraction scheme. We also propose a ta- ble look-up based geometrical transformation method for real-time manipulating and rendering the video objects in the pre-stored 3D synthetic background. The compu- tation required for the proposed object compositing, ma- nipulating, and rendering can also be done in the video display cards if they support OpenGL technologies, so as to reduce the computation burden at the client terminals drastically. 3. VIDEO OBJECT EXTRACTION AND TRACKING WITH AN ACTIVE CAMERA In general, video object segmentation is computationally very expensive. For virtual video conferencing applications, a fully automatic real-time segmentation is needed, which must be able to detect a foreground object even if it is still since the users may keep still for a long time. The segmenta- tion should also have pixel-wise shape accuracy, and be ro- bust against noise and lighting changes. Theses requirements prevent us from using most of the prior segmentation meth- ods [16, 17, 18], which may require human interaction, may fail to extract video objects which keep still for a long time, cannot be operated in real-time, or cannot provide pixel-wise shape accuracy. Automatic video object extraction is a diﬃcult problem in general situations. However, in a virtual video conferenc- ing system, it is relatively easy to pre-capture the background, which can be useful for the automatic extraction of the foreground objects. Background subtraction is an eﬃcient method to discriminate moving objects from the still back- ground [19, 20, 21, 22]. The idea of background subtraction is to subtract the current image from the still background, which is acquired before the objects move in. After sub- traction, only nonstationary or new objects are left. This method is especially suitable for video conferencing [21, 22] and surveillance applications [19], where the backgrounds remain still during the conference or monitoring time. Nev- ertheless, there are still many annoying factors such as simi- lar color appearing in both foreground and background ar- eas, changing of lighting conditions, and camera noise which have prevented us from using a simple diﬀerence and thresh- old method to automatically segment the video objects. Furthermore, using a static camera may restrict the fore- ground object to be in a very limited view which is not satis- fying in many applications. Conventional background analy- sis schemes could not be applied easily to images taken from an active camera. Although a few references, [23, 24], have addressed the problem of object segmentation and track- ing using an active camera, they cannot segment the mov- ing object in an arbitrary pan and tilt angle. Therefore, they must store all the views in ﬁxed pan and tilt angles, which is very ineﬃcient and unﬂexible. Moreover, in video confer- encing applications, the above operations need to be done in real-time, making computational complexity also a major concern. To overcome the inconvenience and limitation caused by a static camera, we propose a real-time object extraction and tracking scheme using a mosaic technique with an ac- tive camera. Figure 6 depicts the conceptual diagram of our proposed scheme, which is aimed at the segmentation of the

626 EURASIP Journal on Applied Signal Processing Tracking Background substraction Standby Object detection Yes Background model Update No Extraction Background Foreground Figure 6: The conceptual diagram of the proposed object segmentation and tracking scheme. foreground object from the background scene and tracking the moving object with an active camera such that the mov- ing object always stays around the center of images. In video conferencing applications, the system requires a robust seg- mentation algorithm with a pixel-wise accuracy, a reliable tracking strategy, and a low computational complexity. Prior to performing segmentation and tracking, a num- ber of background images with equally spaced pan and tilt angles are captured and analyzed. A panoramic mosaic image of the background is then automatically constructed from those background images as the reference background model in the segmentation process. After the mosaic image has been constructed, the live video captured from the camera is fed into the detection module. The detection module monitors any change in the scene and activates the segmentation mod- ule when an intruding object is detected. As the segmentation mechanism is activated, the foreground object is extracted from the background and the extracted foreground is uti- lized as the basis to control the active camera to track the moving object. In addition, the separated background is uti- lized to update the corresponding background model to im- prove the segmentation result. In our system, an active cam- era is used, which enables the moving object (foreground) to be extracted from the background with arbitrary pan and tilt angles, and the object can move in a wide range of background. Image mosaics construction 3.1. The ﬁrst step to constructing the panoramic mosaic is the alignment of images, that is, to estimate the global motion between the consecutive images, as shown in Figure 7. The following 2-parameter translation motion model is used for images alignment: x = x + a, y = y + b. (1) ) is determined by minimiz- The global motion vector (a∗, b∗ ing a speciﬁed error function in the overlapping region: (a∗, b∗ ) = argmin E(a, b) (a,b) I1(x + a, y + b) − I0(x, y) (x,y)∈S 2 = argmin (a,b) (x,y)∈S 1 as follows: (2) , M xm,ym = (1 − α) · Mxm,ym + α · fx,y (4) k i j Mosaic image Raw video frames frame i frame j frame k Figure 7: Estimation of the global motion and construction of panoramic mosaic image from consecutive frames. where I0 and I1 are two consecutive background images; S stands for the overlapping region. Once the global motion of the consecutive frames is obtained, these frames can be integrated into a panoramic mosaic. Pixel blending is used to reduce the discontinuities in color and in luminance. In addition to constructing the panoramic mosaic, our goal is to construct an accurate back- ground model for object extraction. We propose to use an exponential weighting function to blend the overlapping re- gions, as shown in the following equation and in Figure 8, − 2 x − xc Cx y − yc Cy + 2 , (3) wx,y = exp where wx,y is the weight at the (x, y) position in the image to be blended, and (xc, yc) represents the central position of the image. The mosaic image blended with the exponential weighting function is more seamless than that with linear blending functions. Another reason is that the center region in the image is more suitable for background subtraction and the weights around the central regions should be larger than the boundary regions. The blended pixel value of the mosaic image is computed

A Standard-Compliant Virtual Meeting System with Active Video Object Tracking 627 e u l a v g n i t h g i e w l a i t n e n o p x E 1 0.6 0.5 0.4 0.2 0 100 50 150 100 50 0 0 Figure 8: Exponential blending weighting function. with w m α = w f (x, y) w xm, ym m = wm xm, ym , xm, ym + w f (x, y), (5) where Mxm,ym is the original pixel value in the mosaic image, M xm,ym is the updated pixel value, fx,y is the pixel value of the incoming image to be integrated into the mosaic, wm(xm, ym) is the weight at (xm, ym) in the mosaic, and w f (x, y) is the weight at (x, y) in the incoming image. Figure 9 shows an example of a subview in the mosaic image. The panoramic mosaic image is constructed from 15 views taken from equally spaced pan and tilt angle positions. In our method, the mosaic image is to provide an initial rough reference background model for the background sub- traction method, and the background model is then reﬁned gradually according to the segmentation result. 3.2. Object segmentation based on background subtraction 3.2.1 Background subtraction Figure 10 depicts the proposed object segmentation and tracking procedure with an active camera using background subtraction. In constructing the mosaic image, each pixel value in each view may change over a period of time due to camera noises and illumination ﬂuctuations by lighting sources. Therefore, each view is analyzed over several sec- onds of video, and is then modeled by representing each pixel value with two parameters, mean and standard deviation, during the analyzing period, as follows: µ(p) = 1 N Ri(p), N 1 N i=1 N i=1 std(p) = i (p) − µ2(p), R2 Figure 9: The mosaic image constructed as a reference background. where p presents the index of pixels in the pre-captured back- ground frames; Ri(p) is a vector with the luminance and chrominance values of the pixel p in the ith background frame; µ(p) and std(p) represent the mean and the standard deviation of the luminance and chrominance values of the pixel p during the N analyzed background frames in the view. After calculating the background model parameters for each pixel, those diﬀerent views are then fused into a mosaic image in which the model parameters are blended with (4). There- fore, each pixel in the mosaic image has two parameters: the mean µM(p) and the standard deviation stdM(p). The criterion to classify pixels is described as follows: C(p) − µM(p) > k × stdM(p) if p ∈ foreground else p ∈ background where C(p) is the pixel value of position p in the current frame to be segmented; µM(p), stdM(p) are the correspond- ing mean and the standard deviation of the subview in the mosaic background image, respectively; k is a constant used to control the threshold for segmentation. To locate the subview in the mosaic image as the corre- sponding background model, there are ﬁve steps: (1) get the camera pan and tilt angle position from the ac- tive camera and use these parameters to roughly locate the subview in the mosaic image, (2) segment and remove the foreground in the current frame by the background subtraction method, (3) use the remaining background in the current frame to ﬁnd the more accurate subview in the mosaic image, (4) update the corresponding background model, (5) iteratively repeat steps (2), (3), and (4) until the corre- sponding background model is stable. (6) 3.2.2 Post-ﬁltering Background subtraction can roughly classify pixels of back- ground and foreground, but the resultant segmentation result may still be quite noisy due to camera noises,

628 EURASIP Journal on Applied Signal Processing Camera pan, tilt angle Mosaic image Background model Updating background Background Background subtraction Checking neigbourhood Morphological operations Region growing Template matching Foreground Current frame Delay Figure 10: Procedure of the proposed object extraction and tracking with an active camera. p1 p4 p6 p2 p p7 p3 p5 p8 Figure 11: Neighborhood majority voting. illumination variations, and inappropriate threshold selec- tions. Some post-ﬁltering operations are subsequently per- formed to reﬁne the segmentation result. To mitigate the distortion of the corresponding back- ground model, we assume that there may be one-pixel dis- placement error between the current frame and the back- ground model. Then, each pixel p in the current frame, as shown in Figure 11, is classiﬁed as either a background or a foreground class by considering not only the corresponding pixel in the background model but also the eight neighbors of the corresponding pixel. The neighborhood majority-voting criterion is described as follows: count = 0; for > k × stdM pi ∈ Neighborhood(p) if C(p) − µM pi pi count = count + 1; if (count ≥ 3) p ∈ foreground p ∈ background. else After the neighborhood majority voting, a reﬁned alpha mask [25] is obtained. Then, median ﬁltering is performed to reduce noises. Next, morphological opening and closing operations are used for further improving the segmentation result. 3.2.3 Region growing At the ﬁnal step of object discrimination, a region growing method is used to reduce coarse granular noises in the alpha plane. It preserves the region of interest in the alpha plane and erases the other regions, which are considered as back- ground noises. In the region growing procedure, the seed point in the interested region must be chosen ﬁrst. We pro- pose two ways to select the seed point in the system. One is to use “integral projection” [26] with the alpha plane to obtain the seed point. And the other is to calculate the centroid of the skin-color region in the frame as the seed point for re- gion growing because the human face is usually the region of interest in our system. In our method, the integral projection method is adopted to obtain the seed point when the area of the skin-color region is less than a threshold, otherwise, the centroid of the skin-color is used. The procedure of the region growing method which ob- tains the seed point using integral projection is summarized as follows: (1) project the alpha plane F(x, y) using horizontal in- tegral projection and vertical integral projection as shown in Figures 12b and 12c: Proj H(y) = Proj V(x) = F(x, y), F(x, y), (7) x y where Proj H(y) and Proj V(x) are the horizontal inte- gral projection and vertical integral projection, respec- tively, (2) ﬁnd the index of each maximum value in Proj H(y) and ProjV(x): SeedPointx = argmax SeedPointy = argmax x y Proj V(x) Proj H(y) , (8) ,

A Standard-Compliant Virtual Meeting System with Active Video Object Tracking 629 may fail to obtain an accurate and robust segmentation when the camera moves while tracking the object, for the corre- sponding background model becomes inaccurate. The back- ground update mechanism may also fail when the camera moves because the background in the incoming image is changing. In addition, the aforementioned iterative proce- dure for ﬁnding the corresponding background model in the mosaic image will cause delay in the system when the active camera moves. In our method, as shown in Figure 10, when camera moves, we get the camera pan and tilt angle position through an RS-232 port and then we only use these parameters to roughly locate the subview in the mosaic image with- out iteratively ﬁnding the corresponding background model. This can save a lot of computation. To reﬁne the roughly segmented result, a template matching and object tracking method is adopted. Each pixel value of the current extracted object Cn(p) is matched to the corresponding one of the pre- vious extracted object Cn−1(p) for reducing the segmentation noises in Cn(p). Tracking the moving object is a challenging task because searching for features and obtaining the correspondence be- tween consecutive frames is not easy. To reduce the com- plexity and computation of the tracking, only the centroid of skin-color region in the extracted object is selected. In this way, the proposed strategy achieves robust tracking without any prior knowledge of the object shape. The tracking is per- formed as follows. 3.3.1 Template matching The correspondence problem can be formulated as a back- ward matching motion estimation problem, similar to that employed in predictive video compression. For fast and ro- bust template matching, we adopted the diamond search al- gorithm proposed in [27]. The template matching criterion is described as follows: if Cn p1 ∈ foreground ∈ foreground ∈ foreground ∈ background ﬁnd Cn−1 if Cn−1 p2 Cn p1 else Cn p2 p1 , which corresponds to Cn p1 F(x, y) Proj H(y) Proj V(x) (a) (c) (b) F (x, y) (d) Figure 12: Integral projection process: (a) alpha plane with coarse granular noises; (b) horizontal integration result from the alpha plane and dotted lines indicate the maximum value; (c) vertical in- tegration result from the alpha plane and dotted lines indicate the maximum value; (d) segmentation result after removing the coarse granular noises. (3) retain the region containing the seed point, and erase the other regions. To ﬁnd the skin-color region of a captured image, we adopted the method described in [26]. In the skin-color de- tection method, a pre-trained Gaussian model with the Cb and Cr chrominance components is utilized for character- izing the probability distribution of skin-color. By adopting this probability model, a pixel is classiﬁed as the skin-color class when the probability of its color tone being skin-color is higher than a predetermined threshold. After skin-color segmentation, the centroid of the skin-color region is then computed. 3.2.4 Background update The separated background in the incoming frame is utilized to update the intensity mean of corresponding background model, µM(p), obtained from the mosaic image. The update mechanism is as follows: If C(p) ∈ foreground µ M(p) = µM(p) else M(p) = (1 − η)µM(p) + ηC(p). µ Since the standard deviation of each pixel in the mosaic background model usually does not have signiﬁcant varia- tions during the time, it is not updated. The update mech- anism is very eﬀective in improving the segmentation result and it can also resist to the slow changes in lighting condi- tions in the images. 3.3. Object segmentation and tracking with an active camera 3.3.2 Moving object tracking In the tracking mode, the active camera is controlled to put the moving object in the central region of the captured image according to the selected feature point (e.g., the centroid of the skin-color region). A linear trajectory model is used to predict the feature point’s future position. Suppose that the feature point position is at p(t) at time t. We can predict the feature point position at time t + 1,p(t + 1), by assuming a constant velocity v which is obtained from the two previous frames, p(t + 1) = p(t) + ν(t). (9) The proposed segmentation method mentioned above works well when the active camera keeps stationary. However, it ν(t) = p(t) − p(t − 1),

分享到：

赞收藏

资料库

视频会议系统（英文文献）.pdf

相关推荐

课程资源

热门标签

最新资料