[standards IN A NUTSHELL] AVS2—Making Video Coding Smarter
Siwei Ma, Tiejun Huang, Cliff Reader, and Wen Gao
Digital Object Identifier 10.1109/MSP.2014.2371951. Date of publication: 12 February 2015.

AVS2 is a new generation of video coding standard developed by the IEEE 1857 Working Group under project 1857.4. AVS2 is also the second-generation video coding standard established by the Audio and Video Coding Standard (AVS) Working Group of China; the first-generation AVS1 was developed by the AVS Working Group and issued as Chinese national standard GB/T 20090.2-2006 in 2006. The AVS Working Group was founded in 2002 and is dedicated to providing the digital audio-video industry with highly efficient and economical coding/decoding technologies. The AVS1 video coding standard is now widely implemented in regional broadcasting, communication, and digital video entertainment systems. As the successor of AVS1, AVS2 is designed to achieve significant coding-efficiency improvements relative to the preceding H.264/MPEG-4 AVC and AVS1 standards. The basic coding framework of AVS2 is similar to that of the contemporaneous HEVC/H.265, but AVS2 can provide more efficient compression for certain video applications such as surveillance, as well as for low-delay communication such as videoconferencing. AVS2 is making video coding smarter by adopting intelligent coding tools that not only improve coding efficiency but also help with computer vision tasks such as object detection and tracking.

BACKGROUND
The AVS Working Group was established in March 2002 in China. The mandate of the group is to establish generic technical standards for the compression, decoding, processing, and representation of digital audio-video content, thereby enabling digital audio-video equipment and systems with highly efficient and economical coding/decoding technologies. After more than a decade, the working group has published a series of standards, including AVS1, the culmination of the first stage of work. Table 1 shows the time line of the AVS1 video coding standard (for short, AVS1). In AVS1, six profiles were defined to meet the requirements of various applications. The Main Profile focuses on digital video applications such as commercial broadcasting and storage media, including high-definition video applications; it was approved as a national standard in China, GB/T 20090.2-2006. It was followed by the Enhanced Profile, an extension of the Main Profile with higher coding efficiency, targeting the needs of multimedia entertainment such as movie compression for high-density storage. The Surveillance Baseline and Surveillance Profiles focus on video surveillance applications, considering in particular the characteristics of surveillance video: high noise levels, relatively low encoding complexity, and requirements for easy event detection and search. The Portable Profile targets mobile video applications with lower resolution, low computational complexity, and robust error resiliency for wireless environments. The latest Broadcasting Profile is again an improvement of the Main Profile and targets high-quality, high-definition TV (HDTV) broadcasting; it was approved and published as an industry standard by the State of China Broadcasting Film and Television Administration in July 2012.

[TABLE 1] TIME LINE OF THE AVS1 VIDEO CODING STANDARD.
| Time | Profile | Target Application(s) | Major Coding Tools |
|---|---|---|---|
| December 2003 | Main | TV broadcasting | 8 × 8 block-based intraprediction, transform, and deblocking filter; variable-block-size motion compensation (16 × 16 to 8 × 8) |
| June 2008 | Surveillance Baseline | Video surveillance | Background-predictive picture for video coding, adaptive weighting quantization (AWQ), core frame coding |
| September 2008 | Enhanced | Digital cinema | Context binary arithmetic coding (CBAC), AWQ |
| July 2009 | Portable | Mobile video communication | 8 × 8/4 × 4 block transform |
| July 2011 | Surveillance | Video surveillance | Background-modeling-based coding |
| May 2012 | Broadcasting | HDTV | AWQ, enhanced field coding |

AVS standards are also being recognized internationally. In 2007, the Main
Profile was accepted as an option of video codecs for Internet Protocol television (IPTV) applications by the International Telecommunication Union–Telecommunication Standardization Sector (ITU-T) Focus Group on IPTV standardization [1]. The IEEE 1857 Working Group was established in 2012 to work on IEEE standards for advanced audio and video coding, based on individual members of the IEEE Standards Association from the AVS Working Group. The IEEE 1857 Working Group meets three to four times annually to discuss the standard technologies, syntax, and so on. To date, the IEEE 1857 Working Group has finished three parts of the IEEE 1857 standards: IEEE 1857-2013 for video, IEEE 1857.2-2013 for audio, and IEEE 1857.3-2013 for systems [2].

AVS standards have been developed in compliance with the AVS intellectual property rights (IPR) policy. This policy includes an up-front commitment by participants to license essential patents with a declaration of default licensing terms: royalty-free without compensation but otherwise under reasonable and nondiscriminatory terms (RAND-RF), participation in the AVS patent pool, or licensing under RAND terms. The disclosure of published patent applications and granted patents is required, and disclosure of the existence of unpublished applications is also required if the RAND option is taken. Licensing terms are also considered in the adoption of proposals for AVS standards when all technical factors are equal, and reciprocity in licensing is required. Protection of participants' IPR is provided to guard against the situation in which the IPR of a participant is disclosed by another party. AVS has encouraged the establishment of a Patent Pool Administration (PPA) that is independent from the AVS Working Group, which focuses only on the standards. The AVS standards are also fully compliant with the IPR policy of IEEE standards; AVS2 has been developed in accordance with these AVS and IEEE IPR policies to ensure rapid licensing of essential patents at competitive royalty rates.

Based on the success of AVS1 and recent research and standardization work, AVS has been working on a new generation of video coding technology called AVS2 (more specifically, Part 2 of the AVS2 series of standards). In fact, since 2005, before the AVS2 project officially started, AVS had been continuously working on the AVS-X project to explore more efficient coding techniques. AVS2 was started formally by issuing a call for platforms in March 2012. By October 2012, a reference platform (RD 1.0) based on the AVS1 reference software was developed for AVS2 [3]. After that, AVS2 continued to improve its coding efficiency, and the committee draft 2.0 of the standard was finalized in June 2014. It has been approved as an IEEE standards project, IEEE 1857.4, and as a Chinese national standard project, both of which were expected to be finished by the end of 2014 at the time of this writing.

As a successor of AVS1, AVS2 is designed to improve coding efficiency for higher-resolution videos and to provide efficient compression solutions for various kinds of video applications.
Compared to the preceding coding standards, AVS2 adopts smarter coding tools that are adapted to satisfy the new requirements identified from emerging applications. First, more flexible prediction block partitions are used to further improve prediction accuracy, e.g., square and nonsquare partitions, which adapt better to image content, especially in edge areas. Related to the prediction structure, the transform block size is more flexible and can be up to 64 × 64 pixels. After transformation, context-adaptive arithmetic coding is used for the entropy coding of the transformed coefficients, and a two-level coefficient scan and coding method encodes the coefficients of large blocks more efficiently. Moreover, for low-delay communication applications, e.g., video surveillance and videoconferencing, where the background usually does not change often, a background-picture-model-based coding method is developed in AVS2. The background picture, constructed from original or decoded pictures, is used as a reference picture to improve prediction efficiency, and test results show that this background-picture-based prediction can improve coding efficiency significantly. Furthermore, the background picture can also be used for object detection and tracking in intelligent surveillance. In addition, to support object tracking among multiple cameras in surveillance applications, navigation information, such as that from the Global Positioning System and the BeiDou Navigation Satellite System of China, is also defined; it mainly includes timing, location, and movement information. Finally, aiming at more intelligent surveillance video coding, AVS2 also started a
digital media content description project, in which visual objects in images or videos are described with multilevel features to facilitate visual-object-based storage, retrieval, and interactive applications. This column provides a short overview of AVS2 video coding technology and a performance comparison with other video coding standards.

TECHNOLOGY AND KEY FEATURES
Similar to previous coding standards, AVS2 adopts the traditional prediction/transform hybrid coding framework, as shown in Figure 1. Within this framework, a more flexible coding structure is adopted for efficient high-resolution video coding, and more efficient coding tools are developed to make full use of texture information and temporal redundancies. These tools can be classified into four categories: 1) prediction coding (including intraprediction and interprediction), 2) transform, 3) entropy coding, and 4) in-loop filtering. We give a brief introduction to the coding framework and coding tools below.

[FIG1] The coding framework of an AVS2 encoder (intra-/interprediction and motion estimation, transform/quantization, entropy coding, inverse transform/dequantization, loop filtering, and frame buffering).

Coding Framework
In AVS2, a coding unit (CU)-, prediction unit (PU)-, and transform unit (TU)-based coding/prediction/transform structure is adopted to represent and organize the encoded data [3]. First, pictures are split into largest coding units (LCUs), which consist of 2N × 2N samples of the luminance component and the associated chrominance samples, with N = 8, 16, or 32. One LCU can be a single CU or can be split into four smaller CUs with a quadtree partition structure; a CU can be recursively split until it reaches the smallest CU size limit, as shown in Figure 2(a). Once the splitting of the CU hierarchical tree is finished, the leaf-node CUs can be further split into PUs. The PU is the basic unit for intra- and interprediction and allows multiple different shapes to encode irregular image patterns, as shown in Figure 2(b). The size of a PU is limited to that of its CU, with various square or rectangular shapes; both intra- and interprediction partitions can be symmetric or asymmetric. Intraprediction partitions vary in the set {2N × 2N, N × N, 2N × 0.5N, 0.5N × 2N}, while interprediction partitions vary in the set {2N × 2N, 2N × N, N × 2N, N × N, 2N × nU, 2N × nD, nL × 2N, nR × 2N}, where U, D, L, and R are abbreviations of "up," "down," "left," and "right," respectively, and n = 0.25N. Besides the CU and PU, the TU is defined as the basic unit for transform coding and quantization. The size of a TU cannot exceed that of its CU, but it is independent of the PU size.

[FIG2] (a) The maximum possible recursive CU structure in AVS2 (LCU size = 64, maximum hierarchical depth = 4). (b) Possible PU splittings for skip, intramodes, and intermodes in AVS2, including symmetric and asymmetric prediction (d = 1, 2 for intraprediction and d = 0, 1, 2 for interprediction).
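To make the quadtree structure concrete, the sketch below recursively decides whether to code a block as a single CU or split it into four smaller CUs. The rate-distortion cost is a variance-based stand-in, and names such as `split_cu` and `toy_rd_cost` are illustrative only; this is not the AVS2 reference-software mode decision.

```python
# A minimal sketch of recursive quadtree CU splitting (LCU = 64, smallest CU = 8),
# assuming a toy variance-based cost in place of a real rate-distortion metric.
import numpy as np

MIN_CU, MAX_CU = 8, 64

def toy_rd_cost(block: np.ndarray) -> float:
    # Stand-in cost: flat blocks are "cheap" to code as one CU.
    return float(np.var(block)) * block.size

def split_cu(block: np.ndarray, x: int = 0, y: int = 0, size: int = MAX_CU):
    """Return (leaf CUs as (x, y, size) tuples, total cost) for the cheaper of
    coding the block whole versus splitting it into four quadrants."""
    whole_cost = toy_rd_cost(block)
    if size == MIN_CU:                      # smallest CU: cannot split further
        return [(x, y, size)], whole_cost
    half = size // 2
    leaves, split_cost = [], 0.0
    for dy in (0, half):
        for dx in (0, half):
            sub_leaves, sub_cost = split_cu(block[dy:dy + half, dx:dx + half],
                                            x + dx, y + dy, half)
            leaves += sub_leaves
            split_cost += sub_cost
    if split_cost < whole_cost:             # keep the quadtree split only if it is cheaper
        return leaves, split_cost
    return [(x, y, size)], whole_cost

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lcu = rng.integers(0, 256, (MAX_CU, MAX_CU)).astype(float)
    lcu[:32, :32] = 128.0                   # a flat region that should stay one large CU
    print(split_cu(lcu)[0])
```

In a real encoder the decision additionally covers the PU and TU choices inside each leaf CU, but the recursion pattern over the quadtree is the same.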
Intraprediction
Intraprediction is used to reduce the redundancy existing in the spatial domain of a picture. Block-partition-based directional prediction is used in AVS2 [5]. As shown in Figure 2, besides the square PU partitions, nonsquare partitions, called short-distance intraprediction (SDIP), are adopted by AVS2 for more efficient intraluminance prediction [4], where the nearest reconstructed boundary pixels are used as the reference samples. In SDIP, a 2N × 2N PU is horizontally or vertically partitioned into four prediction blocks. SDIP is more adaptive to image content, especially in edge areas, but to limit complexity it is used for all CU sizes except the 64 × 64 CU. For each prediction block in these partition modes, a total of 33 prediction modes are supported for luminance: 30 angular modes [5], a plane mode, a bilinear mode, and a DC mode. Figure 3 shows the distribution of the prediction directions associated with the 30 angular modes. Each sample in a PU is predicted by projecting its location onto the reference pixels along the selected prediction direction. To improve intraprediction accuracy, subpixel reference samples must be interpolated if the projected reference position falls on a noninteger position. The noninteger position is bounded to 1/32-sample precision to avoid floating-point operations, and a four-tap linear interpolation filter is used to obtain the subpixel samples.

[FIG3] An illustration of the directional prediction modes: 30 angular modes arranged in three zones, plus DC (mode 0), plane (mode 1), and bilinear (mode 2).

For the chrominance components, the PU size is always N × N, and five prediction modes are supported: vertical prediction, horizontal prediction, bilinear prediction, DC prediction, and the prediction mode derived from the corresponding luminance prediction mode [6].
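The sketch below illustrates the angular-projection and 1/32-sample interpolation idea for a vertical-leaning mode predicted from the top reference row. The displacement value and the interpolation weights are illustrative (a simple two-point blend placed in a four-tap window, positive displacements only), not the normative AVS2 angle and filter tables, and `predict_vertical_angular` is a hypothetical helper name.

```python
# A sketch of directional intraprediction from the top reference row, assuming an
# illustrative per-row displacement and simplified fractional weights.
import numpy as np

def predict_vertical_angular(top_ref: np.ndarray, size: int, dx_per_row: int) -> np.ndarray:
    """top_ref: padded reference samples above the PU.
    dx_per_row: horizontal displacement per predicted row, in 1/32-sample units."""
    pred = np.zeros((size, size))
    for y in range(size):
        offset = (y + 1) * dx_per_row              # total displacement in 1/32 units
        int_part, frac = offset >> 5, offset & 31  # integer and fractional parts
        for x in range(size):
            p = top_ref[x + int_part : x + int_part + 4]      # four neighboring references
            w = np.array([0.0, 32 - frac, frac, 0.0]) / 32.0  # simplified weights
            pred[y, x] = float(p @ w)
    return pred

if __name__ == "__main__":
    top_ref = np.arange(64, dtype=float)           # a smooth, padded reference row
    print(predict_vertical_angular(top_ref, size=8, dx_per_row=13))
```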
Interprediction
Compared to spatial intraprediction, interprediction focuses on exploiting the temporal correlation between consecutive pictures to reduce temporal redundancy. Multireference prediction has been used since the H.264/AVC standard, with both short-term and long-term reference pictures. In AVS2, long-term reference picture usage is extended further: such a picture can be constructed from a sequence of decoded pictures, e.g., the background picture used in surveillance coding, which is discussed separately later. For short-term reference prediction, AVS2 defines the F frame as a special P frame [7], in addition to the traditional P and B frames. More specifically, a P frame is a forward-predicted frame using a single reference picture, while a B frame is a bipredicted frame that uses forward, backward, biprediction, and symmetric prediction with two reference frames. In a B frame, in addition to the conventional forward, backward, bidirectional, and skip/direct prediction modes, symmetric prediction is defined as a special biprediction mode in which only one forward motion vector (MV) is coded and the backward MV is derived from the forward MV. For an F frame, besides the conventional single-hypothesis prediction mode of a P frame, multihypothesis techniques are added for more efficient prediction, including the advanced skip/direct mode [8], the temporal multihypothesis prediction mode [9], and the spatial directional multihypothesis (DMH) prediction mode [10].

In an F frame, the advanced skip/direct mode is defined using a competitive motion derivation mechanism with two derivation methods, one temporal and one spatial. The temporal multihypothesis mode combines two predictors along a predefined temporal direction, while the spatial multihypothesis mode combines two predictors along a predefined spatial direction. For temporal derivation, the prediction block is obtained as the average of the prediction blocks indicated by the MV predictor (MVP) and by the MV scaled to a second reference, which is specified by a reference index transmitted in the bit stream. For temporal multihypothesis prediction, as shown in Figure 4(a), one predictor ref_blk1 is generated from the best MV (MV) and reference frame (ref1) found by motion estimation, and this MV is then linearly scaled to a second reference ref2 to generate another predictor ref_blk2; ref2 is again specified by a reference index transmitted in the bit stream. In DMH mode, as shown in Figure 4(b), the seed predictors are located on the line crossing the initial predictor obtained from motion estimation, and the number of seed predictors is restricted to eight. If one seed predictor is selected for combined prediction, for example, "Mode 1," then the index of that seed predictor, "1," is signaled in the bit stream.

[FIG4] (a) The temporal multihypothesis mode. (b) The spatial multihypothesis mode.

For spatial derivation, the prediction block may be obtained from one or two prediction blocks specified by motion information copied from spatial neighboring blocks. The neighboring blocks are illustrated in Figure 5; they are searched in the predefined order F, G, C, A, B, D, and the selected neighboring block is signaled in the bit stream.

[FIG5] An illustration of the neighboring blocks A, B, C, D, F, and G used for MVP.

Motion Vector Prediction and Coding
MVP plays an important role in interprediction: it reduces the redundancy among the MVs of neighboring blocks and thus saves a large number of coding bits for MVs. In AVS2, four different prediction methods are adopted, as listed in Table 2, each with its own usage. Spatial MVP is used for the spatial derivation of the skip/direct mode in F frames and B frames. Temporal MVP is used for the temporal derivation of the skip/direct mode in P frames and F frames. Spatial-temporal combined MVP is used for the joint temporal and spatial derivation of the skip/direct mode in B frames. In all other cases, median prediction is used.

[TABLE 2] MV PREDICTION METHODS IN AVS2.
| Method | Details |
|---|---|
| Median | Uses the median MV values of the neighboring blocks. |
| Spatial | Uses the MVs of spatial neighboring blocks. |
| Temporal | Uses the MVs of temporally collocated blocks. |
| Spatial-temporal combined | Uses the temporal MVP first if it is available; otherwise, the spatial MVP is used. |

In AVS2, the MV has quarter-pixel precision for the luminance component, and the subpixel samples are interpolated with an eight-tap DCT interpolation filter (DCT-IF) [11]. For the chrominance components, the MV derived from the luminance MV has 1/8-pixel precision, and a four-tap DCT-IF is used for subpixel interpolation [12].
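The following sketch shows two of the MV derivations mentioned above: the component-wise median predictor and the distance-based scaling of an MV toward a second reference used by the temporal derivations. MVs are kept in quarter-pel units; the rounding and clipping rules of the actual standard are omitted, and the function names are illustrative.

```python
# A sketch of median MV prediction and temporal MV scaling, assuming quarter-pel
# MV units and simple rounding (not the exact integer arithmetic of AVS2).
from dataclasses import dataclass

@dataclass
class MV:
    x: int  # horizontal component, quarter-pel units
    y: int  # vertical component, quarter-pel units

def median3(a: int, b: int, c: int) -> int:
    return sorted((a, b, c))[1]

def median_mvp(mv_a: MV, mv_b: MV, mv_c: MV) -> MV:
    """Component-wise median of three neighboring MVs (e.g., left, top, top-right)."""
    return MV(median3(mv_a.x, mv_b.x, mv_c.x), median3(mv_a.y, mv_b.y, mv_c.y))

def scale_mv(mv: MV, dist_src: int, dist_dst: int) -> MV:
    """Scale an MV pointing to a reference at temporal distance dist_src so that it
    points to a reference at distance dist_dst (used to build a second predictor)."""
    return MV(round(mv.x * dist_dst / dist_src), round(mv.y * dist_dst / dist_src))

if __name__ == "__main__":
    print(median_mvp(MV(4, -2), MV(6, 0), MV(3, -1)))   # MV(x=4, y=-1)
    print(scale_mv(MV(8, -4), dist_src=2, dist_dst=4))  # MV(x=16, y=-8)
```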
After MVP, the MV difference (MVD) is coded in the bit stream. However, redundancy may still exist in the MVD, and to further save coding bits for MVs, a progressive MV resolution adaptation method is adopted in AVS2 [13]. In this scheme, the MVP is first rounded to the nearest integer sample position, the MV is rounded to half-pixel precision if its distance from the MVP is larger than a threshold, and the resolution of the coded MVD is decreased to half-pixel precision accordingly.

Transform
Two-level transform coding is utilized to further compress the prediction residual. For a CU with a symmetric PU partition, the TU size can be 2N × 2N or N × N, signaled by a transform split flag; the maximum transform size is thus 64 × 64 and the minimum is 4 × 4. For TU sizes from 4 × 4 to 32 × 32, an integer transform (IT) that closely approximates the discrete cosine transform (DCT) is used, while for the 64 × 64 transform, a logical transform (LOT) [14] is applied to the residual: a five/three-tap integer wavelet transform is first performed on the 64 × 64 block, the low-high (LH), high-low (HL), and high-high (HH) bands are discarded, and a normal 32 × 32 IT is then applied to the low-low (LL) band. For a CU with an asymmetric PU partition, a 2N × 2N IT is used at the first level and a nonsquare transform [15] is used at the second level, as shown in Figure 6. Moreover, in the latest AVS2 standard, a secondary transform has been adopted for the intraprediction residual (for details, see the latest AVS specification document N2120 on the AVS FTP site [21]).

[FIG6] PU partitions and the corresponding two-level transform coding.
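The sketch below mimics the LOT cascade for a 64 × 64 residual: a 5/3 lifting wavelet along rows and columns, keeping only the LL band, followed by a 32 × 32 transform. A floating-point 5/3 lifting and SciPy's DCT are used here as stand-ins for the integer wavelet and integer transform of the standard, and `lot64` is a hypothetical helper name.

```python
# A sketch of the logical transform (LOT) structure: 5/3 wavelet, keep LL, then a
# 32x32 DCT as a stand-in for the integer transform.
import numpy as np
from scipy.fft import dct

def lift53_1d(s: np.ndarray):
    """One level of 5/3 lifting along a 1-D even-length signal -> (low, high)."""
    even, odd = s[0::2].astype(float), s[1::2].astype(float)
    even_ext = np.append(even, even[-1])         # repeat the last sample at the right edge
    high = odd - 0.5 * (even + even_ext[1:])     # predict step: detail coefficients
    high_ext = np.insert(high, 0, high[0])       # repeat the first detail at the left edge
    low = even + 0.25 * (high_ext[:-1] + high)   # update step: approximation coefficients
    return low, high

def lot64(block64: np.ndarray) -> np.ndarray:
    """Row/column 5/3 lifting on a 64x64 residual, keep the LL band, then 32x32 DCT."""
    rows_low = np.stack([lift53_1d(r)[0] for r in block64])     # 64 x 32 after row pass
    ll = np.stack([lift53_1d(c)[0] for c in rows_low.T]).T      # 32 x 32 LL band
    return dct(dct(ll, axis=0, norm="ortho"), axis=1, norm="ortho")

if __name__ == "__main__":
    residual = np.random.default_rng(3).standard_normal((64, 64))
    print(lot64(residual).shape)   # (32, 32)
```

The normative LOT uses integer arithmetic throughout so that encoder and decoder stay bit-exact; the wavelet-then-IT cascade, however, has the same structure.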
Entropy Coding
After transform and quantization, a two-level coding scheme is applied to the transform coefficient blocks [16]. A coefficient block is partitioned into 4 × 4 coefficient groups (CGs), as shown in Figure 7. Zig-zag scanning and context-adaptive binary arithmetic coding (CABAC) are then performed at both the CG level and the coefficient level. At the CG level of a TU, the CGs are scanned in zig-zag order; the position of the last nonzero CG is coded first, followed by a bin string of significant-CG flags indicating whether each CG scanned in zig-zag order contains nonzero coefficients. At the coefficient level, within each nonzero CG, the coefficients are scanned into (level, run) pairs in zig-zag order, where level is the magnitude of a nonzero coefficient and run is the number of zero coefficients between two nonzero coefficients. For the last CG, the position of the last nonzero coefficient in scan order is coded first; for a non-last CG, a last run is coded that denotes the number of zero coefficients after the last nonzero coefficient in zig-zag scan order. The (level, run) pairs in a CG are then coded in reverse zig-zag scan order.

[FIG7] The subblock scan for transform blocks of size (a) 8 × 8, (b) 16 × 16, and (c) 32 × 32; each subblock represents a 4 × 4 CG.

For the context modeling used in CABAC, AVS2 employs a mode-dependent context selection design for intraprediction blocks [17]. In this design, 34 intraprediction modes are classified into three prediction mode sets: vertical, horizontal, and diagonal. Depending on the prediction mode set, each CG is divided into two regions, as shown in Figure 8. The intraprediction modes and CG regions are used in the context coding of syntax elements including the last CG position, the last coefficient position, and the run value.

[FIG8] The subblock region partitions of a 4 × 4 CG in an intraprediction block.
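As a concrete illustration of the two-level scan, the sketch below splits a coefficient block into 4 × 4 CGs, visits the CGs and the coefficients inside each CG in a generic zig-zag order, and produces the per-CG significance flag and (level, run) pairs. The actual AVS2 scan tables, last-position/last-run syntax, and CABAC binarization are not reproduced; `zigzag` and `scan_tu` are illustrative names.

```python
# A sketch of the CG-level plus coefficient-level scan into (level, run) pairs,
# assuming a generic zig-zag order rather than the normative AVS2 tables.
import numpy as np

def zigzag(n: int):
    """Zig-zag visiting order of an n x n grid as a list of (row, col) pairs."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

def scan_tu(coeffs: np.ndarray):
    """Scan a square TU (multiple of 4) into per-CG significance flags and pairs."""
    n_cg = coeffs.shape[0] // 4
    out = []
    for cg_r, cg_c in zigzag(n_cg):                       # CG-level zig-zag scan
        cg = coeffs[4 * cg_r:4 * cg_r + 4, 4 * cg_c:4 * cg_c + 4]
        pairs, run = [], 0
        for r, c in zigzag(4):                            # coefficient-level scan
            v = int(cg[r, c])
            if v == 0:
                run += 1
            else:
                pairs.append((v, run))                    # (level, run of preceding zeros)
                run = 0
        out.append({"cg": (cg_r, cg_c), "significant": bool(pairs), "pairs": pairs})
    return out

if __name__ == "__main__":
    tu = np.zeros((8, 8), dtype=int)
    tu[0, 0], tu[0, 1], tu[2, 3] = 17, -3, 2
    for entry in scan_tu(tu):
        print(entry)
```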
In-Loop Filtering
Artifacts such as blocking, ringing, color bias, and blurring are common in compressed video, especially at medium and low bit rates. To suppress these artifacts, deblocking filtering, sample adaptive offset (SAO) filtering [18], and an adaptive loop filter (ALF) [19] are applied to the reconstructed pictures sequentially.

Deblocking filtering aims to remove the blocking artifacts caused by block transform and quantization. The basic unit for the deblocking filter is an 8 × 8 block, and the filter is applied only to boundaries that coincide with CU, PU, or TU boundaries.

After the deblocking filter, an SAO filter is applied to reduce the mean sample distortion of a region: an offset is added to the reconstructed samples to reduce ringing and contouring artifacts. There are two kinds of offset, edge offset (EO) and band offset (BO). For the EO mode, the encoder can select and signal a vertical, horizontal, downward-diagonal, or upward-diagonal filtering direction. For the BO mode, an offset value that depends directly on the amplitude of the reconstructed sample is added to it.

ALF is the last stage of in-loop filtering and consists of two stages. The first stage is filter coefficient derivation: the encoder classifies the reconstructed pixels of the luminance component into 16 categories and trains one set of filter coefficients for each category using Wiener-Hopf equations so as to minimize the mean squared error between the original and reconstructed frames. To reduce the redundancy among these 16 sets of filter coefficients, the encoder adaptively merges them based on rate-distortion performance; at most, 16 different filter sets can be assigned to the luminance component and only one to the chrominance components. The second stage is the filter decision, made at both the frame level and the LCU level: the encoder first decides whether frame-level adaptive loop filtering is performed and, if so, further decides whether LCU-level ALF is performed.
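The sketch below shows the edge-offset idea for the horizontal direction: each sample is classified against its two neighbors and a small signaled offset is added per category. The category mapping and the offset values are illustrative, not the normative SAO tables, and `sao_eo_horizontal` is a hypothetical helper name.

```python
# A sketch of SAO edge-offset (EO) filtering along the horizontal direction,
# assuming illustrative category/offset conventions.
import numpy as np

def sao_eo_horizontal(rec: np.ndarray, offsets: dict) -> np.ndarray:
    """rec: reconstructed 8-bit samples; offsets: per-category offsets to add."""
    out = rec.astype(int)
    for r in range(rec.shape[0]):
        for c in range(1, rec.shape[1] - 1):
            left, cur, right = int(rec[r, c - 1]), int(rec[r, c]), int(rec[r, c + 1])
            # Category in {-2,...,2}: -2 = local minimum, +2 = local maximum.
            sign = (cur > left) + (cur > right) - (cur < left) - (cur < right)
            out[r, c] = np.clip(cur + offsets.get(sign, 0), 0, 255)
    return out.astype(rec.dtype)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    frame = rng.integers(0, 256, (4, 8), dtype=np.uint8)
    # Offsets chosen to pull local extrema toward their neighbors (reduce ringing).
    print(sao_eo_horizontal(frame, {-2: 2, -1: 1, 1: -1, 2: -2}))
```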
Smart Scene Video Coding
More and more videos are captured in specific scenes (such as surveillance video and videos from classrooms, homes, courthouses, etc.) and are characterized by a temporally stable background, so the redundancy originating from the background can be further reduced. AVS2 developed a background-picture-model-based coding method [20], illustrated in Figure 9. G-pictures and S-pictures are defined to further exploit the temporal redundancy and to facilitate video event generation such as object segmentation and motion detection. The G-picture is a special I-picture that is stored in a separate background memory. The S-picture is a special P-picture that can be predicted only from a reconstructed G-picture or from a virtual G-picture, which does not exist in the actual input sequence but is modeled from input pictures and encoded into the stream to act as a reference picture.

[FIG9] Background-picture-based scene coding in AVS2.

The G-picture is initialized by background initialization and updated by background modeling, using methods such as median filtering or a fast implementation of a Gaussian mixture model. In this way, the selected or generated G-picture can represent the background of a scene well, with few occluding foreground objects and little noise. Once a G-picture is obtained, it is encoded, and the reconstructed picture is stored in the background memory of the encoder/decoder and updated only when a new G-picture is selected or generated. After that, S-pictures can be involved in the encoding process through an S-picture decision. Except that it uses a G-picture as a reference, an S-picture has properties similar to those of a traditional I-picture, such as error resilience and random access (RA). Therefore, pictures that would otherwise be coded as traditional I-pictures, such as the first picture of a group of pictures or a scene change, are candidate S-pictures.

Besides providing more prediction opportunities for the background blocks that normally dominate a picture, an additional benefit of the background picture is a new prediction mode called background difference prediction, shown in Figure 10, which improves foreground prediction performance by excluding the influence of the background. It can be seen that, after background difference prediction, the background redundancy is effectively removed. Furthermore, according to the prediction modes in the AVS2 bit stream, the blocks of an AVS2 picture can be classified as background blocks, foreground blocks, or blocks on edge areas; this information is very helpful for subsequent vision tasks such as object detection and tracking. Object-based coding was already proposed in MPEG-4; however, object segmentation remains a challenging problem, which constrains the application of object-based coding. AVS2 therefore uses simple background modeling instead of accurate object segmentation, which is easier and provides a good tradeoff between coding efficiency and complexity.

[FIG10] Examples of the background picture and the difference frame between the original picture and the background picture: (a) original picture, (b) difference frame, and (c) background picture.

To support applications like event detection and search, AVS2 also adds novel high-level syntax to describe a region of interest (ROI). In the region extension, the region number, event ID, and the coordinates of the top-left and bottom-right corners are included to indicate which ROI it is, what event happened, and where it lies.
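Before turning to the performance comparison, the following sketch illustrates the background-picture and background-difference-prediction ideas from this section: a G-picture is modeled as the per-pixel temporal median of a window of frames, and the current frame is then predicted from it. This is a conceptual illustration under those assumptions, not the AVS2 encoder's actual modeling or mode-decision logic, and `model_g_picture` and `background_difference` are illustrative names.

```python
# A sketch of background modeling (temporal median) and background difference
# prediction, assuming a static camera and a short window of luma frames.
import numpy as np

def model_g_picture(frames: np.ndarray) -> np.ndarray:
    """frames: (T, H, W) stack of luma pictures -> per-pixel temporal median."""
    return np.median(frames, axis=0).astype(frames.dtype)

def background_difference(current: np.ndarray, g_picture: np.ndarray) -> np.ndarray:
    """Residual left after using the background picture as the predictor."""
    return current.astype(np.int16) - g_picture.astype(np.int16)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    background = rng.integers(0, 256, (48, 64), dtype=np.uint8)
    frames = np.stack([background] * 9)        # a perfectly stable scene
    current = background.copy()
    current[10:20, 10:20] = 255                # a small foreground object
    g = model_g_picture(frames)
    resid = background_difference(current, g)
    print("nonzero residual samples:", int(np.count_nonzero(resid)))  # ~ object area only
```

Because the residual is nonzero only where the foreground differs from the modeled background, the bits spent on the (dominant) background region collapse, which is the effect described above for surveillance content.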
PERFORMANCE COMPARISON
The major target applications of AVS2 are high-quality TV broadcasting and scene videos. For high-quality broadcasting, RA is necessary and can be achieved by inserting intraframes at a fixed interval, e.g., 0.5 s, while for high-quality video capture and editing, all-intra (AI) coding is required. For scene video applications, e.g., video surveillance or videoconferencing, low delay (LD) must be guaranteed. According to these applications, we tested

[FIG11] A performance comparison between AVS2 and HEVC for surveillance videos, plotting PSNR (dB) versus bit rate (kb/s): (a) Main Road and (b) Over a Bridge.

[TABLE 3] BIT-RATE SAVING OF AVS2 IN COMPARISON WITH AVS1 AND HEVC.
| Sequences | AI: AVS2 vs. AVS1 | AI: AVS2 vs. HEVC | RA: AVS2 vs. AVS1 | RA: AVS2 vs. HEVC | LD: AVS2 vs. HEVC |
|---|---|---|---|---|---|
| UHD | 31.2% | 2.4% | 50.3% | -0.4% | — |
| 1080p | 33% | 0.8% | 50.3% | 0.3% | — |
| 1200p | — | — | — | — | 37.9% |
| SD | — | — | — | — | 26.2% |
| Overall | 32.1% | 1.6% | 50.3% | -0.1% | 32.1% |