[standards IN A NUTSHELL]
AVS2—Making Video Coding Smarter
Siwei Ma, Tiejun Huang, Cliff Reader, and Wen Gao
AVS2 is a new generation of
video coding standard devel-
oped by the IEEE 1857
Working Group under
project 1857.4. AVS2 is
also the second-generation video coding
standard established by the Audio and
Video Coding Standard (AVS) Working
Group of China; the first-generation
AVS1 was developed by the AVS Working
Group and issued as Chinese national
standard GB/T 20090.2-2006 in 2006.
The AVS Working Group was founded in
2002 and is dedicated to providing the
digital audio-video industry with highly
efficient and economical coding/decod-
ing technologies. So far, the AVS1 video
coding standard is widely implemented
in regional broadcasting, communica-
tion, and digital video entertainment sys-
tems. As the successor of AVS1, AVS2 is
designed to achieve significant coding
efficiency improvements relative to the
preceding H.264/MPEG4-AVC and AVS1
standards. The basic coding framework
of AVS2 is similar to that of the contemporaneous HEVC/H.265, but AVS2 can provide more
efficient compression for certain video
applications such as surveillance as well
as low-delay communication such as vid-
eoconferencing. AVS2 is making video
coding smarter by adopting intelligent
coding tools that not only improve cod-
ing efficiency but also help with com-
puter vision tasks such as object
detection and tracking.
BACKGROUND
The AVS Working Group was established in March 2002 in China. The mandate of the group is to establish generic technical standards for the compression, decoding, processing, and representation of digital audio-video content, thereby enabling digital audio-video equipment and systems with highly efficient and economical coding/decoding technologies. After more than a decade, the working group has published a series of standards, including AVS1, which is the culmination of the first stage of work.

Table 1 shows the time line of the AVS1 video coding standard (AVS1 for short). In AVS1, six profiles were defined to meet the requirements of various applications. The Main Profile focuses on digital video applications like commercial broadcasting and storage media, including high-definition video applications; it was approved as a national standard in China, GB/T 20090.2-2006. It was followed by the Enhanced Profile, an extension of the Main Profile with higher coding efficiency, targeting the needs of multimedia entertainment, such as movie compression for high-density storage. The Surveillance Baseline and Surveillance Profiles focus on video surveillance applications, considering in particular the characteristics of surveillance videos, i.e., high noise levels, relatively low encoding complexity, and requirements for easy event detection and search. The Portable Profile targets mobile video applications with lower resolution, low computational complexity, and robust error resiliency to suit the wireless environment. The latest Broadcasting Profile is again an improvement on the Main Profile and targets high-quality, high-definition TV (HDTV) broadcasting; it was approved and published as an industry standard by the State Administration of Radio, Film, and Television of China in July 2012.
[TABLE 1] Time line of the AVS1 video coding standard.

TIME             PROFILE                 TARGET APPLICATION(S)         MAJOR CODING TOOLS
December 2003    Main                    TV broadcasting               8x8 block-based intraprediction, transform, and deblocking filter; variable-block-size motion compensation (16x16 to 8x8)
June 2008        Surveillance Baseline   Video surveillance            Background-predictive picture for video coding, adaptive weighting quantization (AWQ), core frame coding
September 2008   Enhanced                Digital cinema                Context binary arithmetic coding (CBAC), AWQ
July 2009        Portable                Mobile video communication    8x8/4x4 block transform
July 2011        Surveillance            Video surveillance            Background modeling based coding
May 2012         Broadcasting            HDTV                          AWQ, enhanced field coding
AVS standards are also being recognized internationally. In 2007, the Main Profile was accepted as an option of video
codecs for Internet Protocol Television
(IPTV) applications by the International
Telecommunication Union–Telecommu-
nication Standardization Sector (ITU-T)
Focus Group on IPTV standardization [1].
The IEEE 1857 Working Group was
established in 2012 to work on IEEE
standards for advanced audio and video
coding, based on individual members of
the IEEE Standards Association from the
AVS Working Group. The IEEE 1857
Working Group meets three to four
times annually to discuss standard technologies, syntax, and related issues. To date, the IEEE 1857 Working Group has finished three parts of the IEEE 1857 stan-
dards, including IEEE 1857-2013 for
video, IEEE 1857.2-2013 for audio, and
IEEE 1857.3-2013 for system [2].
AVS standards have been developed in
compliance with the AVS intellectual prop-
erty rights (IPR) policy. This policy includes an up-front commitment by participants to license essential patents with a declaration of default licensing terms: royalty-free without compensation [RAND-RF, i.e., otherwise under reasonable and nondiscriminatory (RAND) terms], participation in the AVS patent pool, or RAND. The dis-
closure of published patent applications
and granted patents is required, and the
existence of unpublished applications must also be disclosed if the RAND option is taken.
The licensing terms are also considered in
the adoption of proposals for AVS stan-
dards when all technical factors are equal.
Reciprocity in licensing is required. The
protection of participants' IPR is provided
to guard against the situation in which the
IPR of a participant are disclosed by
another party. AVS has encouraged the
establishment of a Patent Pool Administra-
tion (PPA) that is independent from the
AVS Working Group, which only focuses
on the standards. The AVS standards are
also fully compliant with the IPR policy of
IEEE standards.
Based on the success of AVS1 and recent research and standardization work,
AVS has been working on a new generation
of video coding technologies called AVS2
(or more specifically, Part 2 in the AVS2
series standards). In fact, since 2005, before the AVS2 project officially started, AVS had been continuously working on an
AVS-X project to explore more efficient
coding techniques. AVS2 was started for-
mally by issuing a call for platforms in
March 2012. By October 2012, a reference
platform (RD 1.0) based on the AVS1 reference software was developed for AVS2 [3].

[FIG1] The coding framework of an AVS2 encoder (intra-/interprediction with motion estimation, transform/quantization, entropy coding, inverse transform/dequantization, loop filtering, and a frame buffer).
After that, AVS2 continued to improve its coding efficiency, and committee draft 2.0 of the standard was finalized in June 2014. It has been approved as an IEEE standards project, IEEE 1857.4, and as a Chinese national standard project, both of which were expected to be finalized by the end of 2014 at the time of this writing.
As a successor of AVS1, AVS2 is designed
to improve coding efficiency for higher-res-
olution videos and provide efficient com-
pression solutions for various kinds of video
applications. Compared to the preceding
coding standards, AVS2 adopts smarter cod-
ing tools that are adapted to satisfy the new
requirements identified from emerging
applications. First, more flexible prediction
block partitions are used to further improve
prediction accuracy, e.g., square and non-
square partitions, which are more adaptive
to the image content especially in edge
areas. Related to the prediction structure,
transform block size is more flexible and can be up to 64x64 pixels. After transfor-
mation, context adaptive arithmetic coding
is used for the entropy coding of the trans-
formed coefficients. A two-level coefficient
scan and coding method can encode the
coefficients of large blocks more efficiently.
Moreover, for low-delay communication
applications, e.g., video surveillance, video
conferencing, etc., where the background usually does not change often, a background picture model-based coding method is developed in AVS2.
The background picture constructed from
original pictures or decoded pictures is used
as a reference picture to improve prediction
efficiency. Test results show that this back-
ground picture-based prediction coding can
improve coding efficiency significantly. Fur-
thermore, the background picture can also
be used for object detection and tracking
for intelligent surveillance. In addition, to
support object tracking among multiple
cameras in surveillance applications, navigation information, such as that from the Global Positioning System and the BeiDou Navigation Satellite System of China, is also
defined, which mainly includes timing,
location, and movement information.
Finally, aiming at more intelligent surveil-
lance video coding, AVS2 also started a
digital media content description project in which visual objects in the images or videos are described with multilevel features for facilitating visual object-based storage, retrieval, and interactive applications.

[FIG2] (a) The maximum possible recursive CU structure in AVS2 (LCU size = 64, maximum hierarchical depth = 4). (b) Possible PU splittings for skip/direct, intra, and inter modes in AVS2, including symmetric and asymmetric prediction (d = 1, 2 for intraprediction and d = 0, 1, 2 for interprediction).
This column will provide a short
overview of AVS2 video coding technol-
ogy and a performance comparison with
other video coding standards.
TECHNOLOGY AND KEY FEATURES
Similar to previous coding standards,
AVS2 adopts the traditional prediction/
transform hybrid coding framework, as
shown in Figure 1. Within the framework,
a more flexible coding structure is
adopted for efficient high-resolution video
coding, and more efficient coding tools
are developed to make full use of the textural information and temporal redundancies. These tools can be classified into
four categories: 1) prediction coding
(including intraprediction and interpre-
diction), 2) transform, 3) entropy coding,
and 4) in-loop filtering. We will give a
brief introduction to the coding frame-
work and coding tools.
Coding Framework
In AVS2, a coding unit (CU)-, prediction
unit (PU)-, and transform unit (TU)-based
coding/prediction/transform structure is
adopted to represent and organize the
encoded data [3]. First, pictures are split
into largest coding units (LCUs), which
consist of N
samples of a lumi-
nance component and associated chromi-
nance samples with
or 32. One
LCU can be a single CU or can be split into
four smaller CUs with a quad-tree parti-
tion structure; a CU can be recursively
split until it reaches the smallest CU size
limit, as shown in Figure 2(a). Once the
splitting of the CU hierarchical tree is
N 8 16
2#
N
=
2
,
,
IEEE SIGNAL PROCESSING MAGAZINE [175] MARCh 2015
finished, the leaf node CUs can be further
split into PUs. PU is the basic unit for
intra- and interprediction and allows mul-
tiple different shapes to encode irregular
image patterns, as shown in Figure 2(b).
The size of a PU is limited to that of a CU
with various square or rectangular shapes.
More specifically, both intra- and interpre-
diction partitions can be symmetric or
asymmetric. Intraprediction partitions vary in the set {2Nx2N, NxN, 2Nx0.5N, 0.5Nx2N}, while interprediction partitions vary in the set {2Nx2N, 2NxN, Nx2N, NxN, 2NxnU, 2NxnD, nLx2N, nRx2N}, where n = 0.25N and U, D, L, and R are abbreviations of "up," "down," "left," and "right," respectively. Besides CU and PU, TU is also defined to represent the basic unit for transform coding and quantization. The size of a TU cannot exceed that of a CU, but it is independent of the PU size.
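To make the recursion concrete, the following Python sketch enumerates an AVS2-style CU quad-tree decision from a 64x64 LCU down to the smallest CU. The cost function, the split criterion, and all function names are illustrative placeholders for this column, not anything defined by the standard.

# Minimal sketch of a CU quad-tree recursion (illustrative only).
# An LCU of size 64 is recursively split into four square CUs down to size 8;
# rd_cost is a hypothetical stand-in for a real rate-distortion measure.

def rd_cost(x, y, size):
    # Placeholder: a real encoder would evaluate prediction/transform RD cost here.
    return size * size

def best_cu_partition(x, y, size, min_size=8):
    """Return (cost, tree), where tree describes the chosen CU partition."""
    cost_no_split = rd_cost(x, y, size)
    if size == min_size:
        return cost_no_split, ("leaf", x, y, size)

    half = size // 2
    sub = [best_cu_partition(x + dx, y + dy, half, min_size)
           for dx in (0, half) for dy in (0, half)]
    cost_split = sum(c for c, _ in sub)

    if cost_split < cost_no_split:               # split_flag = 1
        return cost_split, ("split", [t for _, t in sub])
    return cost_no_split, ("leaf", x, y, size)   # split_flag = 0

if __name__ == "__main__":
    cost, tree = best_cu_partition(0, 0, 64)     # one 64x64 LCU
    print(cost, tree[0])

With the trivial placeholder cost the recursion never chooses to split; the point of the sketch is only the CU/split-flag structure, not the decision rule.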
[FIG3] An illustration of the directional intraprediction modes: 30 angular modes (indices 3-32, arranged in three zones) plus the DC (0), plane (1), and bilinear (2) modes.
Intraprediction
Intraprediction is used to reduce the
redundancy existing in the spatial domain
of the picture. Block partition-based direc-
tional prediction is used for AVS2 [5]. As
shown in Figure 2, besides the square PU
partitions, nonsquare partitions, called
short distance intra prediction (SDIP), are
adopted by AVS2 for more efficient intralu-
minance prediction [4], where the nearest
reconstructed boundary pixels are used as
the reference sample in intraprediction.
For SDIP, a 2Nx2N PU is horizontally or vertically partitioned into four prediction
blocks. SDIP is more adaptive to the image
content, especially in edge areas. However, for complexity reduction, SDIP is used for all CU sizes except the 64x64 CU. For each
prediction block in the partition modes, a
total of 33 prediction modes are supported
for luminance, including 30 angular
modes [5], a plane mode, a bilinear mode,
and a DC mode. Figure 3 shows the distri-
bution of the prediction directions associ-
ated with the 30 angular modes. Each
sample in a PU is predicted by projecting
its location onto the reference pixels along the selected prediction direction. To improve intraprediction accuracy, subpixel-precision reference samples must be interpolated if the projected reference samples fall at a noninteger position. The noninteger position is restricted to 1/32-sample precision to avoid floating-point operations, and a four-tap linear interpolation filter is used to generate the subpixel samples.
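The 1/32-sample projection can be pictured with the short sketch below. For simplicity it uses plain two-tap linear interpolation between the two nearest reference samples; AVS2 itself applies a four-tap filter whose coefficients are defined in the specification, and the function and variable names here are illustrative.

# Illustrative sketch of 1/32-precision reference projection for angular
# intraprediction. Two-tap linear interpolation stands in for the normative
# four-tap filter.

def predict_sample(refs, base_idx, offset_32):
    """Predict one pixel from the reference row refs.

    base_idx  -- integer part of the projected reference position
    offset_32 -- fractional part of the projection, in 1/32-sample units
    """
    frac = offset_32 & 31
    a, b = refs[base_idx], refs[base_idx + 1]
    # Weighted average of the two neighboring references, with rounding.
    return ((32 - frac) * a + frac * b + 16) >> 5

refs = [100, 102, 110, 130, 160, 180, 190, 195]
print(predict_sample(refs, 3, 20))   # sample projected 20/32 past refs[3]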
For the chrominance component, the
PU size is always NxN, and five prediction
modes are supported, including vertical pre-
diction, horizontal prediction, bilinear pre-
diction, DC prediction, and the prediction
mode derived from the corresponding lumi-
nance prediction mode [6].
Interprediction
Compared to the spatial intraprediction,
interprediction focuses on exploiting the
temporal correlation between the consec-
utive pictures to reduce the temporal
redundancy. Multireference prediction has
been used since the H.264/AVC standard,
including both short-term and long-term
reference pictures. In AVS2, long-term ref-
erence picture usage is extended further,
which can be constructed from a sequence
of long-term decoded pictures, e.g., back-
ground picture used in surveillance cod-
ing, which will be discussed separately
later. For short-term reference prediction
in AVS2, F frames are defined as a special
P frame [7], in addition to the traditional P
and B frames. More specifically, a P frame
is a forward-predicted frame using a single
reference picture, while a B frame is a
bipredicted frame that consists of forward,
backward, biprediction, and symmetric prediction, using two reference frames.

[FIG4] (a) Temporal multihypothesis mode. (b) Spatial multihypothesis mode.
In a B frame, in addition to the
conventional forward, backward, bi-
directional, and skip/direct prediction
modes, symmetric prediction is defined as a
special biprediction mode, wherein only
one forward motion vector (MV) is coded
and the backward MV is derived from the
forward MV. For an F frame, besides the
conventional single hypothesis prediction
mode in a P frame, multihypothesis tech-
niques are added for more efficient predic-
tion, including the advanced skip/direct
mode [8], temporal multihypothesis predic-
tion mode [9], and spatial directional multi-
hypothesis (DMH) prediction mode [10].
In an F frame, an advanced skip/direct
mode is defined using a competitive
motion derivation mechanism. Two deri-
vation methods are used: one is temporal
and the other is spatial. Temporal multihy-
pothesis mode combines two predictors
along the predefined temporal direction,
while spatial multihypothesis mode com-
bines two predictors along the predefined
spatial direction. For temporal derivation,
the prediction block is obtained by an aver-
age of the prediction blocks indicated by
the MV prediction (MVP) and the scaled
MV in a second reference. The second ref-
erence is specified by the reference index
transmitted in the bit stream. For tempo-
ral multihypothesis prediction, as shown
in Figure 4, one predictor ref_blk1 is gen-
erated with the best MV and reference frame ref1 found by motion estimation, and this MV is then linearly
scaled to a second reference to generate
another predictor ref_blk2. The second
reference ref2 is specified by the reference
index transmitted in the bit stream. In
DMH mode, as specified in Figure 4, the
seed predictors are located on the line
crossing the initial predictor obtained
from motion estimation. The number of
seed predictors is restricted to eight. If one
seed predictor is selected for combined
prediction, for example “Mode 1,” then the
index of the seed predictor “1” will be sig-
naled in the bit stream.
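A rough sketch of the temporal multihypothesis idea follows: the MV found for the first reference is scaled by the ratio of temporal distances to point into a second reference, and the two prediction blocks are averaged. The distance-based scaling and the simple rounded average shown here are illustrative, not the normative AVS2 equations, and fetch_block is a hypothetical helper.

# Illustrative sketch of temporal multihypothesis prediction: one predictor
# comes from the best (MV, ref1) pair found by motion estimation, the second
# from the same MV linearly scaled to a second reference, ref2.

def scale_mv(mv, dist1, dist2):
    """Scale an MV from a reference at temporal distance dist1 to dist2."""
    mvx, mvy = mv
    return (round(mvx * dist2 / dist1), round(mvy * dist2 / dist1))

def temporal_multihypothesis(fetch_block, mv, dist1, dist2):
    """Average the two hypotheses; fetch_block(ref, mv) returns a 2-D block."""
    blk1 = fetch_block(1, mv)                          # ref_blk1 from ref1
    blk2 = fetch_block(2, scale_mv(mv, dist1, dist2))  # ref_blk2 from ref2
    return [[(a + b + 1) >> 1 for a, b in zip(r1, r2)]
            for r1, r2 in zip(blk1, blk2)]

# Tiny usage example with a dummy block fetcher.
dummy = lambda ref, mv: [[100 + ref, 102 + ref], [104 + ref, 106 + ref]]
print(temporal_multihypothesis(dummy, (6, -2), 2, 4))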
For spatial derivation, the prediction
block may be obtained from one or two
prediction blocks specified by the motion
copied from its spatial neighboring
blocks. The neighboring blocks are illustrated in Figure 5. They are searched in a predefined order F, G, C, A, B, D, and the selected neighboring block is signaled in the bit stream.

[FIG5] An illustration of neighboring blocks A, B, C, D, F, and G for MVP.
Motion Vector Prediction and Coding
MVP plays an important role in interpre-
diction, which can reduce the redundancy
among MVs of neighboring blocks and
thus save large numbers of coding bits for
MVs. In AVS2, four different prediction
methods are adopted, as tabulated in
Table 2. Each of them has its unique
usage. Spatial MVP is used for the spatial
derivation of Skip/Direct mode in F frames
and B frames. Temporal MVP is used for
temporal derivation of Skip/Direct mode
in P frames and F frames. Spatial-tempo-
ral-combined MVP is used for the joint
temporal and spatial derivation of Skip/
Direct mode in B frames. For other cases,
median prediction is used.

[TABLE 2] MV prediction methods in AVS2.
METHOD                       DETAILS
Median                       Uses the median MV value of the neighboring blocks.
Spatial                      Uses the MVs of spatial neighboring blocks.
Temporal                     Uses the MVs of temporally collocated blocks.
Spatial-temporal combined    Uses the temporal MVP first if it is available; the spatial MVP is used instead if the temporal MVP is not available.
In AVS2, the MV is in quarter-pixel
precision for the luminance component,
and the subpixel is interpolated with an
eight-tap DCT interpolation filter (DCT-
IF) [11]. For the chrominance compo-
nent, the MV is derived from the luminance MV with 1/8-pixel precision, and a four-tap DCT-IF is used for subpixel interpolation
[12]. After the MVP, the MV difference
(MVD) is coded in the bit stream. How-
ever, redundancy may still exist in MVD,
and to further save coding bits of MVs, a
progressive MV resolution adaptation
method is adopted in AVS2 [13]. In this scheme, the MVP is first rounded to the nearest integer sample position, and the MV is then rounded to half-pixel precision if its distance from the MVP is larger than a threshold; that is, the resolution of the MVD is decreased to half-pixel precision when the MVD exceeds the threshold.
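The progressive-resolution idea can be sketched for one MV component as below. The threshold value, the rounding conventions, and the function name are illustrative assumptions rather than the constants used by AVS2.

# Illustrative sketch of progressive MV resolution adaptation (one component).
# MVs are stored in quarter-pel units; once the difference from the rounded
# MVP exceeds a threshold, only half-pel precision is transmitted.

THRESHOLD_QPEL = 16  # hypothetical threshold, in quarter-pel units

def encode_mvd(mv, mvp):
    mvp_int = (mvp + 2) // 4 * 4          # round the MVP to the nearest integer pel
    mvd = mv - mvp_int
    if abs(mvd) > THRESHOLD_QPEL:
        mvd = (mvd // 2) * 2              # keep only half-pel precision
        return mvd, "half-pel"
    return mvd, "quarter-pel"

print(encode_mvd(mv=45, mvp=10))   # large difference -> coarser MVD
print(encode_mvd(mv=13, mvp=10))   # small difference -> quarter-pel MVD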
Transform
Two-level transform coding is utilized to further compress the predicted residual. For a CU with a symmetric prediction unit partition, the TU size can be 2Nx2N or NxN, signaled by a transform split flag. Thus, the maximum transform size is 64x64 and the minimum transform size is 4x4. For TU sizes from 4x4 to 32x32, an integer transform (IT) that
closely approximates the performance of
the discrete cosine transform (DCT) is
used, while for the 64x64 transform, a
logical transform (LOT) [14] is applied to
the residual. A 5/3-tap integer wavelet transform is first performed on the 64x64 block, discarding the low-high (LH), high-low (HL), and high-high (HH) bands, and then a normal 32x32 IT is applied to the low-low (LL) band. For a CU
that has an asymmetric PU partition, a 2Nx2N IT is used in the first level and a nonsquare transform [15] is used in the second level, as shown in Figure 6. Moreover, in the latest AVS2 standard, a secondary transform was adopted for the intraprediction residual (for more details, see the latest AVS specification document N2120 on the AVS FTP Web site [21]).

[FIG6] A PU partition and two-level transform coding.
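The 64x64 logical transform can be pictured with the sketch below: a one-level 5/3 lifting wavelet keeps only the LL band, which is then passed to a 32x32 transform. Boundary handling is simplified, and a plain floating-point DCT matrix stands in for the AVS2 32x32 integer transform, so this is a structural illustration rather than the normative LOT.

import numpy as np

# Illustrative sketch of the 64x64 logical transform (LOT): one level of a 5/3
# integer wavelet keeps only the LL band, which then goes through a 32x32
# transform. A float DCT-II matrix stands in for the AVS2 integer transform.

def lift53_1d(x):
    """One level of the 5/3 lifting wavelet; returns the lowpass half."""
    even, odd = x[0::2].astype(np.int64), x[1::2].astype(np.int64)
    detail = odd - ((even + np.roll(even, -1)) >> 1)           # predict step
    approx = even + ((detail + np.roll(detail, 1) + 2) >> 2)   # update step
    return approx

def lot_64(block):
    """Apply the sketch LOT to a 64x64 residual block, giving 32x32 coefficients."""
    ll = np.apply_along_axis(lift53_1d, 1, block)   # rows    -> 64x32
    ll = np.apply_along_axis(lift53_1d, 0, ll)      # columns -> 32x32
    n = 32                                          # stand-in 32x32 transform
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    dct = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2 / n)
    dct[0] /= np.sqrt(2)
    return dct @ ll @ dct.T

residual = np.random.randint(-32, 32, (64, 64))
print(lot_64(residual).shape)   # (32, 32)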
Entropy Coding
After transform and quantization, a two-
level coding scheme is applied to the
transform coefficient blocks [16]. A coeffi-
cient block is partitioned into 4x4 coefficient groups (CGs), as shown in
Figure 7. Then zig-zag scanning and con-
text-adaptive binary arithmetic coding
(CABAC) is performed at both the CG
level and coefficient level. At the CG level
for a TU, the CGs are scanned in zig-zag
order, and the CG position indicating the
position of the last nonzero CG is coded
first, followed by a bin string of significant
CG flags indicating whether the CG
scanned in zig-zag order contains non-
zero coefficients. At the coefficient level,
for each nonzero CG, the coefficients are
further scanned into the form of (run,
level) pair in zig-zag order. Level and run
refer to the magnitude of a nonzero coeffi-
cient and the number of zero coefficients
between two nonzero coefficients, respec-
tively. For the last CG, the coefficient posi-
tion that denotes the position of the last
nonzero coefficient in scan order is coded
first. For a nonlast CG, a last run is coded that denotes the number of zero coefficients after the last nonzero coefficient in zig-zag scan order. The (run, level) pairs in a CG are then coded in reverse zig-zag scan order.

[FIG7] A subblock scan for transform blocks of size (a) 8x8, (b) 16x16, and (c) 32x32; each subblock represents a 4x4 CG.
For the context modeling used in the
CABAC, AVS2 employs a mode-depen-
dent context selection design for intra-
prediction blocks [17]. In this context
design, 34 intraprediction modes are
classified into three prediction mode
sets: vertical, horizontal, and diagonal.
Depending on the prediction mode set,
each CG is divided into two regions, as shown in Figure 8. The intraprediction modes and CG regions are applied in the context coding of syntax elements, including the last CG position, the last coefficient position, and the run value.

[FIG8] Subblock region partitions of a 4x4 CG in an intraprediction block.
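A compact sketch of the two-level coefficient coding described above: the transform block is tiled into 4x4 CGs, CGs are visited in zig-zag order, and each nonzero CG is turned into (run, level) pairs. The zig-zag ordering helper and the pairing are illustrative; binarization and CABAC context modeling are omitted entirely.

# Illustrative sketch of two-level coefficient coding: a transform block is
# tiled into 4x4 coefficient groups (CGs), CGs are visited in zig-zag order,
# and each nonzero CG is converted to (run, level) pairs.

def zigzag(n):
    """Return (row, col) positions of an n x n block in zig-zag order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[1] if (p[0] + p[1]) % 2 else p[0]))

def code_block(coeffs):
    """coeffs: 2-D list of quantized coefficients (size divisible by 4)."""
    n = len(coeffs)
    coded = []
    for gr, gc in zigzag(n // 4):                        # CG-level zig-zag scan
        cg = [coeffs[gr * 4 + r][gc * 4 + c] for r, c in zigzag(4)]
        if not any(cg):
            coded.append(((gr, gc), None))               # significant-CG flag = 0
            continue
        run, pairs = 0, []
        for level in cg:                                 # coefficient-level scan
            if level == 0:
                run += 1
            else:
                pairs.append((run, level))
                run = 0
        coded.append(((gr, gc), pairs))                  # trailing zeros -> last run
    return coded

blk = [[10, 0, 0, 0, 3, 0, 0, 0]] + [[0] * 8 for _ in range(7)]
print(code_block(blk))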
In-Loop Filtering
Artifacts such as blocking artifacts, ring-
ing artifacts, color biases, and blurring
artifacts are quite common in com-
pressed video, especially at medium and low bit rates. To suppress those artifacts,
deblocking filtering, sample adaptive off-
set (SAO) filtering [18], and adaptive
loop filter (ALF) [19] are applied to the
reconstructed pictures sequentially.
Deblocking filtering aims to remove
the blocking artifacts caused by block
transform and quantization. The basic unit
for the deblocking filter is an 8x8 block. For each 8x8 block, the deblocking filter is applied only if the boundary is also a CU, PU, or TU boundary.
After the deblocking filter, an SAO fil-
ter is applied to reduce the mean sample
distortion of a region, where an offset is
added to the reconstructed sample to
reduce ringing artifacts and contouring
artifacts. There are two kinds of offset:
edge offset (EO) and band offset (BO)
mode. For the EO mode, the encoder can
select and signal a vertical, horizontal,
downward-diagonal, or upward-diagonal
filtering direction. For BO mode, an off-
set value that directly depends on the
amplitudes of the reconstructed samples
is added to the reconstructed samples.
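The EO mode can be pictured as follows: along the signaled direction, each sample is compared with its two neighbors, classified into an edge category (local valley, concave or convex edge, local peak), and the offset for that category is added. The category numbering, direction table, and offset values below are illustrative, not the normative AVS2 definitions.

# Illustrative sketch of SAO edge-offset (EO) filtering along one direction.

DIRECTIONS = {"horizontal": (0, 1), "vertical": (1, 0),
              "down-diag": (1, 1), "up-diag": (-1, 1)}

def sao_eo(samples, direction, offsets):
    dy, dx = DIRECTIONS[direction]
    h, w = len(samples), len(samples[0])
    out = [row[:] for row in samples]
    for y in range(h):
        for x in range(w):
            ny0, nx0, ny1, nx1 = y - dy, x - dx, y + dy, x + dx
            if not (0 <= ny0 < h and 0 <= nx0 < w and 0 <= ny1 < h and 0 <= nx1 < w):
                continue                     # boundary samples left unfiltered
            c = samples[y][x]
            sign = (c > samples[ny0][nx0]) - (c < samples[ny0][nx0]) \
                 + (c > samples[ny1][nx1]) - (c < samples[ny1][nx1])
            # sign: -2 valley, -1 concave edge, +1 convex edge, +2 peak, 0 none
            out[y][x] = c + offsets.get(sign, 0)
    return out

rec = [[10, 12, 8, 12], [11, 5, 13, 12], [10, 12, 12, 12]]
print(sao_eo(rec, "horizontal", {-2: 2, -1: 1, 1: -1, 2: -2}))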
ALF is the last stage of in-loop filtering.
There are two stages in this process. The
first stage is filter coefficient derivation. To
train the filter coefficients, the encoder
classifies reconstructed pixels of the lumi-
nance component into 16 categories, and
one set of filter coefficients is trained for
each category using Wiener–Hopf equa-
tions to minimize the mean squared error
between the original frame and the recon-
structed frame. To reduce the redundancy
between these 16 sets of filter coefficients,
the encoder will adaptively merge them
based on the rate-distortion performance. At most, 16 different filter sets can
be assigned for the luminance component
and only one for the chrominance compo-
nents. The second stage is a filter decision,
which includes both the frame level and
LCU level. First, the encoder decides
whether frame-level adaptive loop filtering
is performed. If frame level ALF is on, then
the encoder further decides whether the
LCU level ALF is performed.
Smart Scene Video Coding
More and more videos are being captured in specific scenes (such as surveillance video and videos from the classroom, home, courthouse, etc.), which are characterized by a temporally stable background. The redun-
dancy originating from the background
could be further reduced. AVS2 developed
a background picture model-based coding
method [20], which is illustrated in
Figure 9. G-pictures and S-pictures are
defined to further exploit the temporal redundancy and to facilitate video analysis tasks such as object segmentation and
motion detection. The G-picture is a spe-
cial I-picture, which is stored in a separate
background memory. The S-picture is a special P-picture that can only be predicted from a reconstructed G-picture or a
virtual G-picture, which does not exist in
the actual input sequence but is modeled
from input pictures and encoded into the
stream to act as a reference picture.
The G-picture is initialized by back-
ground initialization and updated by
background modeling with methods such
as median filtering, fast implementation
of a Gaussian mixture model, etc. In this
way, the selected or generated G- picture
can well represent the background of a
scene with rare occluding foreground
objects and noise. Once a G-picture is
obtained, it is encoded and the recon-
structed picture is stored into the back-
ground memory in the encoder/decoder
and updated only if a new G-picture is
selected or generated. After that,
S-pictures can be involved in the encoding process by an S-picture decision.

[FIG9] Background picture-based scene coding in AVS2 (G-picture initialization and background modeling feed a background memory that supports the S-picture decision, background reference selection, and background difference prediction within the hybrid coding loop).

[FIG10] Examples of the background picture and the difference frame between the original picture and the background picture: (a) original picture, (b) difference frame, and (c) background picture.
Except that it uses a G-picture as a reference, the S-picture has properties similar to those of a traditional I-picture, such as error resilience and random access (RA). Therefore, pictures that would otherwise be coded as traditional I-pictures, e.g., the first picture of a group of pictures or a scene-change picture, are candidates to be S-pictures.
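As a minimal sketch of how a G-picture might be built, the code below forms a per-pixel temporal median over a window of input pictures and then computes the background difference used for prediction. The window length and the median choice are illustrative assumptions; as noted above, real encoders may instead use fast approximations or Gaussian-mixture background modeling.

import numpy as np

# Minimal sketch of G-picture construction by per-pixel temporal median
# filtering, plus the background difference that foreground prediction uses.

def build_g_picture(frames):
    """frames: list of equally sized 2-D arrays (grayscale pictures)."""
    stack = np.stack(frames, axis=0)
    return np.median(stack, axis=0).astype(stack.dtype)

def background_difference(frame, g_picture):
    """Residual used by background difference prediction (foreground stands out)."""
    return frame.astype(np.int32) - g_picture.astype(np.int32)

rng = np.random.default_rng(0)
background = rng.integers(80, 120, (4, 4))
frames = [background + rng.integers(-2, 3, (4, 4)) for _ in range(9)]
frames[4][1:3, 1:3] = 255                      # a moving object in one frame
g_pic = build_g_picture(frames)
print(background_difference(frames[4], g_pic))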
Besides providing more prediction opportunities for the background blocks that normally dominate a picture, an
additional benefit from the background
picture is a new prediction mode called
background difference prediction, as
shown in Figure 10, which can improve
foreground prediction performance by
excluding the background influence. It
can be seen that, after background differ-
ence prediction, the background redun-
dancy is effectively removed. Furthermore,
according to the prediction modes in the
AVS2 compression bit stream, the blocks of
an AVS2 picture could be classified as back-
ground blocks, foreground blocks, or blocks in edge areas. Obviously, this
information is very helpful for possible
subsequent vision tasks such as object
detection and tracking. Object-based cod-
ing has already been proposed in MPEG-4;
however, object segmentation remains a
challenging problem, which constrains
the application of object-based coding.
Therefore, AVS2 uses simple background
modeling instead of accurate object seg-
mentation, which is easier and provides a
good tradeoff between coding efficiency
and complexity.
To provide convenience for applica-
tions like event detection and searching,
AVS2 added some novel high-level syntax
to describe the region of interest (ROI). In
the region extension, the region number, the event ID, and the coordinates of the top-left and bottom-right corners are included to indicate which ROI it is, what event happened, and where it lies.
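The kind of information carried by the region extension can be pictured as a small record per region. The field names and the packed byte layout below are hypothetical illustrations of that information (region number, event ID, and corner coordinates); they do not reproduce the normative AVS2 syntax.

from dataclasses import dataclass
import struct

# Hypothetical sketch of an ROI region-extension record (illustrative only).

@dataclass
class RoiRegion:
    region_number: int
    event_id: int
    top_left: tuple      # (x, y)
    bottom_right: tuple  # (x, y)

    def pack(self) -> bytes:
        x0, y0 = self.top_left
        x1, y1 = self.bottom_right
        return struct.pack(">BBHHHH", self.region_number, self.event_id,
                           x0, y0, x1, y1)

roi = RoiRegion(region_number=1, event_id=3, top_left=(64, 32),
                bottom_right=(320, 240))
print(roi.pack().hex())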
PERFORMANCE COMPARISON
The major target applications of AVS2 are
high-quality TV broadcasting and scene
videos. For high-quality broadcasting, RA
is necessary and may be achieved by
inserting intraframes at a fixed interval, e.g., 0.5 s. For high-quality video capture and editing, all-intra coding (AI) is
required. For scene video applications,
e.g., video surveillance or videoconference,
low delay (LD) needs to be guaranteed.
According to the applications, we tested
[TABLE 3] Bit rate saving of AVS2: performance comparison with AVS1 and HEVC.

                 AI CONFIGURATION              RA CONFIGURATION              LD CONFIGURATION
SEQUENCES    AVS2 VERSUS   AVS2 VERSUS     AVS2 VERSUS   AVS2 VERSUS     AVS2 VERSUS
             AVS1          HEVC            AVS1          HEVC            HEVC
UHD          31.2%         2.4%            50.3%         −0.4%           —
1080p        33%           0.8%            50.3%         0.3%            —
1200p        —             —               —             —               37.9%
SD           —             —               —             —               26.2%
Overall      32.1%         1.6%            50.3%         −0.1%           32.1%

[FIG11] A performance comparison between AVS2 and HEVC for surveillance videos: (a) Main Road and (b) Over a Bridge (PSNR in dB versus bit rate in kb/s).