Overview of current research work:
Scalable video coding:
Due to advances in computer and network technology over the past decade, a
single workstation may serve as a personal computer, a high-definition TV, a
videophone, or a fax machine. The main transmission medium is a computer
network, which is often a heterogeneous environment consisting of a diverse
mixture of subnets and network users. In this situation, scalable transmission
of video is essential to serve different clients with widely varying display
and processing capabilities. Scalability refers to the ability of an algorithm
to decode only a certain part of the video bitstream and obtain video at the
desired quality or spatiotemporal resolution.
Since its introduction, subband/wavelet coding has emerged as a powerful method
for compressing still images and video. Besides its effectiveness in data
compression, the presence of subbands makes this scheme a natural choice for
scalability applications. In simple 3-D subband/wavelet schemes, the subband
decomposition is also extended to the temporal domain. Their performance can be
improved further with motion compensation.
MCTF with 2-tap Haar filters:
Most current schemes use 2-tap Haar filters for temporal analysis/synthesis.
In this approach, the temporal low subband is the motion-compensated average of
two frames, while the temporal high subband is the motion-compensated
difference. The low temporal subbands generated in such coders can be sent as
part of a low frame-rate sequence. With sub-pixel motion compensation, images
need to be interpolated for MC temporal filtering (MCTF), so a direct
implementation of the analysis/synthesis scheme is not invertible. To achieve
invertibility for arbitrary sub-pixel accuracy, a lifting scheme is used.
Besides its computational efficiency, this is the major advantage of the
lifting scheme.
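The invertibility argument can be illustrated with a minimal sketch. It assumes
1-D "frames" and a toy half-sample warp (linear interpolation with wrap-around)
standing in for sub-pixel motion compensation. The function names, the 1/2
update weight, and the omitted sqrt(2) normalization are illustrative
assumptions, not the coder described here.

```python
def half_pel_warp(frame):
    """Toy stand-in for sub-pixel motion compensation (an assumption):
    shift a 1-D frame by half a sample using linear interpolation with
    circular wrap-around. This operator is not invertible in general:
    for even lengths an alternating +1/-1 frame maps to all zeros."""
    n = len(frame)
    return [0.5 * (frame[i] + frame[(i + 1) % n]) for i in range(n)]


def haar_mctf_analysis(a, b, warp_fwd, warp_bwd):
    """Haar-type lifting on a frame pair (a, b)."""
    # Predict step: high band = b minus the motion-compensated a.
    high = [bi - wi for bi, wi in zip(b, warp_fwd(a))]
    # Update step: low band = a plus half the motion-compensated high band.
    low = [ai + 0.5 * wi for ai, wi in zip(a, warp_bwd(high))]
    return low, high


def haar_mctf_synthesis(low, high, warp_fwd, warp_bwd):
    """Undo the lifting steps in reverse order with the same warps."""
    a = [li - 0.5 * wi for li, wi in zip(low, warp_bwd(high))]
    b = [hi + wi for hi, wi in zip(high, warp_fwd(a))]
    return a, b


if __name__ == "__main__":
    a = [10, 12, 15, 11]
    b = [11, 14, 13, 10]
    low, high = haar_mctf_analysis(a, b, half_pel_warp, half_pel_warp)
    a_rec, b_rec = haar_mctf_synthesis(low, high, half_pel_warp, half_pel_warp)
    print(a_rec, b_rec)
```

Synthesis never inverts the warp itself. It recomputes the same warped signals
from data it already has and subtracts them, which is why reconstruction holds
(up to floating-point rounding) whatever interpolation the warp uses.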
MCTF with 5/3 filters:
Instead of the 2-tap Haar filter, a longer filter can make better use of the
correlation in the temporal domain. One main advantage of Haar filters over
longer filters is that they need a motion field only between every other pair
of input frames, rather than between every pair of consecutive frames.
In the current work, we use a lifting-based 3-D subband/wavelet coder with 5/3
filters for temporal filtering and unidirectional motion estimation. In this
approach, the backward motion field (i.e., the current frame comes before the
reference frame) is used for the MCTF with quarter-pixel accuracy, as
determined using hierarchical variable size block matching (HVSBM). We estimate
and transmit a backward motion field between every pair of consecutive frames
and infer the forward motion field from this backward motion field. All the
temporal subbands generated are then spatially analyzed and encoded using EZBC.
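The temporal 5/3 lifting structure can be sketched as follows. This is a
simplified illustration that assumes zero motion (no warping), an even number
of frames, frames stored as flat lists of floats, and symmetric extension at
both ends. The function names are hypothetical, and motion compensation, HVSBM,
and EZBC are deliberately left out.

```python
def lift53_analysis(frames):
    """Temporal 5/3 lifting over an even-length list of frames.
    Each frame is a flat list of floats. Zero motion is assumed."""
    n = len(frames)
    assert n >= 2 and n % 2 == 0
    even, odd = frames[0::2], frames[1::2]
    k = n // 2
    high = []
    for i in range(k):
        # Predict: odd frame minus the average of its two even neighbours.
        right = even[i + 1] if i + 1 < k else even[i]  # symmetric extension
        high.append([o - 0.5 * (a + b)
                     for o, a, b in zip(odd[i], even[i], right)])
    low = []
    for i in range(k):
        # Update: even frame plus a quarter of the neighbouring high bands.
        prev_h = high[i - 1] if i > 0 else high[0]  # symmetric extension
        low.append([e + 0.25 * (p + c)
                    for e, p, c in zip(even[i], prev_h, high[i])])
    return low, high


def lift53_synthesis(low, high):
    """Invert lift53_analysis by undoing update, then predict."""
    k = len(low)
    even = []
    for i in range(k):
        prev_h = high[i - 1] if i > 0 else high[0]
        even.append([l - 0.25 * (p + c)
                     for l, p, c in zip(low[i], prev_h, high[i])])
    frames = []
    for i in range(k):
        right = even[i + 1] if i + 1 < k else even[i]
        odd = [h + 0.5 * (a + b) for h, a, b in zip(high[i], even[i], right)]
        frames.append(even[i])
        frames.append(odd)
    return frames


if __name__ == "__main__":
    seq = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
    low, high = lift53_analysis(seq)
    print(low, high, lift53_synthesis(low, high))
```

Each high-band frame depends on the even frames on both sides of an odd frame,
which is where the need for a motion field between every pair of consecutive
frames comes from once warping is added to the predict and update steps.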
If we retain the fixed-size GOP structure of the Haar-filter MCTF, we need to
use symmetric extension at both ends of the GOP. This causes a loss of coding
efficiency at the GOP boundaries, resulting in significant PSNR drops there.
Performance can be considerably improved by using a 'sliding window' in place
of the GOP block. The 5/3 filter we employ is non-orthogonal, which causes PSNR
variation; this can be reduced by employing filter-based weighting
coefficients.
Overall, the longer filters have a higher coding gain than the Haar filters and
show a significant improvement in average PSNR at high bit rates. However, the
doubling of the number of motion vectors to be transmitted translates into a
drop in PSNR at lower video bit rates.
Motion vector estimation and encoding:
Since motion vector data are available between every pair of consecutive
frames, we can exploit the temporal redundancy in both motion vector estimation
and motion vector coding.
Motion estimation can either be done independently at each temporal level, or
the MVs at the previous level can be used as the starting point for the current
level.
We perform a pixel-by-pixel vector addition of two motion fields at the
previous level and use the result as the starting point for the motion vector
search. We can generally use a smaller refinement range to generate the initial
quadtree and then prune again. Thus, instead of a spatial multiresolution
pyramid like the one used in HVSBM, we use the temporal pyramid. The smaller
refinement range gives rise to a more uniform motion field and can help in the
motion vector encoding.
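A minimal sketch of this initialization, assuming dense motion fields stored as
2-D lists of integer (dx, dy) tuples and a caller-supplied matching cost. The
function names and the cost interface are hypothetical, and quadtree generation
and pruning are not shown.

```python
def add_motion_fields(field_a, field_b):
    """Pixel-by-pixel vector sum of two dense motion fields of equal size.
    Each field is a 2-D list of (dx, dy) tuples."""
    return [[(ax + bx, ay + by)
             for (ax, ay), (bx, by) in zip(row_a, row_b)]
            for row_a, row_b in zip(field_a, field_b)]


def refine_vector(start, cost, search_range=1):
    """Search a small window around the starting vector and return the
    candidate with the lowest cost. `cost` is a hypothetical callable
    mapping a (dx, dy) vector to a matching error."""
    sx, sy = start
    candidates = [(sx + dx, sy + dy)
                  for dy in range(-search_range, search_range + 1)
                  for dx in range(-search_range, search_range + 1)]
    return min(candidates, key=cost)


if __name__ == "__main__":
    # Two motion fields from the previous temporal level (one row, two pixels).
    field_a = [[(1, 0), (2, -1)]]
    field_b = [[(1, -1), (1, -1)]]
    start = add_motion_fields(field_a, field_b)

    # Toy cost: squared distance to a made-up "true" vector.
    def toy_cost(v):
        return (v[0] - 3) ** 2 + (v[1] + 2) ** 2

    print(start, refine_vector(start[0][0], toy_cost, search_range=1))
```

Because the summed field is already close to the motion at the coarser temporal
level, a small `search_range` is usually enough, which is the trade-off the
paragraph above describes.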
The motion vector prediction residuals to be encoded can be computed by three
methods: differentials along the scanning order (Scan), spatial prediction from
neighboring blocks (Spatial), or temporal prediction from the MVs of the
previous frame (Temporal). For Bus and Mobile, temporal prediction works
better, but for Foreman and Football, spatial prediction works better.
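The three residual types can be compared with a short sketch. It assumes a
fixed grid of block MVs (not the variable-size quadtree used by HVSBM), a
component-wise median of the left, top, and top-right neighbors as the spatial
predictor, zero vectors for missing neighbors, and the sum of absolute residual
components as a rough cost proxy. All of these choices and names are
illustrative assumptions.

```python
def median3(a, b, c):
    """Median of three numbers."""
    return sorted((a, b, c))[1]


def scan_residuals(field):
    """Differences between consecutive MVs in raster-scan order."""
    prev, out = (0, 0), []
    for row in field:
        for mv in row:
            out.append((mv[0] - prev[0], mv[1] - prev[1]))
            prev = mv
    return out


def spatial_residuals(field):
    """Residuals against a component-wise median of three neighbours."""
    rows, cols = len(field), len(field[0])

    def get(r, c):
        # Neighbours outside the frame are treated as zero vectors.
        return field[r][c] if 0 <= r < rows and 0 <= c < cols else (0, 0)

    out = []
    for r in range(rows):
        for c in range(cols):
            left, top, top_right = get(r, c - 1), get(r - 1, c), get(r - 1, c + 1)
            pred = (median3(left[0], top[0], top_right[0]),
                    median3(left[1], top[1], top_right[1]))
            out.append((field[r][c][0] - pred[0], field[r][c][1] - pred[1]))
    return out


def temporal_residuals(field, prev_field):
    """Residuals against the co-located MV of the previous frame."""
    return [(mv[0] - pv[0], mv[1] - pv[1])
            for row, prev_row in zip(field, prev_field)
            for mv, pv in zip(row, prev_row)]


def abs_cost(residuals):
    """Sum of absolute residual components, a crude proxy for bit cost."""
    return sum(abs(dx) + abs(dy) for dx, dy in residuals)


if __name__ == "__main__":
    prev_field = [[(2, 1), (2, 1)], [(2, 1), (3, 1)]]
    field = [[(2, 1), (2, 1)], [(2, 1), (2, 1)]]
    print("Scan    ", abs_cost(scan_residuals(field)))
    print("Spatial ", abs_cost(spatial_residuals(field)))
    print("Temporal", abs_cost(temporal_residuals(field, prev_field)))
```

Running all three predictors per sequence and keeping the cheapest one is one
way to account for the sequence-dependent behavior noted above.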
In our old scheme, we used the adaptive arithmetic coding (AAC) described by
Witten et al. We used one probability model for all the motion vector symbols
in a given frame and updated it adaptively at the encoder and decoder. As the
number of symbols increases, this scheme faces the zero-frequency problem,
i.e., even the unused symbols must be assigned some initial probability. We can
replace this m-ary arithmetic coding with a context-based binary arithmetic
coding (CABAC) scheme similar to the one used for H.26L. CABAC proves more
successful in adapting to different motion models.
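The zero-frequency problem can be made concrete with a small sketch that
computes ideal adaptive code lengths instead of running a real arithmetic
coder. The initial count of 1 per symbol, the truncated-unary binarization, and
the one-model-per-bin-position contexts are assumptions chosen for
illustration. This is not the CABAC design of H.26L.

```python
import math


def mary_adaptive_bits(symbols, alphabet_size):
    """Ideal code length (bits) of an adaptive m-ary frequency model.
    Every symbol starts with count 1, so unused symbols still take
    probability mass away from the symbols that actually occur."""
    counts = [1] * alphabet_size
    total = alphabet_size
    bits = 0.0
    for s in symbols:
        bits += -math.log2(counts[s] / total)
        counts[s] += 1
        total += 1
    return bits


def binarized_adaptive_bits(symbols, max_bins):
    """Ideal code length (bits) when each symbol is sent as a truncated
    unary string and every bin position has its own adaptive binary
    model, a much-simplified stand-in for context-based binary coding."""
    counts = [[1, 1] for _ in range(max_bins)]
    bits = 0.0
    for s in symbols:
        # s ones followed by a terminating zero (omitted when s == max_bins).
        for pos in range(min(s + 1, max_bins)):
            b = 1 if pos < s else 0
            c = counts[pos]
            bits += -math.log2(c[b] / (c[0] + c[1]))
            c[b] += 1
    return bits


if __name__ == "__main__":
    symbols = [0, 1, 0, 0, 2, 0, 1, 0] * 4  # made-up MV residual indices
    for m in (4, 64, 256):
        print(m, round(mary_adaptive_bits(symbols, m), 1),
              round(binarized_adaptive_bits(symbols, m - 1), 1))
```

In this sketch the m-ary cost grows with the alphabet size even though the
symbol stream is unchanged, while the binarized cost depends only on the bins
that are actually visited.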
Related Publications: