2014
Ji, Rongrong; *Duan L-Y; CJ; HT; GW.
Mining compact bag-of-patterns for low bit rate mobile visual search. IEEE Transactions on Image Processing [Internet]. 2014;23(7):3099-3113.
Abstract: Visual patterns, i.e., high-order combinations of visual words, contribute to a discriminative abstraction of the high-dimensional bag-of-words image representation. However, existing visual patterns are built upon the 2D photographic co-occurrences of visual words, which is ill-posed compared with their real-world 3D co-occurrences, since words from different objects or different depths might be incorrectly bound into an identical pattern. Moreover, designing compact descriptors from the mined patterns is left open. To address both issues, this paper proposes a novel compact bag-of-patterns (CBoP) descriptor with an application to low bit rate mobile landmark search. First, to overcome the ill-posed 2D photographic configuration, we build a 3D point cloud from the reference images of each landmark, so that more accurate pattern candidates can be extracted from the 3D co-occurrences of visual words. A novel gravity distance metric is then proposed to mine discriminative visual patterns. Second, we achieve compact image description with the CBoP descriptor, which is obtained by sparse coding over the mined visual patterns so as to maximally reconstruct the original bag-of-words histogram with a minimum coding length. We developed a low bit rate mobile landmark search prototype in which the CBoP descriptor is extracted directly on the mobile device and sent as the query to reduce delivery latency. CBoP performance is quantitatively evaluated on several large-scale benchmarks, with comparisons to state-of-the-art compact descriptors, topic features, and hashing descriptors. We report accuracy comparable to the bag-of-words histogram over a million-scale vocabulary, at a much higher compression rate (approximately 100 bits) than state-of-the-art bag-of-words compression schemes.
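For readers unfamiliar with the sparse-coding step described in this abstract, the following Python sketch illustrates the general idea of encoding a bag-of-words histogram over a dictionary of mined patterns; it uses toy data and hypothetical names (encode_cbop, patterns) and is not the authors' implementation.

```python
# Illustrative sketch only (toy data, hypothetical names), assuming a pattern
# dictionary is given: encode a bag-of-words histogram as a sparse, non-negative
# combination of mined visual patterns, then keep the active patterns as a short code.
import numpy as np
from sklearn.linear_model import Lasso

def encode_cbop(bow_hist, patterns, sparsity=0.01):
    """Sparse-code a BoW histogram over a (num_patterns x vocab_size) pattern matrix."""
    model = Lasso(alpha=sparsity, positive=True, max_iter=5000)
    model.fit(patterns.T, bow_hist)      # reconstruct bow_hist ~= patterns.T @ code
    code = model.coef_                   # one non-negative weight per pattern
    return (code > 0).astype(np.uint8)   # active-pattern indicator -> compact bits

rng = np.random.default_rng(0)
patterns = (rng.random((128, 10000)) < 0.001).astype(float)   # toy pattern dictionary
true_code = np.zeros(128); true_code[[3, 40, 77]] = 5.0       # three "true" patterns
bow_hist = patterns.T @ true_code + rng.poisson(0.02, 10000)  # toy BoW histogram
print(encode_cbop(bow_hist, patterns).nonzero()[0])           # mostly recovers 3, 40, 77
```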
*Gao, Wen; Huang T; RC; DW; CX.
IEEE standards for advanced audio and video coding in emerging applications. Computer [Internet]. 2014;47(5):81-83.
Abstract: The IEEE audio- and video-coding standards family includes updated tools that can be configured to serve new applications, such as surveillance, Internet, and intelligent systems video.
*Li, Jia; Tian Y; HT.
Visual saliency with statistical priors. International Journal of Computer Vision [Internet]. 2014;107(3):239-253.
Abstract: Visual saliency is a useful cue for locating conspicuous image content. To estimate saliency, many approaches have been proposed to detect unique or rare visual stimuli. However, such bottom-up solutions are often insufficient, since prior knowledge, which often indicates a biased selectivity over the input stimuli, is not taken into account. To solve this problem, this paper presents a novel approach that estimates image saliency by learning prior knowledge. In our approach, the influences of the visual stimuli and the prior knowledge are jointly incorporated into a Bayesian framework. In this framework, bottom-up saliency is calculated to pop out the visual subsets that are probably salient, while the prior knowledge is used to recover wrongly suppressed targets and inhibit improperly popped-out distractors. Compared with existing approaches, the prior knowledge used in our approach, including a foreground prior and a correlation prior, is statistically learned from 9.6 million images in an unsupervised manner. Experimental results on two public benchmarks show that such statistical priors are effective in modulating the bottom-up saliency, achieving impressive improvements over 10 state-of-the-art methods.
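The Bayesian combination of bottom-up saliency with a learned prior can be pictured with a minimal sketch; the multiplicative fusion and map names below are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (not the paper's implementation): fuse a bottom-up saliency map with
# a statistically learned prior in a Bayesian-style product, so that the prior can
# boost wrongly suppressed targets and damp improperly popped-out distractors.
import numpy as np

def fuse_saliency(bottom_up, prior, eps=1e-6):
    """Normalize both maps to [0, 1] and combine them multiplicatively (evidence x prior)."""
    bu = (bottom_up - bottom_up.min()) / (np.ptp(bottom_up) + eps)
    pr = (prior - prior.min()) / (np.ptp(prior) + eps)
    fused = bu * pr
    return fused / (fused.max() + eps)

h, w = 60, 80
yy, xx = np.mgrid[0:h, 0:w]
bottom_up = np.random.rand(h, w)                              # stand-in contrast-based map
center_prior = np.exp(-4 * (((yy - h / 2) / h) ** 2 + ((xx - w / 2) / w) ** 2))
print(fuse_saliency(bottom_up, center_prior).shape)           # (60, 80) saliency map
```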
Huang, Tiejun; Dong S; *TY.
Representing Visual Objects in HEVC Coding Loop. IEEE Journal on Emerging and Selected Topics in Circuits and Systems. 2014;4(1):5-16.
Abstract: Different from previous video coding standards, which employ fixed-size coding blocks (macroblocks), the latest High Efficiency Video Coding (HEVC) standard introduces a quadtree structure to represent variable-size coding blocks in the coding loop. The main objective of this study is to investigate a novel way to reuse these variable-size blocks to represent the foreground objects in the picture. Towards this end, this paper proposes three methods: flagging compression blocks (FCB), which flags the blocks lying in object regions; additional object tree (AOT), which adds an object tree in each Coding Tree Unit to describe the objects' shape within it; and confining by shape (CBS), which confines the block-splitting procedure to fit the object shape. Among them, FCB and CBS add a flag bit to the syntax of each block to indicate whether it lies in an object region, while AOT adds a separate quadtree to represent the objects. For all three methods, the additional bits are fed into the HEVC entropy coding module for compression. As such, the representation of visual objects in the pictures can be implemented in the HEVC coding loop by reusing the variable-size blocks and entropy coding, without additional coding tools. Experiments on six manually segmented HEVC test sequences (three in 1080p and three in 720p) demonstrate the feasibility and effectiveness of our proposal. To represent the objects in the 1080p test sequences, the BD-rate increases of FCB, AOT, and CBS over the HEVC anchor are 1.57%, 3.27%, and 5.93%, respectively; for the 720p conference videos, they are 4.57%, 17.23%, and 26.93% (note that the average bitrate of the anchor is only 1009 kb/s).
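The FCB variant can be illustrated with a small sketch that flags each fixed-size block overlapping a foreground mask; the helper name and fixed block grid below are hypothetical simplifications of HEVC's quadtree syntax.

```python
# Toy sketch of the FCB idea (hypothetical helper, far simpler than HEVC's quadtree
# syntax): attach a one-bit flag to each coding block that overlaps the object mask;
# in the codec these flags would be passed to the entropy coder with the block syntax.
import numpy as np

def flag_blocks(object_mask, block_size=64):
    """Return a per-block grid of flags: 1 if the block contains any object pixel."""
    h, w = object_mask.shape
    flags = np.zeros((h // block_size, w // block_size), dtype=np.uint8)
    for by in range(flags.shape[0]):
        for bx in range(flags.shape[1]):
            blk = object_mask[by * block_size:(by + 1) * block_size,
                              bx * block_size:(bx + 1) * block_size]
            flags[by, bx] = 1 if blk.any() else 0
    return flags

mask = np.zeros((128, 192), dtype=np.uint8)
mask[30:90, 40:110] = 1          # toy foreground object
print(flag_blocks(mask))         # [[1 1 0]
                                 #  [1 1 0]]
```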
Duan, Ling-Yu; Lin J; CJ; HT; GW.
Compact descriptors for visual search. IEEE Multimedia [Internet]. 2014;21(3):30-40.
Abstract: To ensure application interoperability in visual object search technologies, the MPEG working group has made great efforts to standardize visual search technologies. Moreover, extraction and transmission of compact descriptors are valuable for next-generation mobile visual search applications. This article reviews the significant progress of MPEG Compact Descriptors for Visual Search (CDVS) in standardizing technologies that will enable efficient and interoperable design of visual search applications. In addition, the article presents the location-search- and recognition-oriented data collection and benchmark under the MPEG CDVS evaluation framework.
Zhang, Xianguo (PhD student); *Huang, Tiejun; Tian, Yonghong; Gao, Wen.
Background-modeling-based adaptive prediction for surveillance video coding. IEEE Transactions on Image Processing [Internet]. 2014;23(2):769-784.
Abstract: The exponential growth of surveillance video presents an unprecedented challenge for high-efficiency surveillance video coding technology. Compared with the existing coding standards, which were basically developed for generic videos, surveillance video coding should be designed to make the best use of the special characteristics of surveillance videos (e.g., the relatively static background). To do so, this paper first conducts two analyses on how to improve the background and foreground prediction efficiencies in surveillance video coding. Following the analysis results, we propose a background-modeling-based adaptive prediction (BMAP) method. In this method, all blocks to be encoded are first classified into three categories. Then, according to the category of each block, two novel inter predictions are selectively utilized, namely the background reference prediction (BRP), which uses the background modeled from the original input frames as the long-term reference, and the background difference prediction (BDP), which predicts the current data in the background difference domain. For background blocks, BRP can effectively improve the prediction efficiency by using the higher-quality background as the reference, whereas for foreground-background-hybrid blocks, BDP can provide a better reference after subtracting the background pixels. Experimental results show that BMAP can achieve at least twice the compression ratio of AVC (MPEG-4 Advanced Video Coding) high profile on surveillance videos, with only slightly higher encoding complexity. Moreover, for foreground coding performance, which is crucial to the subjective quality of moving objects in surveillance videos, BMAP also obtains remarkable gains over several state-of-the-art methods.
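A toy version of the block-classification step might look like the sketch below; the thresholds and helper name are illustrative assumptions, not values from the paper.

```python
# Toy sketch of the block-classification step (thresholds and names are illustrative
# assumptions, not values from the paper): label each block by how much it deviates
# from the modeled background, then pick the matching prediction strategy.
import numpy as np

def classify_block(block, bg_block, pixel_thresh=2, ratio_lo=0.05, ratio_hi=0.95):
    diff = np.abs(block.astype(np.int16) - bg_block.astype(np.int16))
    changed = (diff > pixel_thresh).mean()   # fraction of pixels differing from background
    if changed < ratio_lo:
        return "background"                  # -> background reference prediction (BRP)
    if changed > ratio_hi:
        return "foreground"
    return "hybrid"                          # -> background difference prediction (BDP)

bg = np.full((16, 16), 120, dtype=np.uint8)
cur = bg.copy()
cur[4:12, 4:12] = 200                        # a moving object covers part of the block
print(classify_block(cur, bg))               # -> "hybrid"
```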
Gao, Wen; Tian Y; Huang T; Ma S; Zhang X.
IEEE 1857 Standard Empowering Smart Video Surveillance Systems. IEEE Intelligent Systems. 2014;29(1).
Abstract: IEEE 1857, Standard for Advanced Audio and Video Coding, was released as IEEE 1857-2013 in June 2013. Although the standard consists of several different groups, its most significant feature is the surveillance groups, which not only achieve at least twice the coding efficiency of H.264/AVC HP on surveillance videos, but also make it arguably the most recognition-friendly video coding standard to date. This article presents an overview of the IEEE 1857 surveillance groups, highlighting the background-model-based coding technology and the recognition-friendly functionalities. We believe that IEEE 1857-2013 will bring new opportunities and momentum to the research communities and industries working on smart video surveillance systems.
Huang, Tiejun; Tian Y; Gao W.
IEEE 1857: Boosting Video Applications in CPSS. IEEE Intelligent Systems. 2014;29(2).
2013
Li, Jia (postdoc); *Tian, Yonghong; Duan, Lingyu; Huang, Tiejun.
Estimating Visual Saliency Through Single Image Optimization. IEEE Signal Processing Letters. 2013;20(9):845-848.
Abstract: This letter presents a novel approach for visual saliency estimation through single image optimization. Instead of directly mapping visual features to saliency values with a unified model, we treat regional saliency values as the optimization objective on each single image. By using a quadratic programming framework, our approach can adaptively optimize the regional saliency values on each specific image to simultaneously meet multiple saliency hypotheses on visual rarity, center-bias, and mutual correlation. Experimental results show that our approach can outperform 14 state-of-the-art approaches on a public image benchmark.
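A minimal sketch of such a single-image quadratic optimization, with hypothetical energy terms and weights standing in for the letter's exact hypotheses, could look like this:

```python
# Minimal sketch (illustrative assumptions, not the letter's exact formulation):
# keep regional saliency close to a rarity score, respect a center-bias term, and
# keep strongly correlated regions consistent, subject to 0 <= s <= 1.
import numpy as np
from scipy.optimize import minimize

def optimize_region_saliency(rarity, center_bias, similarity, lam=0.5, mu=0.5):
    n = len(rarity)
    def energy(s):
        data = np.sum((s - rarity) ** 2)                                  # visual rarity
        center = mu * np.sum(center_bias * (1.0 - s) ** 2)                # center bias
        pair = lam * np.sum(similarity * (s[:, None] - s[None, :]) ** 2)  # mutual correlation
        return data + center + pair
    res = minimize(energy, x0=rarity, bounds=[(0.0, 1.0)] * n)
    return res.x

rarity = np.array([0.9, 0.2, 0.1, 0.7])
center_bias = np.array([0.8, 0.1, 0.0, 0.6])
similarity = np.array([[0.0, 0.1, 0.0, 0.8],
                       [0.1, 0.0, 0.3, 0.0],
                       [0.0, 0.3, 0.0, 0.0],
                       [0.8, 0.0, 0.0, 0.0]])
print(optimize_region_saliency(rarity, center_bias, similarity).round(2))
```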
姜延; 张海波; 黄铁军.
Emotional semantic analysis of fabric images based on color and texture features (基于颜色和纹理特征的面料图像情感语义分析). 天津工业大学学报 [Internet]. 2013;(04):26-32.
Abstract: Building on our earlier work on the emotional description of garment fabric images and the resulting three-dimensional emotional factor space model, we analyze the relationship between the low-level color and texture features of fabric image samples (saturation, hue warmth, contrast, grayscale map, gray-level matrix, average hue, etc.) and the three factors. The analysis shows that the first factor can be characterized by a 7-dimensional feature (a 6-dimensional saturation and warm-cool fuzzy histogram plus 1-dimensional contrast), the second factor by a 257-dimensional feature (a 256-dimensional grayscale map plus 1-dimensional color contrast), and the third factor by a 4-dimensional feature (3-dimensional gray-level matrix parameters plus 1-dimensional average hue), laying a foundation for emotional recognition and retrieval of fabric images.
高文; 黄铁军; 张贤国.
The IEEE 1857 standard for efficient surveillance video compression and recognition (支持监控视频高效压缩与识别的IEEE 1857标准). 电子产品世界 [Internet]. 2013;(07):22-26+29.
Abstract: The digital video coding standard AVS, developed primarily by Chinese technical experts, was published as the IEEE 1857 standard by the standards committee of the Institute of Electrical and Electronics Engineers (IEEE) on June 4, 2013. A distinctive part of the standard is its surveillance profile, AVS-S2, whose compression performance is twice that of the H.264 standard (also known as MPEG-4 AVC) currently dominant in the video surveillance industry, and which supports automatic extraction and representation of regions of interest at the bitstream level. This article describes the development process of AVS-S2, its key technologies, and a comparison of its compression efficiency with other standards.
王国中; 黄铁军; 高文.
Development history and application prospects of the AVS digital audio and video coding standard (数字音视频编解码技术标准AVS发展历程与应用前景). 上海大学学报(自然科学版) [Internet]. 2013;(03):221-224.
Abstract: The digital audio and video coding standard AVS is a representative case of implementing China's independent-innovation strategy. On the basis of a careful analysis of the domestic and international intellectual-property landscape in this field, the AVS national standard, a radio and television industry standard, and an IEEE international standard were formulated in succession, strongly supporting the transformation of China's digital audio-visual industry from "large" to "strong." A patent pool of more than one hundred independently held patents has been established, reversing the long-standing passive situation in which related Chinese enterprises, constrained by the high royalty fees of foreign standards, struggled to develop healthily. More than 20 chip companies have been driven to develop compliant chips, building a complete industry chain that is "self-directed yet fully open." Over a thousand television channels broadcast with the AVS standard in more than 20 Chinese provinces and municipalities as well as in several other countries. China Central Television is currently deploying the standard for the satellite broadcasting of high-definition stereoscopic programs.
田永鸿; 许腾 (master's student); 黄铁军.
A survey of pedestrian detection for vehicle-mounted vision systems (车载视觉系统中的行人检测技术综述). 中国图象图形学报 [Internet]. 2013;(04):359-367.
Abstract: As an important research direction in computer vision and intelligent vehicles, pedestrian detection for vehicle-mounted vision systems has received wide attention in recent years. This paper surveys progress since 2005 in the two most important stages of the technology: region-of-interest segmentation and object recognition. Typical region-of-interest segmentation methods are first categorized by the type of information they use, and their advantages and disadvantages are compared; advances in feature extraction, classifier construction, and search frameworks for pedestrian recognition are then summarized; finally, an outlook on future developments is given.
刘瑞璞; 张海波; 黄铁军.
Emotional semantic analysis of men's suit images based on color features (基于颜色特征的男西装图像情感语义分析). 东华大学学报(自然科学版) [Internet]. 2013;(02):185-190+195.
Abstract: Building on our earlier work on the emotional description of men's suits and the resulting two-dimensional emotional factor space model, we analyze the color features (hue warmth, color brightness, and contrast) of men's suit image samples. The analysis shows that the first emotional factor can be well explained by a 10-dimensional brightness and warm-cool fuzzy histogram, while the second emotional factor can be explained by a 7-dimensional saturation and warm-cool fuzzy histogram combined with image contrast. These results lay a foundation for subsequent emotional recognition and retrieval of men's suit images.
Duan, Ling-Yu; *Ji R; CJ; YH; HT; GW.
Learning from mobile contexts to minimize the mobile location search latency. Signal Processing: Image Communication [Internet]. 2013;28(4):368-385.
Abstract: We propose to learn an extremely compact visual descriptor from mobile contexts towards low bit rate mobile location search. Our scheme combines location-related side information from the mobile device to adaptively supervise the compact visual descriptor design in a flexible manner, which is well suited to searching locations or landmarks over a bandwidth-constrained wireless link. Along with the proposed compact descriptor learning, a large-scale, context-aware mobile visual search benchmark dataset, PKUBench, is also introduced, serving as the first comprehensive benchmark for quantitatively evaluating how cheaply available mobile contexts can help mobile visual search systems. Our proposed contextual-learning-based compact descriptor is shown to outperform existing works in terms of compression rate and retrieval effectiveness.
Tian, Yonghong; *Huang T; JM; GW.
Video copy-detection and localization with a scalable cascading framework. IEEE Multimedia [Internet]. 2013;20(3):72-86.
Abstract: For video copy detection, no single audio-visual feature, or single detector based on several features, can work well for all transformations. This article proposes a novel video copy-detection and localization approach with scalable cascading of complementary detectors and multiscale sequence matching. In this cascade framework, a soft-threshold learning algorithm is used to estimate the optimal decision thresholds for the detectors, and a multiscale sequence matching method is employed to precisely locate copies using a 2D Hough transform and multigranularity similarity evaluation. Excellent performance on the TRECVID-CBCD 2011 benchmark dataset shows the effectiveness and efficiency of the proposed approach.
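The Hough-style localization idea can be sketched as voting over the temporal offset between matched query and reference frames; the data layout below is a deliberate simplification, not the TRECVID system.

```python
# Simplified sketch of Hough-style temporal localization (hypothetical data layout,
# not the TRECVID system): frame-level matches vote on the temporal offset between
# query and reference, a 1D simplification of the paper's 2D Hough voting; the
# strongest offset locates the copied segment.
from collections import Counter

def locate_copy(frame_matches):
    """frame_matches: iterable of (query_frame_idx, reference_frame_idx) pairs."""
    votes = Counter(ref - qry for qry, ref in frame_matches)   # accumulate offset votes
    offset, support = votes.most_common(1)[0]
    aligned = [(q, r) for q, r in frame_matches if r - q == offset]
    start, end = min(q for q, _ in aligned), max(q for q, _ in aligned)
    return offset, (start, end), support

matches = [(0, 100), (1, 101), (2, 102), (3, 103), (5, 250), (9, 60)]
print(locate_copy(matches))   # -> (100, (0, 3), 4): offset 100 wins, copy spans query frames 0-3
```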
Duan, Ling-Yu; Chen J; JR; HT; GW.
Learning compact visual descriptors for low bit rate mobile landmark search. AI Magazine [Internet]. 2013;34(2):67-85.
Abstract: Along with the ever-growing computational power of mobile devices, mobile visual search has undergone an evolution in techniques and applications. A significant trend is low bit rate visual search, where compact visual descriptors are extracted directly on the mobile device and delivered as queries, rather than raw images, to reduce the query transmission latency. In this article, we introduce our work on low bit rate mobile landmark search, in which a compact yet discriminative landmark image descriptor is extracted by using location context such as GPS, crowd-sourced hotspot WLAN, and cell tower locations. The compactness originates from the bag-of-words image representation, with offline learning from geotagged photos on photo-sharing websites such as Flickr and Panoramio. The learning process involves segmenting the landmark photo collection into discrete geographical regions using a Gaussian mixture model and then boosting a ranking-sensitive vocabulary within each region, with entropy-based feedback on the compactness of the descriptor used to refine both phases iteratively. In online search, when a user enters a geographical region, the codebook on the mobile device is adapted downstream to generate extremely compact descriptors with promising discriminative ability. We have deployed landmark search apps on both HTC and iPhone mobile phones, accessing million-scale image databases of typical areas such as Beijing, New York, and Barcelona. Our descriptor outperforms alternative compact descriptors (Chen et al. 2009; Chen et al. 2010; Chandrasekhar et al. 2009a; Chandrasekhar et al. 2009b) by significant margins. Beyond landmark search, this article also summarizes the MPEG standardization progress of Compact Descriptors for Visual Search (CDVS) (Yuri et al. 2010; Yuri et al. 2011) toward application interoperability.
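The first offline step, segmenting geotagged photos into geographical regions with a Gaussian mixture model, can be sketched on toy GPS data as follows; the coordinates and the per-region vocabulary comment are illustrative assumptions.

```python
# Toy sketch (illustrative coordinates, not the deployed system) of the offline
# region segmentation step: cluster geotagged landmark photos into geographical
# regions with a Gaussian mixture model; each region would then boost its own
# compact, ranking-sensitive vocabulary.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
gps = np.vstack([rng.normal([39.91, 116.39], 0.01, (200, 2)),   # around central Beijing
                 rng.normal([40.72, -74.00], 0.01, (200, 2)),   # around lower Manhattan
                 rng.normal([41.40, 2.17], 0.01, (200, 2))])    # around central Barcelona

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(gps)
regions = gmm.predict(gps)             # each photo assigned to a geographical region
print(np.bincount(regions))            # photos per region -> one vocabulary per region
```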
黄铁军.
AVS+: a video coding standard for HD and 3D television (面向高清和3D电视的视频编解码标准AVS+). 电视技术 [Internet]. 2013;(02):11-14.
Abstract: This article introduces the background and process behind the AVS+ standard (GY/T 257-2012, Advanced Audio and Video Coding for Radio and Television, Part 1: Video), focusing on the coding tools and features newly added in AVS+. It also compares the performance of AVS+ with AVS and AVC/H.264 HP (High Profile), showing that AVS+ performs on par with AVC HP. With the support and promotion of several ministries, AVS+ will be used for HD and stereoscopic television broadcasting in China.
*Zhang, Xianguo; Huang T; TY; GM; MS; GW.
Fast and Efficient Transcoding Based on Low-Complexity Background Modeling and Adaptive Block Classification. IEEE Transactions on Multimedia. 2013;15(8):1769-1785.
Abstract: There is an urgent need for fast and efficient transcoding methods that can substantially reduce the storage of surveillance videos and simultaneously transmit conference videos over different bandwidths. Towards this end, the special characteristics of these videos, e.g., the relatively static background, should be exploited for transcoding. We therefore propose a fast and efficient transcoding method (FET) based on background modeling and block classification. To improve transcoding efficiency, FET adds the background picture, which is modeled from the originally decoded frames at low complexity, into the stream in the form of an intra-coded G-picture. FET then uses the reconstructed G-picture as the long-term reference frame to transcode the following frames, mainly because our theoretical analyses show that the G-picture can significantly improve the transcoding performance. To reduce complexity, FET uses an adaptive threshold updating model for block classification and then adopts different transcoding strategies for the different categories. This is motivated by the following statistics: after dividing blocks into foreground, background, and hybrid categories, the categories show different distributions of prediction modes, motion vectors, and reference frames. Extensive experiments on transcoding high-bit-rate H.264/AVC streams to low-bit-rate ones are carried out to evaluate FET. Compared with traditional full-decoding-and-full-encoding methods, FET saves more than 35% of the transcoding bit rate with a speed-up ratio larger than 10 on surveillance videos. On conference videos, which must be transcoded with low delay, FET achieves a speed-up ratio of more than 20 with a 0.2 dB gain.
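The low-complexity background modeling that produces the G-picture can be sketched as a simple running average over decoded frames; the learning rate and helper name are illustrative assumptions, not the FET implementation.

```python
# Simplified sketch (hypothetical parameters, not the FET implementation) of the
# low-complexity background modeling step: build a background picture from decoded
# frames; the transcoder would insert it as an intra-coded G-picture and use it as
# the long-term reference.
import numpy as np

def model_background(frames, learn_rate=0.05):
    """Running-average background over a sequence of decoded grayscale frames."""
    bg = frames[0].astype(np.float32)
    for f in frames[1:]:
        bg = (1.0 - learn_rate) * bg + learn_rate * f.astype(np.float32)
    return bg.astype(np.uint8)          # candidate G-picture

frames = [np.full((8, 8), 100, dtype=np.uint8) for _ in range(30)]
frames[10][2:5, 2:5] = 220              # a transient foreground object in one frame
print(model_background(frames)[2, 2])   # stays near 100: the transient object is averaged out
```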
Huang YT; *YW; ZH; T.
Selective eigenbackground for background modeling and subtraction in crowded scenes. IEEE Transactions on Circuits and Systems for Video Technology [Internet]. 2013;23(11):1849-1864.
访问链接AbstractBackground subtraction is a fundamental preprocessing step in many surveillance video analysis tasks. In spite of significant efforts, however, background subtraction in crowded scenes remains challenging, especially, when a large number of foreground objects move slowly or just keep still. To address the problem, this paper proposes a selective eigenbackground method for background modeling and subtraction in crowded scenes. The contributions of our method are three-fold: First, instead of training eigenbackgrounds using the original video frames that may contain more or less foregrounds, a virtual frame construction algorithm is utilized to assemble clean background pixels from different original frames so as to construct some virtual frames as the training and update samples. This can significantly improve the purity of the trained eigenbackgrounds. Second, for a crowded scene with diversified environmental conditions (e.g., illuminations), it is difficult to use only one eigenbackground model to deal with all these variations, even using some online update strategies. Thus given several models trained offline, we utilize peak signal-to-noise ratio to adaptively choose the optimal one to initialize the online eigenbackground model. Third, to tackle the problem that not all pixels can obtain the optimal results when the reconstruction is performed at once for the whole frame, our method selects the best eigenbackground for each pixel to obtain an improved quality of the reconstructed background image. Extensive experiments on the TRECVID-SED dataset and the Road video dataset show that our method outperforms several state-of-the-art methods remarkably.