Research Achievements

2014
Duan, Ling-Yu; *Ji R; CZ; HT; GW. Towards Mobile Document Image Retrieval for Digital Library. IEEE Transactions on Multimedia. 2014;16(2):346-359. SCI citations: 16.
With the proliferation of mobile devices, recent years have witnessed an emerging potential to integrate mobile visual search techniques into digital libraries. Such a mobile application scenario poses significant and unique challenges for document image search. Mobile photographs make it difficult to extract discriminative features from the landmark regions of documents, such as line drawings and text layouts. In addition, both search scalability and query delivery latency remain challenging issues in mobile document search: the former relies on an effective yet memory-light indexing structure to accomplish fast online search, while the latter imposes a bit-budget constraint on query images sent over the wireless link. In this paper, we propose a novel mobile document image retrieval framework consisting of a robust Local Inner-distance Shape Context (LISC) descriptor for line drawings, a Hamming-distance KD-tree for scalable and memory-light document indexing, and a JBIG2-based query compression scheme, together with Retinex-based enhancement and Otsu-based binarization, to reduce query delivery latency while maintaining query quality in terms of search performance. We have extensively validated the key techniques in this framework by quantitative comparison to alternative approaches.
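The query-preprocessing chain named above (Retinex enhancement, then binarization, then JBIG2 compression) is concrete enough to sketch. Below is a minimal illustration of the standard Otsu algorithm the abstract mentions, not the authors' implementation; the function names are mine.

```python
# Illustrative sketch: Otsu binarization to prepare a 1-bit document query
# image for JBIG2 compression (Retinex enhancement would run before this).
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the threshold maximizing between-class variance (gray: uint8)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                # class-0 probability up to t
    mu = np.cumsum(prob * np.arange(256))  # cumulative intensity mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu[-1] * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.argmax(np.nan_to_num(sigma_b)))

def binarize(gray: np.ndarray) -> np.ndarray:
    return (gray > otsu_threshold(gray)).astype(np.uint8)  # ready for JBIG2
```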
Zhang Xianguo (PhD student); *Huang Tiejun; Tian Yonghong; Gao Wen. Background-modeling-based adaptive prediction for surveillance video coding. IEEE Transactions on Image Processing [Internet]. 2014;23(2):769-784. SCI citations: 72.
The exponential growth of surveillance video presents an unprecedented challenge for high-efficiency surveillance video coding technology. Compared with the existing coding standards, which were basically developed for generic videos, surveillance video coding should be designed to make the best use of the special characteristics of surveillance videos (e.g., the relatively static background). To this end, this paper first conducts two analyses of how to improve the background and foreground prediction efficiencies in surveillance video coding. Following the analysis results, we propose a background-modeling-based adaptive prediction (BMAP) method. In this method, all blocks to be encoded are first classified into three categories. Then, according to the category of each block, two novel inter predictions are selectively utilized: the background reference prediction (BRP), which uses the background modeled from the original input frames as the long-term reference, and the background difference prediction (BDP), which predicts the current data in the background difference domain. For background blocks, BRP can effectively improve prediction efficiency by using the higher-quality background as the reference, whereas for foreground-background-hybrid blocks, BDP can provide a better reference after subtracting the background pixels. Experimental results show that BMAP can achieve at least twice the compression ratio of AVC (MPEG-4 Advanced Video Coding) high profile on surveillance videos, with only a slight increase in encoding complexity. Moreover, for foreground coding performance, which is crucial to the subjective quality of moving objects in surveillance videos, BMAP also obtains remarkable gains over several state-of-the-art methods.
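To make the block-category logic concrete, here is a hedged sketch of the classification idea the abstract describes; the thresholds, function names, and the exact BDP formulation are my assumptions, not the paper's.

```python
# Illustrative sketch only: label each block by its fraction of
# background-like pixels, then pick the prediction mode BMAP-style.
import numpy as np

def classify_block(block, bg_block, pix_thresh=20, ratio=0.9):
    """Return 'background', 'foreground', or 'hybrid' for one block.
    pix_thresh / ratio are assumed tuning constants, not the paper's values."""
    diff = np.abs(block.astype(int) - bg_block.astype(int))
    frac_bg = (diff < pix_thresh).mean()   # fraction of background-like pixels
    if frac_bg > ratio:
        return "background"                # -> BRP: reference modeled background
    if frac_bg < 1.0 - ratio:
        return "foreground"                # -> ordinary short-term inter prediction
    return "hybrid"                        # -> BDP: predict in difference domain

def bdp_residual(cur, cur_bg, ref, ref_bg):
    """BDP idea: subtract the background first, then predict the current
    background-difference signal from the reference's difference signal."""
    return (cur.astype(int) - cur_bg.astype(int)) - \
           (ref.astype(int) - ref_bg.astype(int))
```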
Huang T; Tian Y; Gao Wen. IEEE 1857: Boosting Video Applications in CPSS. IEEE Intelligent Systems. 2014;29(2).
Zhang, Xianguo; *Tian Y; HT; DS; GW. Optimizing the hierarchical prediction and coding in HEVC for surveillance and conference videos with background modeling. IEEE Transactions on Image Processing [Internet]. 2014;23(10):4511-4526. SCI citations: 51.
For real-time and low-delay video surveillance and teleconferencing applications, the new video coding standard HEVC achieves much higher coding efficiency than H.264/AVC. However, we argue that the hierarchical prediction structure in the HEVC low-delay encoder still does not fully utilize the special characteristics of surveillance and conference videos, which are usually captured by stationary cameras. In this case, the background picture (G-picture), modeled from the original input frames, can be used to further improve HEVC low-delay coding efficiency while reducing complexity. We therefore propose an optimization method for the hierarchical prediction and coding in HEVC for these videos based on background modeling. First, several experimental and theoretical analyses are conducted on how to utilize the G-picture to optimize the hierarchical prediction structure and hierarchical quantization. Following these results, we propose to encode the G-picture as the long-term reference frame to improve background prediction, and then present a G-picture-based bit-allocation algorithm to increase coding efficiency. Meanwhile, according to the proportions of background and foreground pixels in coding units (CUs), an adaptive speed-up algorithm is developed to classify each CU into different categories and then adopt different speed-up strategies to reduce encoding complexity. To evaluate the performance, extensive experiments are performed on the HEVC test model. Results show that our method can save 39.09% of bits and reduce encoding complexity by 43.63% on average for surveillance videos; the corresponding figures for conference videos are 5.27% and 43.68%.
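As a minimal illustration of the G-picture idea, the sketch below models a background frame by a temporal median over input frames, one common low-complexity choice; the paper's actual modeling algorithm is not specified here, so treat this as an assumption.

```python
# Hedged sketch: temporal-median background modeling for a G-picture.
import numpy as np

def model_g_picture(frames):
    """frames: list of (H, W) grayscale arrays from a stationary camera.
    The per-pixel median suppresses moving foreground, leaving background."""
    return np.median(np.stack(frames, axis=0), axis=0).astype(np.uint8)

# Usage idea per the abstract: intra-code the G-picture once, keep its
# reconstruction in the long-term reference list, and let later frames
# predict their static regions from it.
```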
Mou Luntian (PhD student); *Huang Tiejun; Tian Yonghong; Jiang Menglin; Gao Wen. Content-based copy detection through multimodal feature representation and temporal pyramid matching. ACM Transactions on Multimedia Computing, Communications, and Applications [Internet]. 2014;10(1). SCI citations: 15.
Content-based copy detection (CBCD) is drawing increasing attention as an alternative technology to watermarking for video identification and copyright protection. In this article, we present a comprehensive method to detect copies that are subjected to complicated transformations. A multimodal feature representation scheme is designed to exploit the complementarity of audio features and global and local visual features, so that optimal overall robustness to a wide range of complicated modifications can be achieved. Meanwhile, a temporal pyramid matching algorithm is proposed to assemble frame-level similarity search results into sequence-level matching results through similarity evaluation over multiple temporal granularities. Additionally, inverted indexing and locality-sensitive hashing (LSH) are adopted to speed up similarity search. Experimental results on the TRECVID 2010 and 2009 benchmark datasets demonstrate that the proposed method outperforms other methods for most transformations in terms of copy detection accuracy. The evaluation results also suggest that our method achieves competitive copy localization precision.
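The temporal-pyramid idea lends itself to a compact sketch: frame-level similarities are pooled at several temporal granularities and combined, so both coarse and fine alignment contribute to the sequence-level score. The weighting and pooling below are my simplification, not the paper's exact algorithm.

```python
# Hedged sketch of temporal-pyramid aggregation over frame-level scores.
import numpy as np

def temporal_pyramid_score(frame_sim, levels=3):
    """frame_sim: 1-D array; frame_sim[i] = best frame-level similarity of
    query frame i against the candidate reference window."""
    frame_sim = np.asarray(frame_sim, dtype=float)
    n, score = len(frame_sim), 0.0
    for lv in range(levels):                     # lv = 0 is coarsest (whole clip)
        bounds = np.linspace(0, n, 2 ** lv + 1).astype(int)
        seg_means = [frame_sim[a:b].mean()
                     for a, b in zip(bounds[:-1], bounds[1:]) if b > a]
        weight = 2.0 ** (lv - levels + 1)        # finer granularity, higher weight
        score += weight * float(np.mean(seg_means))
    return score
```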
Gao W; Tian Y; Huang T; Ma S; Zhang Xianguo. IEEE 1857 Standard Empowering Smart Video Surveillance Systems. IEEE Intelligent Systems. 2014;29(1).
IEEE 1857, Standard for Advanced Audio and Video Coding, was released as IEEE 1857-2013 in June 2013. Although the standard consists of several different groups, its most significant feature is the surveillance groups, which not only achieve at least twice the coding efficiency of H.264/AVC HP on surveillance videos, but also make it the most recognition-friendly video coding standard to date. This article presents an overview of the IEEE 1857 surveillance groups, highlighting the background-model-based coding technology and recognition-friendly functionalities. We believe that IEEE 1857-2013 will bring new opportunities and momentum to the research communities and industries working on smart video surveillance systems.
Huang Tiejun. The AVS2 Standard and Future Prospects. 电视技术 (Video Engineering) [Internet]. 2014;(22):7-10.
This paper gives an overview of the three parts (systems, video, and audio) of the AVS2 national standard, "Information Technology – High Efficiency Multimedia Coding"; compares the coding efficiency of AVS2 with AVS1, AVC/H.264, and HEVC/H.265; and explains the coding gains AVS2 derives from its frame structure, block structure, intra prediction, inter prediction, transforms, entropy coding, and in-loop filtering. It also introduces the concept, approach, and advantages of scene video coding, a distinctive feature of AVS2, and closes with an outlook on the AVS3 cloud media coding standard.
Huang Tiejun; Zheng Jin; Li Bo; Fu Huiyuan; Ma Huadong; Xue Xiangyang; Jiang Yugang; Yu Junqing. Multimedia Technology Research 2013: Visual Perception and Processing for Intelligent Video Surveillance. 中国图象图形学报 (Journal of Image and Graphics) [Internet]. 2014;(11):1539-1562.
Objective: As video surveillance technology matures and surveillance equipment becomes ubiquitous, surveillance applications are increasingly widespread and the volume of surveillance video is growing explosively, making it an important data object of the big-data era. However, the unstructured nature of video makes surveillance video relatively difficult to process and analyze. Faced with the massive video captured by large numbers of cameras, effectively transmitting, storing, analyzing, and recognizing these data according to their content and characteristics has become a pressing need. Method: Addressing large-scale visual perception and intelligent processing for intelligent video surveillance, this paper systematically reviews the 2013 state of the art in four main research directions: surveillance video coding, object detection and tracking, surveillance video enhancement, and motion and abnormal-behavior recognition, and discusses future trends. Results: China's newly finalized national standard AVS2, in terms of surveillance video coding effi…
2013
Li Jia (postdoc); *Tian Yonghong; Duan Lingyu; Huang Tiejun. Estimating Visual Saliency Through Single Image Optimization. IEEE Signal Processing Letters. 2013;20(9):845-848. SCI citations: 21.
This letter presents a novel approach for visual saliency estimation through single image optimization. Instead of directly mapping visual features to saliency values with a unified model, we treat regional saliency values as the optimization objective on each single image. By using a quadratic programming framework, our approach can adaptively optimize the regional saliency values on each specific image to simultaneously meet multiple saliency hypotheses on visual rarity, center-bias and mutual correlation. Experimental results show that our approach can outperform 14 state-of-the-art approaches on a public image benchmark.
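The letter's core idea admits a short sketch: treat regional saliency values as variables of a per-image quadratic program that balances a rarity cue, a center-bias cue, and smoothness over correlated regions. The cost weights, names, and solver below are my assumptions for illustration.

```python
# Hedged sketch: per-image quadratic optimization of regional saliency.
import numpy as np
from scipy.optimize import minimize

def optimize_saliency(rarity, center_bias, W, lam_center=0.5, lam_corr=0.1):
    """rarity, center_bias: length-n cue vectors in [0, 1] for n regions.
    W: (n, n) nonnegative region-affinity matrix (mutual correlation)."""
    L = np.diag(W.sum(axis=1)) - W            # graph Laplacian: smoothness term

    def cost(s):
        return (np.sum((s - rarity) ** 2)
                + lam_center * np.sum((s - center_bias) ** 2)
                + lam_corr * s @ L @ s)

    def grad(s):
        return (2 * (s - rarity) + 2 * lam_center * (s - center_bias)
                + 2 * lam_corr * (L @ s))

    res = minimize(cost, x0=rarity, jac=grad, method="L-BFGS-B",
                   bounds=[(0.0, 1.0)] * len(rarity))
    return res.x                              # optimized saliency per region
```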
Zhang Haibo; Huang Tiejun; Jiang Yan. Affective Semantic Analysis of Fabric Images Based on Color and Texture Features. 天津工业大学学报 (Journal of Tianjin Polytechnic University) [Internet]. 2013;(04):26-32.
Building on earlier research on the affective description of garment fabric images and the resulting three-dimensional affective factor space model, this paper analyzes the correspondence between the low-level color and texture features of fabric image samples (saturation, hue warmth, contrast, grayscale, gray-level matrix, average hue, etc.) and the three factors. The results show that the first factor can be characterized by a 7-dimensional feature (a 6-dimensional saturation/warm-cool fuzzy histogram plus 1-dimensional contrast); the second factor by a 257-dimensional feature (256-dimensional grayscale plus 1-dimensional color contrast); and the third factor by a 4-dimensional feature (3 gray-level matrix parameters plus 1-dimensional average hue). This lays the foundation for affective recognition and retrieval of fabric images.
Huang Tiejun; Zhang Xianguo; Gao Wen. The IEEE 1857 Standard: Supporting Efficient Compression and Recognition of Surveillance Video. 电子产品世界 (Electronic Engineering & Product World) [Internet]. 2013;(07):22-26+29.
AVS, the digital video codec standard developed chiefly by Chinese technical experts, was published as IEEE 1857 by the IEEE standardization committee on June 4, 2013. A distinctive part of the standard is AVS-S2, a surveillance profile whose compression performance is twice that of H.264 (also known as MPEG-4 AVC), the standard currently dominant in the video surveillance industry, and which additionally supports automatic extraction and representation of regions of interest at the bitstream level. This article describes the development process of AVS-S2, its key technologies, and how its compression efficiency compares with other standards.
Huang Tiejun; Gao Wen; Wang Guozhong. The Digital Audio/Video Coding Standard AVS: Development History and Application Prospects. 上海大学学报(自然科学版) (Journal of Shanghai University, Natural Science Edition) [Internet]. 2013;(03):221-224.
The digital audio/video coding standard AVS is a typical case of China's independent-innovation strategy. On the basis of a careful analysis of the domestic and international intellectual-property landscape in this field, the AVS national standard, a broadcasting industry standard, and an IEEE international standard were successively developed, strongly supporting the transformation of China's digital audio-visual industry from "large" to "strong". A patent pool of more than one hundred self-owned patents has been established, reversing the long-standing passive situation in which domestic enterprises struggled to develop under the high royalty fees of foreign standards. More than 20 chip companies have developed compliant chips, building a complete industrial chain that is "self-directed and fully open". Over a thousand television channels in more than 20 Chinese provinces and municipalities, as well as several other countries, broadcast using the AVS standard. China Central Television is currently deploying satellite broadcasting of high-definition stereoscopic programs based on this standard.
Xu Teng (master's student); Huang Tiejun; Tian Yonghong. A Survey of Pedestrian Detection for On-Board Vision Systems. 中国图象图形学报 (Journal of Image and Graphics) [Internet]. 2013;(04):359-367.
As an important research direction in computer vision and intelligent vehicles, pedestrian detection for on-board vision systems has attracted wide attention in recent years. This paper surveys the research since 2005 on the two most important stages of this technology: region-of-interest segmentation and object recognition. It first classifies typical region-of-interest segmentation methods by the information used for segmentation and compares their strengths and weaknesses, then summarizes progress in feature extraction, classifier construction, and search frameworks for pedestrian recognition, and finally discusses future directions.
Zhang Haibo; Huang Tiejun; Liu Ruipu. Affective Semantic Analysis of Men's Suit Images Based on Color Features. 东华大学学报(自然科学版) (Journal of Donghua University, Natural Science Edition) [Internet]. 2013;(02):185-190+195.
Building on earlier research on the affective description of men's suits and the resulting two-dimensional affective factor space model, this paper analyzes the color features (hue warmth, brightness, and contrast) of men's suit image samples. The results show that the first affective factor is well explained by a 10-dimensional brightness/warm-cool fuzzy histogram, while the second factor can be explained by combining a 7-dimensional saturation/warm-cool fuzzy histogram with image contrast. These results lay the groundwork for affective recognition and retrieval of men's suit images.
Duan, Ling-Yu; *Ji R; CJ; YH; HT; GW. Learning from mobile contexts to minimize the mobile location search latency. Signal Processing: Image Communication [Internet]. 2013;28(4):368-385. SCI citations: 4.
We propose to learn an extremely compact visual descriptor from mobile contexts for low bit rate mobile location search. Our scheme combines location-related side information from the mobile device to adaptively supervise the compact visual descriptor design in a flexible manner, which is well suited to searching locations or landmarks over a bandwidth-constrained wireless link. Along with the proposed compact descriptor learning, we also introduce PKUBench, a large-scale, context-aware mobile visual search benchmark, which serves as the first comprehensive benchmark for quantitatively evaluating how cheaply available mobile contexts can help mobile visual search systems. Our contextual-learning-based compact descriptor is shown to outperform existing works in terms of compression rate and retrieval effectiveness.
Tian, Yonghong; *Huang T; JM; GW. Video copy-detection and localization with a scalable cascading framework. IEEE Multimedia [Internet]. 2013;20(3):72-86.
For video copy detection, no single audio-visual feature, nor a single detector based on several features, works well for all transformations. This article proposes a novel video copy-detection and localization approach with scalable cascading of complementary detectors and multiscale sequence matching. In this cascade framework, a soft-threshold learning algorithm estimates the optimal decision thresholds for the detectors, and a multiscale sequence matching method precisely locates copies using a 2D Hough transform and multi-granularity similarity evaluation. Excellent performance on the TRECVID-CBCD 2011 benchmark dataset shows the effectiveness and efficiency of the proposed approach.
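A minimal sketch of the temporal side of such matching: under an assumed unit playback speed, all frame matches from a true copy share one temporal offset, so Hough-style voting over offsets both detects and localizes the copy. The binning and weighting below are my choices, not the paper's.

```python
# Hedged sketch: 1-D Hough voting over the temporal offset t_ref - t_query.
from collections import defaultdict

def locate_copy(matches, bin_size=1.0):
    """matches: iterable of (t_query, t_ref, similarity) frame-level hits."""
    votes = defaultdict(float)
    for tq, tr, sim in matches:
        votes[round((tr - tq) / bin_size)] += sim   # similarity-weighted vote
    best = max(votes, key=votes.get)                # dominant temporal offset
    span = [tr for tq, tr, _ in matches
            if round((tr - tq) / bin_size) == best]
    return best * bin_size, (min(span), max(span))  # offset, located segment
```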
Duan, Ling-Yu; Chen J; JR; HT; GW. Learning compact visual descriptors for low bit rate mobile landmark search. AI Magazine [Internet]. 2013;34(2):67-85.
Along with the ever-growing computational power of mobile devices, mobile visual search has undergone an evolution in techniques and applications. A significant trend is low bit rate visual search, where compact visual descriptors are extracted directly on the mobile device and delivered as queries, rather than raw images, to reduce query transmission latency. In this article, we introduce our work on low bit rate mobile landmark search, in which a compact yet discriminative landmark image descriptor is extracted by using location context such as GPS, crowd-sourced hotspot WLAN, and cell tower locations. The compactness originates from the bag-of-words image representation, with offline learning from geotagged photos from online photo-sharing websites including Flickr and Panoramio. The learning process involves segmenting the landmark photo collection into discrete geographical regions using a Gaussian mixture model and then boosting a ranking-sensitive vocabulary within each region, with entropy-based feedback on the compactness of the descriptor to refine both phases iteratively. In online search, when entering a geographical region, the codebook in a mobile device is adapted downstream to generate extremely compact descriptors with promising discriminative ability. We have deployed landmark search apps on both HTC and iPhone mobile phones, accessing million-scale image databases covering typical areas such as Beijing, New York, and Barcelona. Our descriptor outperforms alternative compact descriptors (Chen et al. 2009; Chen et al. 2010; Chandrasekhar et al. 2009a; Chandrasekhar et al. 2009b) by significant margins. Beyond landmark search, this article summarizes the MPEG standardization progress of compact descriptors for visual search (CDVS) (Yuri et al. 2010; Yuri et al. 2011) toward application interoperability.
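The geographical-segmentation step maps naturally to an off-the-shelf Gaussian mixture model; the sketch below (variable names are mine) shows how geotagged photos could be partitioned into regions, each of which would then get its own boosted compact vocabulary.

```python
# Hedged sketch: GMM partitioning of geotagged photos into regions.
import numpy as np
from sklearn.mixture import GaussianMixture

def segment_regions(latlon, n_regions=50):
    """latlon: (N, 2) array of photo GPS coordinates (lat, lon)."""
    gmm = GaussianMixture(n_components=n_regions, covariance_type="full",
                          random_state=0).fit(latlon)
    return gmm, gmm.predict(latlon)   # region model + per-photo region labels

# At query time, the phone's GPS fix picks a region (gmm.predict), and the
# matching per-region codebook is adapted down to the device.
```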
*Zhang, Xianguo; Huang T; TY; GM; MS; GW. Fast and Efficient Transcoding Based on Low-Complexity Background Modeling and Adaptive Block Classification. IEEE Transactions on Multimedia. 2013;15(8):1769-1785. SCI citations: 15.
There is an urgent need for fast and efficient transcoding methods that can substantially reduce the storage of surveillance videos and synchronously transmit conference videos over different bandwidths. To this end, the special characteristics of these videos, e.g., the relatively static background, should be exploited for transcoding. We therefore propose a fast and efficient transcoding method (FET) based on background modeling and block classification. To improve transcoding efficiency, FET adds the background picture, modeled at low complexity from the decoded frames, into the stream in the form of an intra-coded G-picture, and then uses the reconstructed G-picture as the long-term reference frame to transcode the following frames; our theoretical analyses show that the G-picture can significantly improve transcoding performance. To reduce complexity, FET uses an adaptive threshold-updating model for block classification and then adopts different transcoding strategies for the different categories. This is motivated by the following statistics: after blocks are divided into foreground, background, and hybrid categories, the categories exhibit different distributions of prediction modes, motion vectors, and reference frames. Extensive experiments on transcoding high-bit-rate H.264/AVC streams to low-bit-rate ones are carried out to evaluate FET. Compared with traditional full-decoding-and-full-encoding methods, FET saves more than 35% of the transcoding bit-rate with a speed-up ratio above 10 on surveillance videos. On conference videos, which must be transcoded more promptly, FET achieves a speed-up ratio of more than 20 with a 0.2 dB gain.
Tian, Yonghong; *Wang Y; Hu Z; Huang T. Selective eigenbackground for background modeling and subtraction in crowded scenes. IEEE Transactions on Circuits and Systems for Video Technology [Internet]. 2013;23(11):1849-1864. SCI citations: 27.
Background subtraction is a fundamental preprocessing step in many surveillance video analysis tasks. In spite of significant efforts, however, background subtraction in crowded scenes remains challenging, especially when a large number of foreground objects move slowly or just keep still. To address this problem, this paper proposes a selective eigenbackground method for background modeling and subtraction in crowded scenes. The contributions of our method are threefold. First, instead of training eigenbackgrounds on original video frames that may contain more or less foreground, a virtual-frame construction algorithm assembles clean background pixels from different original frames into virtual frames used as the training and update samples. This significantly improves the purity of the trained eigenbackgrounds. Second, for a crowded scene with diverse environmental conditions (e.g., illumination), it is difficult for a single eigenbackground model to handle all these variations, even with online update strategies. Thus, given several models trained offline, we use the peak signal-to-noise ratio to adaptively choose the optimal one to initialize the online eigenbackground model. Third, to tackle the problem that not all pixels obtain optimal results when reconstruction is performed at once for the whole frame, our method selects the best eigenbackground for each pixel, yielding an improved reconstructed background image. Extensive experiments on the TRECVID-SED dataset and the Road video dataset show that our method outperforms several state-of-the-art methods remarkably.
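For the eigenbackground machinery itself (the standard PCA part, not the paper's selective extensions), a condensed sketch looks like this; frames are flattened row vectors, and `k` and `thresh` are assumed parameters.

```python
# Hedged sketch: train eigenbackgrounds on clean "virtual frames", then
# subtract by projecting a new frame onto the background subspace.
import numpy as np

def train_eigenbackground(virtual_frames, k=10):
    """virtual_frames: (N, H*W) matrix of assembled clean-background frames."""
    mean = virtual_frames.mean(axis=0)
    _, _, vt = np.linalg.svd(virtual_frames - mean, full_matrices=False)
    return mean, vt[:k]                      # mean + top-k eigenbackgrounds

def foreground_mask(frame, mean, basis, thresh=30.0):
    """frame: flattened (H*W,) input. Reconstruct its background from the
    eigen-space and threshold the residual to get the foreground mask."""
    recon = mean + basis.T @ (basis @ (frame - mean))
    return np.abs(frame - recon) > thresh
```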
Huang Tiejun. AVS+: The Video Coding Standard for HD and 3D Television. 电视技术 (Video Engineering) [Internet]. 2013;(02):11-14.
This paper introduces the background and process of developing the AVS+ standard (GY/T 257—2012, "Advanced Audio and Video Coding for Radio and Television, Part 1: Video"), focusing on the new coding tools and features added in AVS+. It compares the performance of AVS+ with AVS and AVC/H.264 HP (High Profile), showing that AVS+ performs on par with AVC HP. With the support and promotion of several ministries, AVS+ will be applied in China's HD and stereoscopic (3D) television broadcasting.
