Publications

2025
Quan Y, Wan X, Tang Z, Liang J, Ji H. Multi-Focus Image Fusion via Explicit Defocus Blur Modelling, in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI); 2025.
Multi-focus image fusion (MFIF) is a critical technique for enhancing depth of field in photography, producing an all-in-focus image from multiple images captured at different focal lengths. While deep learning has shown promise in MFIF, most existing methods ignore the physical model of defocus blurring in their neural architecture design, limiting their interpretability and generalization. This paper presents a novel framework that integrates explicit defocus blur modeling into the MFIF process, leading to enhanced interpretability and performance. Leveraging an atom-based spatially-varying parameterized defocus blurring model, our approach first calculates pixel-wise defocus descriptors and initial focused images from multi-focus source images in a scale-recurrent fashion, based on which soft decision maps are estimated. Afterward, image fusion is performed using masks constructed from the decision maps, with separate treatment of pixels that are likely defocused in all source images or near the boundaries of defocused/focused regions. Model training is done with a fusion loss and a cross-scale defocus estimation loss. Extensive experiments on benchmark datasets demonstrate the effectiveness of our approach.
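As a toy illustration of the mask-based fusion step described above, the sketch below fuses two source images with a soft decision map; the decision-map estimation, the defocus descriptors, and the special handling of always-defocused and boundary pixels from the paper are not reproduced, and the placeholder map is an assumption.

```python
# Minimal sketch of soft-mask image fusion; the decision map here is a random
# placeholder standing in for the estimated soft decision maps from the paper.
import numpy as np

near = np.random.rand(256, 256, 3)        # source image focused on the near plane
far = np.random.rand(256, 256, 3)         # source image focused on the far plane
decision = np.random.rand(256, 256)       # soft decision map in [0, 1] (placeholder)

mask = decision[..., None]                # per-pixel weight for the near-focused source
fused = mask * near + (1.0 - mask) * far  # soft mask-based fusion

# Pixels with ambiguous decisions (near 0.5) would receive the separate treatment
# mentioned in the abstract; here they are only flagged.
ambiguous = np.abs(decision - 0.5) < 0.1
```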
Huang Y, Liao X, Liang J, Shi B, Xu Y, Le Callet P. Detail-Preserving Diffusion Models for Low-Light Image Enhancement. IEEE Transactions on Circuits and Systems for Video Technology. 2025;35:3396–3409.
Existing diffusion models for low-light image enhancement typically incrementally remove noise introduced during the forward diffusion process using a denoising loss, with the process being conditioned on input low-light images. While these models demonstrate remarkable abilities in generating realistic high-frequency details, they often struggle to restore fine details that are faithful to the input. To address this, we present a novel detail-preserving diffusion model for realistic and faithful low-light image enhancement. Our approach integrates a size-agnostic diffusion process with a reverse process reconstruction loss, significantly enhancing the fidelity of enhanced images to their low-light counterparts and enabling more accurate recovery of fine details. To ensure the preservation of region- and content-aware details, we employ an efficient noise estimation network with a simplified channel-spatial attention mechanism. Additionally, we propose a multiscale ensemble scheme to maintain detail fidelity across diverse illumination regions. Comprehensive experiments on eight benchmark datasets demonstrate that our method achieves state-of-the-art results compared to over twenty existing methods in terms of both perceptual quality (LPIPS) and distortion metrics (PSNR and SSIM). The code is available at: https://github.com/CSYanH/DePDiff.
Huang Y, Liao X, Liang J, Quan Y, Shi B, Xu Y. Zero-Shot Low-Light Image Enhancement via Latent Diffusion Models, in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI); 2025.
Low-light image enhancement (LLIE) aims to improve visibility and signal-to-noise ratio in images captured under poor lighting conditions. Despite impressive improvements, deep learning-based LLIE approaches require extensive training data, which is often difficult and costly to obtain. In this paper, we propose a zero-shot LLIE framework that, for the first time, leverages pre-trained latent diffusion models, which act as powerful priors to recover latent images from low-light inputs. Our approach introduces several components to alleviate the inherent challenges of utilizing pre-trained latent diffusion models: it models the degradation process in an image-adaptive manner, penalizes latents that fall outside the manifold of natural images, and balances the strength of the guidance from the given low-light image during the denoising process. Experimental results demonstrate that our framework outperforms existing methods, achieving superior performance across various datasets.
2024
Lou H, Liang J, Teng M, Fan B, Xu Y, Shi B. Zero-Shot Event-Intensity Asymmetric Stereo via Visual Prompting from Image Domain, in Advances in Neural Information Processing Systems (NeurIPS). Vol 37; 2024:13274–13301.
Yu B, Liang J, Wang Z, Fan B, Subpa-asa A, Shi B, Sato I. Active Hyperspectral Imaging Using an Event Camera, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024.
Hyperspectral imaging plays a critical role in numerous scientific and industrial fields. Conventional hyperspectral imaging systems often struggle with the trade-off between spectral and temporal resolution, particularly in dynamic environments. In our work, we present an innovative event-based active hyperspectral imaging system designed for real-time performance in dynamic scenes. By integrating a diffraction grating and rotating mirror with an event-based camera, the proposed system captures high-fidelity spectral information at a microsecond temporal resolution, leveraging the event camera's unique capability to detect instantaneous changes in brightness rather than absolute intensity. Compared with conventional frame-based systems, the proposed system reduces the bandwidth and computational load, and compared with mosaic-based systems, it retains the original sensor's spatial resolution. It records only meaningful changes in brightness, achieving high temporal and spectral resolution with minimal latency, and is practical for real-time applications in complex dynamic conditions.
Yu B, Ren J, Han J, Wang F, Liang J, Shi B. EventPS: Real-Time Photometric Stereo Using an Event Camera, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024:9602–9611.
Photometric stereo is a well-established technique to estimate the surface normal of an object. However, the requirement of capturing multiple high dynamic range images under different illumination conditions limits its speed and real-time applications. This paper introduces EventPS, a novel approach to real-time photometric stereo using an event camera. Capitalizing on the exceptional temporal resolution, dynamic range, and low bandwidth characteristics of event cameras, EventPS estimates surface normals only from radiance changes, significantly enhancing data efficiency. EventPS seamlessly integrates with both optimization-based and deep-learning-based photometric stereo techniques to offer a robust solution for non-Lambertian surfaces. Extensive experiments validate the effectiveness and efficiency of EventPS compared to frame-based counterparts. Our algorithm runs at over 30 fps in real-world scenarios, unleashing the potential of EventPS in time-sensitive and high-speed downstream applications.
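For context, the classical Lambertian photometric stereo problem that EventPS builds on reduces to a per-pixel least-squares solve; the sketch below shows only that baseline formulation, not the event-based radiance-change variant or the non-Lambertian extensions from the paper, and the synthetic data are assumptions.

```python
# Minimal sketch of classical Lambertian photometric stereo via least squares.
# Synthetic data; light directions and albedo are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_true = np.array([0.2, -0.3, 0.93])
n_true /= np.linalg.norm(n_true)               # ground-truth unit normal
L = rng.standard_normal((12, 3))
L /= np.linalg.norm(L, axis=1, keepdims=True)  # 12 calibrated light directions
albedo = 0.7
I = albedo * np.clip(L @ n_true, 0.0, None)    # observed intensities per light

g, *_ = np.linalg.lstsq(L, I, rcond=None)      # solve L g = I, with g = albedo * n
normal = g / np.linalg.norm(g)                 # recovered unit surface normal
albedo_est = np.linalg.norm(g)                 # recovered albedo
```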
Zhong H, Hong Y, Weng S, Liang J, Shi B. Language-Guided Image Reflection Separation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024:24913–24922.
This paper studies the problem of language-guided reflection separation which aims at addressing the ill-posed reflection separation problem by introducing language descriptions to provide layer content. We propose a unified framework to solve this problem which leverages the cross-attention mechanism with contrastive learning strategies to construct the correspondence between language descriptions and image layers. A gated network design and a randomized training strategy are employed to tackle the recognizable layer ambiguity. The effectiveness of the proposed method is validated by the significant performance advantage over existing reflection separation methods on both quantitative and qualitative comparisons.
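As an illustration of the generic cross-attention mechanism the abstract refers to, the sketch below attends from language-description tokens to image-layer features; the gated network design, contrastive losses, and randomized training strategy from the paper are not reproduced, and all dimensions are assumptions.

```python
# Minimal sketch of text-to-image-layer cross-attention using a stock PyTorch module.
import torch
import torch.nn as nn

d = 256
attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

text_tokens = torch.randn(1, 16, d)        # language description features (queries)
layer_tokens = torch.randn(1, 64 * 64, d)  # flattened image-layer features (keys/values)

guided, weights = attn(query=text_tokens, key=layer_tokens, value=layer_tokens)
# `guided` carries language-conditioned layer features; `weights` (1, 16, 4096) gives a
# soft correspondence map between description tokens and spatial locations.
```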
Yang Y, Liang J, Yu B, Chen Y, Ren JS, Shi B. Latency Correction for Event-guided Deblurring and Frame Interpolation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024:24977–24986.
Event cameras, with their high temporal resolution, high dynamic range, and low power consumption, are particularly well suited to time-sensitive applications like deblurring and frame interpolation. However, their performance is hindered by latency variability, especially under low-light conditions and with fast-moving objects. This paper addresses the challenge of latency in event cameras: the temporal discrepancy between the actual occurrence of a brightness change and the corresponding timestamp assigned by the sensor. Focusing on event-guided deblurring and frame interpolation tasks, we propose a latency correction method based on a parameterized latency model. To enable data-driven learning, we develop an event-based temporal fidelity measure that describes the sharpness of latent images reconstructed from events and the corresponding blurry images, and reformulate the event-based double integral model to be differentiable with respect to latency. The proposed method is validated using synthetic and real-world datasets, demonstrating the benefits of latency correction for deblurring and interpolation across different lighting conditions.
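For reference, a hedged sketch of the standard event-based double integral (EDI) relation the abstract builds on, relating a blurry frame $B$ over exposure $T$, a latent sharp image $L(f)$ at reference time $f$, and the event stream $e(s)$ with contrast threshold $c$; the paper's parameterized latency model and its differentiable reformulation are not spelled out here.

```latex
% Standard EDI model (notation assumed); the paper's contribution is making such a
% model differentiable with respect to a parameterized event latency.
\[
  B \;=\; \frac{1}{T}\int_{f-T/2}^{\,f+T/2} L(t)\,\mathrm{d}t,
  \qquad
  L(t) \;=\; L(f)\,\exp\!\Big(c\int_{f}^{t} e(s)\,\mathrm{d}s\Big).
\]
```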
Hong Y, Zhong H, Weng S, Liang J, Shi B. L-DiffER: Single Image Reflection Removal with Language-based Diffusion Model, in Proceedings of the European Conference on Computer Vision (ECCV); 2024.
In this paper, we introduce L-DiffER, a language-based diffusion model designed for the ill-posed single image reflection removal task. Although having shown impressive performance for image generation, existing language-based diffusion models struggle with precise control and faithfulness in image restoration. To overcome these limitations, we propose an iterative condition refinement strategy to resolve the problem of inaccurate control conditions. A multi-condition constraint mechanism is employed to ensure the recovery faithfulness of image color and structure while retaining the generation capability to handle low-transmitted reflections. We demonstrate the superiority of the proposed method through extensive experiments, showcasing both quantitative and qualitative improvements over existing methods.
Hong Y, Chang Y, Liang J, Ma L, Huang T, Shi B. Light Flickering Guided Reflection Removal. International Journal of Computer Vision (IJCV). 2024.
When photographing through a piece of glass, reflections usually degrade the quality of captured images or videos. In this paper, by exploiting periodically varying light flickering, we investigate the problem of removing strong reflections from contaminated image sequences or videos with a unified capturing setup. We propose a learning-based method that utilizes short-term and long-term observations of mixture videos to exploit one-side contextual clues in fluctuant components and brightness-consistent clues in consistent components for achieving layer separation and flickering removal, respectively. A dataset containing synthetic and real mixture videos with light flickering is built for network training and testing. The effectiveness of the proposed method is demonstrated by the comprehensive evaluation on synthetic and real data, the application for video flickering removal, and the exploratory experiment on high-speed scenes.
2023
Zhou C, Teng M, Han J, Liang J, Xu C, Cao G, Shi B. Deblurring Low-Light Images with Events. International Journal of Computer Vision (IJCV). 2023;131:1284–1298.
Modern image-based deblurring methods usually show degenerate performance in low-light conditions, since such images mostly consist of poorly visible dark regions and a few saturated bright regions, limiting the amount of effective features that can be extracted for deblurring. In contrast, event cameras can trigger events with a very high dynamic range and low latency, which hardly suffer from saturation and naturally encode dense temporal information about motion. However, in low-light conditions, existing event-based deblurring methods become less robust, since the events triggered in dark regions are often severely contaminated by noise, leading to inaccurate reconstruction of the corresponding intensity values. Besides, since they directly adopt the event-based double integral model to perform pixel-wise reconstruction, they can only handle the low-resolution grayscale active pixel sensor images provided by the DAVIS camera, which cannot meet the requirements of daily photography. In this paper, to apply events to deblurring low-light images robustly, we propose a unified two-stage framework along with a motion-aware neural network tailored to it, reconstructing the sharp image under the guidance of high-fidelity motion clues extracted from events. Moreover, we build an RGB-DAVIS hybrid camera system to demonstrate that our method is able to deblur high-resolution RGB images, thanks to the natural advantages of our two-stage framework. Experimental results show that our method achieves state-of-the-art performance on both synthetic and real-world images.
Lv J, Guo H, Chen G, Liang J, Shi B. Non-Lambertian Multispectral Photometric Stereo via Spectral Reflectance Decomposition, in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI), Macau SAR, China; 2023:1249–1257.
Multispectral photometric stereo (MPS) aims at recovering the surface normal of a scene from a single-shot multispectral image captured under multispectral illuminations. Existing MPS methods adopt the Lambertian reflectance model to make the problem tractable, but it greatly limits their application to real-world surfaces. In this paper, we propose a deep neural network named NeuralMPS to solve the MPS problem under non-Lambertian spectral reflectances. Specifically, we present a spectral reflectance decomposition model to disentangle the spectral reflectance into a geometric component and a spectral component. With this decomposition, we show that the MPS problem for surfaces with a uniform material is equivalent to the conventional photometric stereo (CPS) with unknown light intensities. In this way, NeuralMPS reduces the difficulty of the non-Lambertian MPS problem by leveraging the well-studied non-Lambertian CPS methods. Experiments on both synthetic and real-world scenes demonstrate the effectiveness of our method.
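A hedged sketch of the kind of factorization the abstract describes, with notation assumed rather than taken from the paper: if the per-band reflectance splits into a spectral factor $s_j$ and a shared geometric shading term $g$, then each band behaves like conventional photometric stereo with an unknown effective light intensity.

```latex
% Assumed notation: I_j is the measurement under the j-th spectral illumination with
% intensity e_j and direction l_j, n(x) is the surface normal, and s_j g(n, l_j) is the
% decomposed spectral reflectance for a uniform material.
\[
  I_j(x) \;=\; e_j\, R_j\big(\mathbf{n}(x), \mathbf{l}_j\big),
  \qquad
  R_j \;=\; s_j\, g\big(\mathbf{n}(x), \mathbf{l}_j\big)
  \;\;\Longrightarrow\;\;
  I_j(x) \;=\; \underbrace{(e_j s_j)}_{\text{unknown intensity}}\,
  g\big(\mathbf{n}(x), \mathbf{l}_j\big),
\]
which has the form of conventional photometric stereo with unknown per-light intensities.
```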
Yang Y, Han J, Liang J, Sato I, Shi B. Learning Event Guided High Dynamic Range Video Reconstruction, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023:13924–13934.
Limited by the trade-off between frame rate and exposure time when capturing moving scenes with conventional cameras, frame-based HDR video reconstruction suffers from scene-dependent exposure ratio balancing and ghosting artifacts. Event cameras provide an alternative visual representation with a much higher dynamic range and temporal resolution, free from the above issues, which can serve as effective guidance for HDR imaging from LDR videos. In this paper, we propose a multimodal learning framework for event guided HDR video reconstruction. To better leverage the knowledge of the same scene from the two modalities of visual signals, we propose a multimodal representation alignment strategy that learns a shared latent space, together with a fusion module tailored to complementing the two types of signals across different dynamic ranges in different regions. Temporal correlations are utilized recurrently to suppress the flickering effects in the reconstructed HDR video. The proposed HDRev-Net demonstrates state-of-the-art performance quantitatively and qualitatively on both synthetic and real-world data.
Liang J, Yang Y, Li B, Duan P, Xu Y, Shi B. Coherent Event Guided Low-Light Video Enhancement, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2023:10615–10625.
With frame-based cameras, capturing fast-moving scenes without suffering from blur often comes at the cost of low SNR and low contrast. Worse still, the photometric constancy that enhancement techniques heavily relied on is fragile for frames with short exposure. Event cameras can record brightness changes at an extremely high temporal resolution. For low-light videos, event data are not only suitable to help capture temporal correspondences but also provide alternative observations in the form of intensity ratios between consecutive frames and exposure-invariant information. Motivated by this, we propose a low-light video enhancement method with hybrid inputs of events and frames. Specifically, a neural network is trained to establish spatiotemporal coherence between visual signals with different modalities and resolutions by constructing correlation volume across space and time. Experimental results on synthetic and real data demonstrate the superiority of the proposed method compared to the state-of-the-art methods.
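As an illustration of the correlation-volume construct mentioned in the abstract, the sketch below correlates frame features with event features at a single time step; the actual resolutions, normalization, and cross-time construction in the paper are not reproduced, and all shapes are assumptions.

```python
# Minimal sketch of an all-pairs spatial correlation volume between two feature maps.
import torch

frame_feat = torch.randn(1, 128, 32, 32)   # features extracted from a low-light frame
event_feat = torch.randn(1, 128, 32, 32)   # features extracted from an event representation

b, c, h, w = frame_feat.shape
f1 = frame_feat.flatten(2)                 # (b, c, h*w)
f2 = event_feat.flatten(2)                 # (b, c, h*w)
corr = torch.einsum('bci,bcj->bij', f1, f2) / c ** 0.5   # (b, h*w, h*w) correlation volume
```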
2022
Song Y, Wang J, Ma L, Yu J, Liang J, Yuan L, Yu Z. MARN: Multi-level Attentional Reconstruction Networks for Weakly Supervised Video Temporal Grounding. Neurocomputing. 2022;554:126625.
Video temporal grounding is a challenging task in computer vision that involves localizing a video segment semantically related to a given query from a set of videos and queries. In this paper, we propose a novel weakly-supervised model called the Multi-level Attentional Reconstruction Networks (MARN), which is trained on video-sentence pairs. During the training phase, we leverage the idea of attentional reconstruction to train an attention map that can reconstruct the given query. At inference time, proposals are ranked based on attention scores to localize the most suitable segment. In contrast to previous methods, MARN effectively aligns video-level supervision and proposal scoring, thereby reducing the training-inference discrepancy. In addition, we incorporate a multi-level framework that encompasses both proposal-level and clip-level processes. The proposal-level process generates and scores variable-length time sequences, while the clip-level process generates and scores fixed-length time sequences to refine the predicted proposal scores in both training and testing. To improve the feature representation of the video, we propose a novel representation mechanism that utilizes intra-proposal information and adopts 2D convolution to extract inter-proposal clues for learning reliable attention maps. By accurately representing these proposals, we can better align them with the textual modalities and thus facilitate the learning of the model. Our proposed MARN is evaluated on two benchmark datasets, and extensive experiments demonstrate its superiority over existing methods.
Liang J, Xu Y, Quan Y, Shi B, Ji H. Self-Supervised Low-Light Image Enhancement Using Discrepant Untrained Network Priors. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). 2022;32:7332–7345.
This paper proposes a deep learning method for low-light image enhancement, which exploits the generation capability of Neural Networks (NNs) while requiring no training samples except the input image itself. Based on the Retinex decomposition model, the reflectance and illumination of a low-light image are parameterized by two untrained NNs. The ambiguity between the two layers is resolved by the discrepancy between the two NNs in terms of architecture and capacity, while the complex noise with spatially-varying characteristics is handled by an illumination-adaptive self-supervised denoising module. The enhancement is done by jointly optimizing the Retinex decomposition and the illumination adjustment. Extensive experiments show that the proposed method not only outperforms existing non-learning-based and unsupervised-learning-based methods, but also competes favorably with some supervised-learning-based methods in extreme low-light conditions.
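As a rough illustration of fitting untrained networks to a single low-light image, the sketch below optimizes two small CNNs of different capacity to decompose the input in a Retinex-like manner; the architectures, the illumination-adaptive denoising module, the illumination adjustment, and all loss weights here are illustrative assumptions, not the paper's design.

```python
# Minimal sketch of Retinex-style decomposition with two untrained networks of
# different capacity, fit only to the single input image (deep-image-prior style).
# Network sizes, loss weights, and the final adjustment are illustrative assumptions.
import torch
import torch.nn as nn

def conv_net(out_channels, width, depth):
    layers, c_in = [], 3
    for _ in range(depth):
        layers += [nn.Conv2d(c_in, width, 3, padding=1), nn.ReLU()]
        c_in = width
    layers += [nn.Conv2d(c_in, out_channels, 3, padding=1), nn.Sigmoid()]
    return nn.Sequential(*layers)

low = torch.rand(1, 3, 64, 64)                      # stand-in for a low-light input image
reflectance_net = conv_net(3, width=64, depth=5)    # higher capacity -> detailed reflectance
illumination_net = conv_net(1, width=16, depth=2)   # lower capacity  -> smooth illumination

opt = torch.optim.Adam(
    list(reflectance_net.parameters()) + list(illumination_net.parameters()), lr=1e-3)

for step in range(200):                             # self-supervised fitting, no training data
    R = reflectance_net(low)                        # reflectance layer, 3 channels
    L = illumination_net(low)                       # illumination layer, 1 channel
    recon = R * L                                   # Retinex composition I = R * L
    smooth = (L[..., :, 1:] - L[..., :, :-1]).abs().mean()   # toy illumination smoothness
    loss = (recon - low).pow(2).mean() + 0.1 * smooth
    opt.zero_grad()
    loss.backward()
    opt.step()

enhanced = reflectance_net(low) * illumination_net(low).clamp(min=1e-3).pow(0.5)  # toy gamma adjust
```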
2021
Xu Y, Li F, Chen Z, Liang J, Quan Y. Encoding Spatial Distribution of Convolutional Features for Texture Representation, in Advances in Neural Information Processing Systems (NeurIPS); 2021.
Liang J, Wang J, Quan Y, Chen T, Liu J, Ling H, Xu Y. Recurrent Exposure Generation for Low-Light Face Detection. IEEE Transactions on Multimedia. 2021;24:1609–1621.
2020
Yang W, Yuan Y, Ren W, Liu J, Scheirer WJ, Wang Z, Zhang T, Zhong Q, Xie D, Pu S, et al. Advancing Image Understanding in Poor Visibility Environments: A Collective Benchmark Study. IEEE Transactions on Image Processing (TIP). 2020;29:5737–5752.
Existing enhancement methods are empirically expected to help high-level end computer vision tasks; however, that is not always the case in practice. We focus on object and face detection under poor visibility caused by bad weather (haze, rain) and low-light conditions. To provide a more thorough examination and fair comparison, we introduce three benchmark sets collected in real-world hazy, rainy, and low-light conditions, respectively, with annotated objects/faces. We launched the UG2+ challenge Track 2 competition at IEEE CVPR 2019, aiming to evoke a comprehensive discussion and exploration about whether and how low-level vision techniques can benefit high-level automatic visual recognition in various scenarios. To the best of our knowledge, this is the first and currently largest effort of its kind. Baseline results obtained by cascading existing enhancement and detection models are reported, indicating the highly challenging nature of our new data as well as the large room for further technical innovations. Thanks to a large participation from the research community, we are able to analyze representative team solutions, striving to better identify the strengths and limitations of existing mindsets as well as future directions.
2019
Liang J, Xu Y, Bao C, Quan Y, Ji H. Barzilai–Borwein-based Adaptive Learning Rate for Deep Learning. Pattern Recognition Letters (PRL). 2019;128:197–203.
The learning rate is arguably the most important hyper-parameter to tune when training a neural network. As manually setting the right learning rate remains a cumbersome process, adaptive learning rate algorithms aim at automating it. Motivated by the success of the Barzilai–Borwein (BB) step-size method in many gradient descent methods for solving convex problems, this paper investigates the potential of the BB method for training neural networks. With strong motivation from the related convergence analysis, the BB method is generalized to provide an adaptive learning rate for mini-batch gradient descent. The experiments show that, in contrast to many existing methods, the proposed BB method is highly insensitive to the initial learning rate, especially in terms of generalization performance. The BB method also shows advantages in both learning speed and generalization performance over other available methods.
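To make the idea concrete, the sketch below applies a Barzilai–Borwein-style step size inside plain mini-batch gradient descent on a toy least-squares problem; the toy problem, the BB1 variant, and the safeguarding threshold are assumptions for illustration, and the paper's actual algorithm and stabilization details may differ.

```python
# Minimal sketch of a BB1 step size inside mini-batch gradient descent on a toy
# least-squares problem; problem setup and safeguards are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
b = rng.standard_normal(100)
theta = np.zeros(20)

def minibatch_grad(theta, idx):
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ theta - bi) / len(idx)

lr = 1e-2                                  # initial learning rate; BB updates make the
prev_theta, prev_grad = None, None         # method insensitive to this choice
for it in range(200):
    idx = rng.choice(100, size=32, replace=False)
    g = minibatch_grad(theta, idx)
    if prev_grad is not None:
        s = theta - prev_theta             # parameter difference
        y = g - prev_grad                  # gradient difference
        if abs(s @ y) > 1e-12:             # guard against tiny curvature estimates
            lr = abs((s @ s) / (s @ y))    # BB1 step size: |s^T s / s^T y|
    prev_theta, prev_grad = theta.copy(), g.copy()
    theta = theta - lr * g
```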