Publications by Type: Conference Paper

2025
Huang Y, Liao X, Liang J, Quan Y, Shi B, Xu Y. Zero-Shot Low-Light Image Enhancement via Latent Diffusion Models. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI); 2025. Abstract:
Low-light image enhancement (LLIE) aims to improve visibility and signal-to-noise ratio in images captured under poor lighting conditions. Despite impressive improvements, deep learning-based LLIE approaches require extensive training data, which is often difficult and costly to obtain. In this paper, we propose a zero-shot LLIE framework that, for the first time, leverages pre-trained latent diffusion models, which act as powerful priors to recover latent images from low-light inputs. Our approach introduces several components to alleviate the inherent challenges of utilizing pre-trained latent diffusion models: modeling the degradation process in an image-adaptive manner, penalizing latents outside the manifold of natural images, and balancing the strength of the guidance from the given low-light image during the denoising process. Experimental results demonstrate that our framework outperforms existing methods, achieving superior performance across various datasets.
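As a rough illustration of the guided-denoising idea described in the abstract (not the paper's actual implementation), the sketch below shows one diffusion step where a data-fidelity gradient from the low-light observation steers the latent; `denoiser`, `decode`, and `degrade` are placeholder callables for a pre-trained noise predictor, the latent decoder, and an image-adaptive degradation model.

```python
import torch

def guided_denoising_step(z_t, t, low_light, denoiser, decode, degrade,
                          alpha_bar_t, guidance_weight=0.1):
    """One denoising step with data-fidelity guidance from the low-light input.

    denoiser(z_t, t) predicts the noise; decode maps latents to images;
    degrade is a placeholder low-light degradation model (an assumption here).
    alpha_bar_t is the cumulative noise-schedule value at step t (a tensor).
    """
    z_t = z_t.detach().requires_grad_(True)
    eps = denoiser(z_t, t)
    # Estimate the clean latent from the current noisy latent (DDPM x0-estimate).
    z0_hat = (z_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
    # Guidance: the degraded clean estimate should match the observed low-light image.
    fidelity = torch.nn.functional.mse_loss(degrade(decode(z0_hat)), low_light)
    grad = torch.autograd.grad(fidelity, z_t)[0]
    # Nudge the noisy latent toward consistency with the observation.
    return (z_t - guidance_weight * grad).detach()
```

The guidance weight plays the balancing role the abstract mentions; stronger weights pull the sample toward the observation at the cost of the diffusion prior.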
Quan Y, Wan X, Tang Z, Liang J, Ji H. Multi-Focus Image Fusion via Explicit Defocus Blur Modelling. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI); 2025. Abstract:
Multi-focus image fusion (MFIF) is a critical technique for extending depth of field in photography, producing an all-in-focus image from multiple images captured at different focal lengths. While deep learning has shown promise in MFIF, most existing methods ignore the physical model of defocus blurring in their neural architecture design, limiting their interpretability and generalization. This paper presents a novel framework that integrates explicit defocus blur modeling into the MFIF process, leading to enhanced interpretability and performance. Leveraging an atom-based, spatially-varying parameterized defocus blurring model, our approach first computes pixel-wise defocus descriptors and initial focused images from the multi-focus source images in a scale-recurrent fashion, based on which soft decision maps are estimated. Image fusion is then performed using masks constructed from the decision maps, with separate treatment of pixels that are likely defocused in all source images or lie near the boundaries between defocused and focused regions. Model training is driven by a fusion loss and a cross-scale defocus estimation loss. Extensive experiments on benchmark datasets demonstrate the effectiveness of our approach.
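For intuition only, the following minimal sketch shows the final fusion step with soft decision maps used as per-pixel weights; how the maps are estimated (defocus descriptors, scale recurrence, boundary handling) is outside this sketch and the function name is illustrative.

```python
import numpy as np

def fuse_multi_focus(images, decision_maps, eps=1e-8):
    """Fuse multi-focus source images with per-pixel soft decision maps.

    images        : list of HxWx3 float arrays, each focused at a different depth
    decision_maps : list of HxW arrays in [0, 1]; higher = more likely in focus
    """
    w = np.stack(decision_maps, axis=0)               # K x H x W
    w = w / (w.sum(axis=0, keepdims=True) + eps)      # normalize weights per pixel
    imgs = np.stack(images, axis=0)                   # K x H x W x 3
    return (w[..., None] * imgs).sum(axis=0)          # soft-weighted fusion
```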
2024
Yang Y, Liang J, Yu B, Chen Y, Ren JS, Shi B. Latency Correction for Event-guided Deblurring and Frame Interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024:24977–24986. Abstract:
Event cameras, with their high temporal resolution, high dynamic range, and low power consumption, are particularly well suited to time-sensitive applications such as deblurring and frame interpolation. However, their performance is hindered by latency variability, especially under low-light conditions and with fast-moving objects. This paper addresses the challenge of latency in event cameras: the temporal discrepancy between the actual occurrence of a brightness change and the timestamp assigned to it by the sensor. Focusing on event-guided deblurring and frame interpolation, we propose a latency correction method based on a parameterized latency model. To enable data-driven learning, we develop an event-based temporal fidelity measure that describes the sharpness of latent images reconstructed from events and the corresponding blurry images, and we reformulate the event-based double integral model to be differentiable with respect to latency. The proposed method is validated on synthetic and real-world datasets, demonstrating the benefits of latency correction for deblurring and interpolation across different lighting conditions.
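To make the event-based double integral (EDI) relation concrete, here is a per-pixel sketch under simplifying assumptions: the latency is a constant timestamp offset (a stand-in for the paper's parameterized model), and the blurry intensity is the exposure-time average of the latent intensity modulated by the accumulated events.

```python
import numpy as np

def edi_latent_intensity(blurry, event_times, event_polarities, t_ref, exposure,
                         contrast_c=0.2, latency=0.0, n_samples=64):
    """Estimate the latent intensity at t_ref for one pixel from its blurry value
    and events, following the event-based double integral idea.

    blurry           : blurry pixel intensity averaged over the exposure
    event_times      : 1D array of event timestamps at this pixel (seconds)
    event_polarities : 1D array of +1 / -1 polarities
    latency          : constant timestamp offset (placeholder latency model)
    """
    t0, t1 = t_ref, t_ref + exposure
    corrected = np.asarray(event_times) - latency        # undo sensor latency
    pol = np.asarray(event_polarities, dtype=float)
    ts = np.linspace(t0, t1, n_samples)
    # E(t): signed event count accumulated from t_ref up to each sample time.
    E = np.array([pol[(corrected >= t0) & (corrected < t)].sum() for t in ts])
    # B = L(t_ref) * mean_t exp(c * E(t))  =>  L(t_ref) = B / mean(exp(c * E))
    return blurry / np.mean(np.exp(contrast_c * E))
```

Making this relation differentiable with respect to the latency parameters is what allows the correction to be learned from data, as the abstract describes.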
Yu B, Liang J, Wang Z, Fan B, Subpa-asa A, Shi B, Sato I. Active Hyperspectral Imaging Using an Event Camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024. Abstract:
Hyperspectral imaging plays a critical role in numerous scientific and industrial fields. Conventional hyperspectral imaging systems often struggle with the trade-off between spectral and temporal resolution, particularly in dynamic environments. In this work, we present an event-based active hyperspectral imaging system designed for real-time performance in dynamic scenes. By integrating a diffraction grating and a rotating mirror with an event-based camera, the proposed system captures high-fidelity spectral information at microsecond temporal resolution, leveraging the event camera's unique capability to detect instantaneous changes in brightness rather than absolute intensity. Compared with conventional frame-based systems, it reduces bandwidth and computational load; compared with mosaic-based systems, it retains the sensor's original spatial resolution. By recording only meaningful changes in brightness, the system achieves high temporal and spectral resolution with minimal latency, making it practical for real-time applications in complex dynamic conditions.
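As a toy illustration of the scanning principle (not the paper's calibration or pipeline), the sketch below assumes the rotating mirror sweeps the dispersed spectrum linearly in time, so an event's phase within a sweep indexes its wavelength; a real system would use a measured, generally non-linear calibration curve.

```python
import numpy as np

def events_to_spectrum(event_t, event_p, sweep_start, sweep_period,
                       lambda_min=400.0, lambda_max=700.0, n_bins=64):
    """Toy accumulation of events into spectral bins for one mirror sweep.

    event_t : event timestamps (seconds); event_p : +1 / -1 polarities.
    Assumes a linear time-to-wavelength mapping within each sweep.
    """
    phase = ((np.asarray(event_t) - sweep_start) % sweep_period) / sweep_period
    wavelengths = lambda_min + phase * (lambda_max - lambda_min)
    bins = np.linspace(lambda_min, lambda_max, n_bins + 1)
    spectrum, _ = np.histogram(wavelengths, bins=bins, weights=np.asarray(event_p, float))
    return bins[:-1], spectrum
```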
Yu B, Ren J, Han J, Wang F, Liang J, Shi B. EventPS: Real-Time Photometric Stereo Using an Event Camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024:9602–9611. Abstract:
Photometric stereo is a well-established technique for estimating the surface normal of an object. However, the requirement of capturing multiple high dynamic range images under different illumination conditions limits its speed and real-time applicability. This paper introduces EventPS, a novel approach to real-time photometric stereo using an event camera. Capitalizing on the exceptional temporal resolution, dynamic range, and low bandwidth of event cameras, EventPS estimates surface normals solely from radiance changes, significantly enhancing data efficiency. EventPS integrates seamlessly with both optimization-based and deep-learning-based photometric stereo techniques to offer a robust solution for non-Lambertian surfaces. Extensive experiments validate the effectiveness and efficiency of EventPS compared to frame-based counterparts. Our algorithm runs at over 30 fps in real-world scenarios, unleashing the potential of EventPS in time-sensitive and high-speed downstream applications.
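The sketch below shows the standard Lambertian ratio constraint that relative radiance measurements (the kind an event camera provides) naturally yield; it is a generic null-space solve, not claimed to be the paper's exact optimization- or learning-based algorithm.

```python
import numpy as np

def normal_from_radiance_ratios(light_dirs, ratios):
    """Estimate a Lambertian surface normal from radiance ratios between
    consecutive light directions via a null-space solve.

    light_dirs : (K, 3) unit light directions l_1 ... l_K
    ratios     : (K-1,) array with r_i = I_{i+1} / I_i
    Under I_i = rho * (l_i . n), each ratio gives (l_{i+1} - r_i * l_i) . n = 0.
    """
    A = light_dirs[1:] - ratios[:, None] * light_dirs[:-1]   # (K-1, 3) constraints
    _, _, vt = np.linalg.svd(A)
    n = vt[-1]                        # direction spanning the null space of A
    if n[2] < 0:                      # orient the normal toward the camera (+z)
        n = -n
    return n / np.linalg.norm(n)
```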
Zhong H, Hong Y, Weng S, Liang J, Shi B. Language-Guided Image Reflection Separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024:24913–24922. Abstract:
This paper studies the problem of language-guided reflection separation, which addresses the ill-posed reflection separation problem by introducing language descriptions that specify layer content. We propose a unified framework that leverages a cross-attention mechanism with contrastive learning strategies to construct the correspondence between language descriptions and image layers. A gated network design and a randomized training strategy are employed to tackle recognizable layer ambiguity. The effectiveness of the proposed method is validated by its significant performance advantage over existing reflection separation methods in both quantitative and qualitative comparisons.
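For readers unfamiliar with the mechanism, here is a minimal single-head cross-attention sketch in which image features attend over language token features; the dimensions, head count, and module name are illustrative simplifications, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LanguageImageCrossAttention(nn.Module):
    """Single-head cross-attention: image features query language token features.

    A simplified stand-in for a language-image correspondence module; assumes
    dim == img_dim so the residual connection is valid.
    """
    def __init__(self, img_dim=256, txt_dim=512, dim=256):
        super().__init__()
        self.q = nn.Linear(img_dim, dim)
        self.k = nn.Linear(txt_dim, dim)
        self.v = nn.Linear(txt_dim, dim)
        self.scale = dim ** -0.5

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, N_pixels, img_dim), txt_feat: (B, N_tokens, txt_dim)
        q, k, v = self.q(img_feat), self.k(txt_feat), self.v(txt_feat)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return img_feat + attn @ v   # residual update conditioned on language
```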
Hong Y, Zhong H, Weng S, Liang J, Shi B. L-DiffER: Single Image Reflection Removal with Language-based Diffusion Model. In Proceedings of the European Conference on Computer Vision (ECCV); 2024. Abstract:
In this paper, we introduce L-DiffER, a language-based diffusion model designed for the ill-posed single image reflection removal task. Although they have shown impressive performance in image generation, existing language-based diffusion models struggle with precise control and faithfulness in image restoration. To overcome these limitations, we propose an iterative condition refinement strategy to resolve the problem of inaccurate control conditions. A multi-condition constraint mechanism is employed to ensure faithful recovery of image color and structure while retaining the generative capability needed to handle low-transmitted reflections. We demonstrate the superiority of the proposed method through extensive experiments, showcasing both quantitative and qualitative improvements over existing methods.
Lou H, Liang J, Teng M, Fan B, Xu Y, Shi B. Zero-Shot Event-Intensity Asymmetric Stereo via Visual Prompting from Image Domain. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 37; 2024:13274–13301.
2023
Yang Y, Han J, Liang J, Sato I, Shi B. Learning Event Guided High Dynamic Range Video Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023:13924–13934. Abstract:
Limited by the trade-off between frame rate and exposure time when capturing moving scenes with conventional cameras, frame-based HDR video reconstruction suffers from scene-dependent exposure ratio balancing and ghosting artifacts. Event cameras provide an alternative visual representation with much higher dynamic range and temporal resolution, free from the above issues, which can serve as effective guidance for HDR imaging from LDR videos. In this paper, we propose a multimodal learning framework for event-guided HDR video reconstruction. To better leverage the knowledge of the same scene captured by the two modalities of visual signals, we propose a multimodal representation alignment strategy that learns a shared latent space, together with a fusion module tailored to combining the two types of signals across different dynamic ranges in different regions. Temporal correlations are exploited recurrently to suppress flickering effects in the reconstructed HDR video. The proposed HDRev-Net demonstrates state-of-the-art performance quantitatively and qualitatively on both synthetic and real-world data.
Liang J, Yang Y, Li B, Duan P, Xu Y, Shi B. Coherent Event Guided Low-Light Video Enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2023:10615–10625. Abstract:
With frame-based cameras, capturing fast-moving scenes without suffering from blur often comes at the cost of low SNR and low contrast. Worse still, the photometric constancy that enhancement techniques heavily rely on is fragile for frames with short exposure. Event cameras can record brightness changes at an extremely high temporal resolution. For low-light videos, event data not only help capture temporal correspondences but also provide alternative observations in the form of intensity ratios between consecutive frames, as well as exposure-invariant information. Motivated by this, we propose a low-light video enhancement method with hybrid inputs of events and frames. Specifically, a neural network is trained to establish spatiotemporal coherence between visual signals of different modalities and resolutions by constructing a correlation volume across space and time. Experimental results on synthetic and real data demonstrate the superiority of the proposed method over state-of-the-art methods.
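To show what a correlation volume computes, here is a minimal all-pairs sketch between two feature maps (e.g., features from adjacent frames or from frame and event branches); it illustrates the building block only, not the paper's network or its multi-resolution, spatiotemporal variant.

```python
import torch

def correlation_volume(feat_a, feat_b):
    """All-pairs feature correlation between two feature maps.

    feat_a, feat_b : (B, C, H, W) feature tensors
    returns        : (B, H, W, H, W) volume of scaled dot products
    """
    b, c, h, w = feat_a.shape
    fa = feat_a.reshape(b, c, h * w)                       # (B, C, HW)
    fb = feat_b.reshape(b, c, h * w)
    corr = torch.einsum('bci,bcj->bij', fa, fb) / c ** 0.5  # (B, HW, HW)
    return corr.reshape(b, h, w, h, w)
```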
Lv J, Guo H, Chen G, Liang J, Shi B. Non-Lambertian Multispectral Photometric Stereo via Spectral Reflectance Decomposition. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI), Macau SAR, China; 2023:1249–1257. Abstract:
Multispectral photometric stereo (MPS) aims to recover the surface normal of a scene from a single-shot multispectral image captured under multispectral illuminations. Existing MPS methods adopt the Lambertian reflectance model to make the problem tractable, but this greatly limits their application to real-world surfaces. In this paper, we propose a deep neural network named NeuralMPS to solve the MPS problem under non-Lambertian spectral reflectances. Specifically, we present a spectral reflectance decomposition model that disentangles the spectral reflectance into a geometric component and a spectral component. With this decomposition, we show that the MPS problem for surfaces of a uniform material is equivalent to conventional photometric stereo (CPS) with unknown light intensities. In this way, NeuralMPS reduces the difficulty of the non-Lambertian MPS problem by leveraging well-studied non-Lambertian CPS methods. Experiments on both synthetic and real-world scenes demonstrate the effectiveness of our method.
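Since the abstract argues that uniform-material MPS reduces to conventional photometric stereo with unknown light intensities, the sketch below recalls the classical Lambertian least-squares solve that such a reduction targets; it is the textbook baseline, not NeuralMPS itself, and unknown per-channel intensities are assumed to be absorbed into the effective light vectors.

```python
import numpy as np

def lambertian_photometric_stereo(intensities, light_dirs):
    """Classical Lambertian photometric stereo at one pixel.

    intensities : (K,) observed intensities under K lights
    light_dirs  : (K, 3) light direction vectors (intensity-scaled if known)
    Solves L @ (rho * n) = I in the least-squares sense.
    """
    g, *_ = np.linalg.lstsq(light_dirs, intensities, rcond=None)
    rho = np.linalg.norm(g)              # albedo-like scale
    return g / rho, rho                  # unit normal, scale
```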
2021
Xu Y, Li F, Chen Z, Liang J, Quan Y. Encoding Spatial Distribution of Convolutional Features for Texture Representation. In Advances in Neural Information Processing Systems (NeurIPS); 2021.