Song Y, Wang J, Ma L, Yu J, Liang J, Yuan L, Yu Z.
MARN: Multi-level Attentional Reconstruction Networks for Weakly Supervised Video Temporal Grounding. Neurocomputing. 2022;554:126625.
Abstract: Video temporal grounding is a challenging task in computer vision that involves localizing, within a video, the segment semantically related to a given query. In this paper, we propose a novel weakly-supervised model called the Multi-level Attentional Reconstruction Networks (MARN), which is trained on video-sentence pairs. During the training phase, we leverage the idea of attentional reconstruction to train an attention map that can reconstruct the given query. At inference time, proposals are ranked based on attention scores to localize the most suitable segment. In contrast to previous methods, MARN effectively aligns video-level supervision with proposal scoring, thereby reducing the training-inference discrepancy. In addition, we incorporate a multi-level framework that encompasses both proposal-level and clip-level processes. The proposal-level process generates and scores variable-length time sequences, while the clip-level process generates and scores fixed-length time sequences to refine the predicted proposal scores in both training and testing. To improve the feature representation of the video, we propose a novel representation mechanism that utilizes intra-proposal information and adopts 2D convolution to extract inter-proposal clues for learning reliable attention maps. By accurately representing these proposals, we can better align them with the textual modality, thereby facilitating model learning. Our proposed MARN is evaluated on two benchmark datasets, and extensive experiments demonstrate its superiority over existing methods.
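To make the attentional-reconstruction idea concrete, here is a minimal PyTorch sketch of a query-conditioned attention map over clip features, pooled to reconstruct the query words. All module names, dimensions, and the reconstruction loss are illustrative assumptions for exposition, not the authors' released code.

```python
# Hypothetical sketch of attentional reconstruction for weakly supervised
# temporal grounding: learn an attention map over clips whose pooled feature
# can reconstruct the query (video-level supervision only).
import torch
import torch.nn as nn

class AttentionalReconstruction(nn.Module):
    def __init__(self, video_dim=512, query_dim=300, hidden=256, vocab=1000):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)
        self.query_proj = nn.Linear(query_dim, hidden)
        self.attn = nn.Linear(hidden, 1)
        # Decoder that tries to reconstruct the query words from the
        # attention-pooled video feature.
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.word_head = nn.Linear(hidden, vocab)

    def forward(self, clips, query):
        # clips: (B, T, video_dim), query: (B, L, query_dim)
        v = self.video_proj(clips)                            # (B, T, H)
        q = self.query_proj(query).mean(dim=1, keepdim=True)  # (B, 1, H)
        scores = self.attn(torch.tanh(v + q)).squeeze(-1)     # (B, T)
        alpha = scores.softmax(dim=-1)                        # attention map
        pooled = (alpha.unsqueeze(-1) * v).sum(dim=1)         # (B, H)
        # Decode word logits from the pooled video representation.
        dec_in = pooled.unsqueeze(1).expand(-1, query.size(1), -1)
        out, _ = self.decoder(dec_in)
        return alpha, self.word_head(out)                     # (B, L, vocab)

model = AttentionalReconstruction()
clips = torch.randn(2, 32, 512)   # 32 clip features per video
query = torch.randn(2, 8, 300)    # 8 word embeddings per sentence
alpha, logits = model(clips, query)
words = torch.randint(0, 1000, (2, 8))
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), words.reshape(-1))
```

At inference, a proposal score could then be obtained by averaging the learned attention over each candidate segment and ranking proposals by that average, in line with the abstract's description.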
Liang J, Xu Y, Quan Y, Shi B, Ji H.
Self-Supervised Low-Light Image Enhancement Using Discrepant Untrained Network Priors. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). 2022;32:7332–7345.
Abstract: This paper proposes a deep learning method for low-light image enhancement, which exploits the generation capability of Neural Networks (NNs) while requiring no training samples except the input image itself. Based on the Retinex decomposition model, the reflectance and illumination of a low-light image are parameterized by two untrained NNs. The ambiguity between the two layers is resolved by the discrepancy between the two NNs in terms of architecture and capacity, while the complex noise with spatially-varying characteristics is handled by an illumination-adaptive self-supervised denoising module. The enhancement is done by jointly optimizing the Retinex decomposition and the illumination adjustment. Extensive experiments show that the proposed method not only outperforms existing non-learning-based and unsupervised-learning-based methods, but also competes favorably with some supervised-learning-based methods in extreme low-light conditions.
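The core mechanism, two discrepant untrained networks fit to a single image under a Retinex model, can be sketched in a few lines of PyTorch in the deep-image-prior style. The architectures, loss, iteration count, and the gamma adjustment below are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch: parameterize reflectance and illumination with two
# untrained networks of different capacity, and fit them to one low-light
# image (self-supervised, no training pairs).
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

# Deeper, higher-capacity net for the detail-rich reflectance layer.
reflectance_net = nn.Sequential(
    conv_block(3, 64), conv_block(64, 64), conv_block(64, 64),
    nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid())

# Shallower, lower-capacity net for the smooth illumination layer; this
# architectural discrepancy is what disambiguates the two layers.
illumination_net = nn.Sequential(
    conv_block(3, 16), nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

low_light = torch.rand(1, 3, 64, 64)  # stand-in for the single input image
params = list(reflectance_net.parameters()) + list(illumination_net.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

for step in range(200):
    R = reflectance_net(low_light)    # (1, 3, H, W) reflectance
    L = illumination_net(low_light)   # (1, 1, H, W) illumination
    recon = R * L                     # Retinex model: I = R * L
    loss = nn.functional.mse_loss(recon, low_light)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Enhancement: keep the reflectance and brighten the illumination, here with
# a simple gamma curve as a placeholder for the paper's illumination adjustment.
enhanced = reflectance_net(low_light) * illumination_net(low_light).clamp(1e-3).pow(0.4)
```

The paper additionally handles spatially-varying noise with an illumination-adaptive self-supervised denoising module, which this minimal sketch omits.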