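Inspired by the success of BERT, several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve better performance by capturing high-level semantic information from large-scale multimodal pretraining. Among them, LXMERT and UNITER adopt visual region feature regression and label classification as pretext tasks. However, visual features pretrained on crowdsourced datasets with limited and inconsistent semantic labeling tend to suffer from noisy labels and sparse semantic annotations. To overcome these issues, we propose unbiased Dense Contrastive Visual-Linguistic Pretraining (DCVLP), which replaces region regression and classification with cross-modality region contrastive learning that requires no annotations. To improve the quality of the negative samples used in contrastive learning, we propose two data augmentation strategies (mask perturbation and intra-/inter-adversarial perturbation). Overall, DCVLP allows cross-modality dense region contrastive learning in a self-supervised setting independent of any object annotations. We compare our method against prior visual-linguistic pretraining frameworks to validate the superiority of dense contrastive learning for multimodal representation learning.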
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100X fewer labels.
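The contrastive prediction task described above can be illustrated with a normalized, temperature-scaled cross-entropy (NT-Xent) loss over pairs of augmented views. The following is a minimal sketch using only the Python standard library, not the authors' implementation: the function names, the default temperature of 0.5, and the toy embeddings are assumptions chosen for illustration, and the inputs stand in for the outputs of the nonlinear projection head.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """Mean NT-Xent loss over N positive pairs (view_a[i], view_b[i]).

    view_a and view_b are lists of N embedding vectors, one per augmented
    view of the same N images. The temperature value is an assumption.
    """
    n = len(view_a)
    z = [l2_normalize(v) for v in list(view_a) + list(view_b)]
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same image
        # Temperature-scaled cosine similarity to every other embedding;
        # the anchor itself is excluded from the denominator.
        logits = {
            j: sum(p * q for p, q in zip(z[i], z[j])) / temperature
            for j in range(2 * n)
            if j != i
        }
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in logits.values()))
        total += log_denom - logits[pos]
    return total / (2 * n)


if __name__ == "__main__":
    # Toy embeddings: three "images", two views each.
    a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    b = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]]
    print("loss with matched views:", nt_xent_loss(a, b))
```

Because every other embedding in the batch serves as a negative, a larger batch supplies more negatives per anchor, which is consistent with the abstract's observation that contrastive learning benefits from larger batch sizes.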
Learning a good representation for space-time correspondence is key to various computer vision tasks, including tracking object bounding boxes and performing video object pixel segmentation. To learn generalizable representations for correspondence at scale, a variety of self-supervised pretext tasks have been proposed that explicitly perform object-level or patch-level similarity learning. Instead of following the previous literature, we propose to learn correspondence using Video Frame-level Similarity (VFS) learning, i.e., simply learning from comparing video frames. Our work is inspired by the recent success of image-level contrastive learning and similarity learning for visual recognition. Our hypothesis is that if the representation is good for recognition, it requires the convolutional features to find correspondence between similar objects or parts. Our experiments show the surprising result that VFS surpasses state-of-the-art self-supervised approaches for both OTB visual object tracking and DAVIS video object segmentation. We perform a detailed analysis of what matters in VFS and reveal new properties of image-level and frame-level similarity learning. Project page with code is available at https://jerryxu.net/VFS
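One way to probe whether frame-level features support correspondence is to propagate labels from one frame to the next by feature similarity. The sketch below is a simplified nearest-neighbour variant written for illustration, not the evaluation protocol used in the paper: the function names, the flat list-of-vectors feature layout, and the toy labels are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def propagate_labels(src_feats, src_labels, tgt_feats):
    """Give each target location the label of its most similar source location.

    src_feats / tgt_feats: one feature vector per spatial location of the
    source and target frame (a hypothetical flattened feature map).
    src_labels: one label per source location, e.g. a segmentation mask.
    """
    propagated = []
    for t in tgt_feats:
        best = max(range(len(src_feats)), key=lambda i: cosine(src_feats[i], t))
        propagated.append(src_labels[best])
    return propagated


if __name__ == "__main__":
    # Two source locations with known labels; the object has moved in the
    # target frame, so the matching features appear in swapped order.
    src_feats = [[1.0, 0.0], [0.0, 1.0]]
    src_labels = ["object", "background"]
    tgt_feats = [[0.1, 0.9], [0.8, 0.2]]
    print(propagate_labels(src_feats, src_labels, tgt_feats))
```

If the learned features place similar objects or parts close together, as the hypothesis above suggests, this kind of matching recovers the object's location in the next frame without any correspondence-specific training.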