This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
In recent years, deep learning has produced a stream of architectures of ever-growing capability and capacity. Supported by steadily improving hardware, today's models can easily digest millions of images and are beginning to push toward hundreds of millions of labeled images.
In natural language processing, this appetite for data has been successfully addressed by self-supervised pre-training. The solutions, based on autoregressive language modeling in GPT and masked autoencoding in BERT, are conceptually simple: they remove a portion of the data and learn to predict the removed content. These methods now enable training generalizable NLP models containing over one hundred billion parameters.
Masked autoencoders, a more general form of denoising autoencoders, are applicable to computer vision as well. Indeed, closely related research in vision preceded BERT, and interest in the idea surged after BERT's success. Despite this, progress on autoencoding methods in vision still lags behind NLP. Kaiming He and his co-authors ask: what causes this difference?
They attempt to answer this question from the following angles:
1. Architectural differences. In computer vision, convolutional networks dominated the past decade. With the introduction of Vision Transformers (ViT), however, this architectural gap has narrowed and should no longer be an obstacle.
2. Differences in information density. Language is a highly semantic, human-generated signal that is information-dense. When a model is trained to predict only a few missing words per sentence, the task appears to induce sophisticated language understanding. Vision is different: images are natural signals with heavy spatial redundancy. A missing patch, for example, can be recovered from neighboring patches with little high-level understanding of parts, objects, and scenes.
To overcome this difference and encourage learning useful features, the researchers show that a simple strategy works remarkably well in computer vision: masking a very high proportion of random patches. This strategy largely removes redundancy and creates a challenging self-supervisory task that requires holistic understanding beyond low-level image statistics. Figures 2 through 4 of the paper show qualitative results of this reconstruction task.
3. The autoencoder's decoder, which maps the latent representation back to the input, plays a different role in reconstructing text versus images. In vision, the decoder reconstructs pixels, so its output is of a lower semantic level than common recognition tasks. This is the opposite of language, where the decoder predicts missing words that carry rich semantic information. While the decoder in BERT can be trivial (an MLP), He and his co-authors found that for images, the decoder design plays a key role in determining the semantic level of the learned latent representation.
Based on this analysis, the researchers propose a simple, effective, and scalable masked autoencoder (MAE) for visual representation learning. The MAE masks random patches of the input image and reconstructs the missing patches in pixel space. It has an asymmetric encoder-decoder design: the encoder operates only on the visible subset of patches (without mask tokens), while a lightweight decoder reconstructs the input from the latent representation and mask tokens (Figure 1).
In this asymmetric encoder-decoder, shifting the mask tokens to the small decoder greatly reduces computation. Under this design, a very high masking ratio (e.g., 75%) creates a win-win: it improves accuracy while letting the encoder process only a small fraction (e.g., 25%) of the patches. This cuts overall pre-training time to a third or less, reduces memory consumption, and makes it easy to scale MAE to large models.
MAE learns very high-capacity models that generalize well. With MAE pre-training, the researchers can train data-hungry models such as ViT-Large/-Huge on ImageNet-1K with improved generalization. For example, a vanilla ViT-Huge model achieves 87.8% accuracy after fine-tuning on ImageNet-1K, better than all previous methods that use only ImageNet-1K data.
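To make the masking mechanics concrete, here is a minimal sketch of MAE-style per-sample random masking in PyTorch, following the procedure the paper describes (mask 75% of patches, encode only the visible ones, compute the loss on masked patches). The function name `random_masking`, the shapes, and the toy call are illustrative assumptions, not the authors' released code.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patches per sample; return the kept patches,
    a binary mask (1 = removed), and indices to restore the original order.

    patches: [batch, num_patches, dim] -- the image already split into patch embeddings.
    """
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N)                      # uniform noise per patch
    ids_shuffle = torch.argsort(noise, dim=1)     # a random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N)
    mask[:, :num_keep] = 0                        # 0 = kept, 1 = masked
    mask = torch.gather(mask, 1, ids_restore)     # back to original patch order
    return visible, mask, ids_restore

# The encoder sees only `visible` (25% of patches at mask_ratio=0.75); the
# decoder receives the encoded tokens plus learned mask tokens scattered back
# via `ids_restore`, and the reconstruction loss is computed on masked patches only.
x = torch.randn(2, 196, 768)                      # e.g., 14x14 patches from a 224x224 image
visible, mask, ids_restore = random_masking(x)
print(visible.shape)                              # torch.Size([2, 49, 768])
```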
CVPR 2021 accepted a total of 1,663 papers, including the following:
Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction
Bohan Wu, Suraj Nair, Roberto Martin-Martin, Li Fei-Fei, Chelsea Finn
[pdf] [supp]
[bibtex]
Over-the-Air Adversarial Flickering Attacks Against Video Recognition Networks
Roi Pony, Itay Naeh, Shie Mannor
[pdf] [supp] [arXiv]
[bibtex]
Person30K: A Dual-Meta Generalization Network for Person Re-Identification
Yan Bai, Jile Jiao, Wang Ce, Jun Liu, Yihang Lou, Xuetao Feng, Ling-Yu Duan
[pdf]
[bibtex]
Privacy Preserving Localization and Mapping From Uncalibrated Cameras
Marcel Geppert, Viktor Larsson, Pablo Speciale, Johannes L. Schonberger, Marc Pollefeys
[pdf] [supp]
[bibtex]
The latest tutorial on self-supervised learning, from UIUC.
Neural language generation (NLG), the use of neural network models to generate coherent text, is one of the most promising approaches to automated text creation. In recent years, advances in deep contextual language modeling (e.g., LSTMs, GPT, GPT-2) and transfer learning (e.g., ELMo, BERT) have driven a paradigm shift in neural text generation. While these tools have greatly improved the state of NLG, state-of-the-art NLG models still face many challenges on low-resource tasks: the generated text lacks diversity, described situations violate commonsense rules, factual information is hard to incorporate, and reliable evaluation metrics are hard to design. This tutorial surveys the state of the art in current neural architectures and how they shape recent research directions in text generation. We discuss how and why these models succeed or fail at generating coherent text, and offer insights on several applications.
In a common machine learning problem, a model estimated on a training dataset is used to predict future outcome values from observed features. Many learning algorithms have been proposed and shown to succeed when the test and training data come from the same distribution. However, the best-performing models for a given training distribution typically exploit subtle statistical relationships among features, which makes them more prone to prediction errors when applied to test data whose distribution differs from the training data. Developing learning models that transfer across data stably and robustly is critical for both academic research and practical applications.
Causal inference, the process of drawing conclusions about causal relationships from the conditions under which effects occur, is a powerful statistical modeling tool for explanatory and stable learning. This tutorial focuses on causal inference and stable learning, aiming to extract causal knowledge from observational data to improve the interpretability and stability of machine learning algorithms. We first introduce causal inference and present recent data-driven methods for estimating causal effects from observational data, especially in high-dimensional settings. To bridge the gap between causal inference and machine learning, we then define stability and robustness for learning algorithms and introduce recent stable learning algorithms that improve the stability and interpretability of prediction. Finally, we discuss applications and future directions of stable learning, and provide benchmarks for it.
Topic: GANs in computer vision: Introduction to generative learning
Summary: In this survey series, we focus on the large body of GANs for computer vision applications. Specifically, we build step by step on the ideas and principles that led to the evolution of the generative adversarial network (GAN). We cover a range of tasks such as conditional image generation, 3D object generation, and video synthesis.
Broadly speaking, data generation methods appear in a wide variety of modern deep learning applications, from computer vision to natural language processing. At this point, we can generate data that is almost indistinguishable to the naked eye. Generative learning falls roughly into two major categories: (a) variational autoencoders (VAEs) and (b) generative adversarial networks (GANs); a minimal sketch of the adversarial training loop follows below.
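To illustrate the adversarial principle behind the second category, here is a minimal GAN training loop in PyTorch. The toy 2-D data, the network sizes, and the names `G` and `D` are assumptions for illustration, not code from the survey.

```python
import torch
import torch.nn as nn

# Tiny generator and discriminator for 2-D toy data (illustrative only).
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 2) * 0.5 + 2.0        # stand-in "real" distribution
    z = torch.randn(64, 8)                       # latent noise
    fake = G(z)

    # Discriminator step: classify real samples as 1, generated samples as 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator (non-saturating loss).
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The two optimizers pull in opposite directions: `D` learns to separate real from generated samples, while `G` learns to produce samples that `D` scores as real, which is the adversarial game the survey builds on.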
Self-Supervised Learning is a new paradigm between unsupervised and supervised learning that aims to reduce the demanding need for large amounts of annotated data. It provides proxy supervision signals for feature learning by defining annotation-free pretext tasks. jason718 has compiled an up-to-date collection of papers on self-supervised learning, well worth a look (a minimal pretext-task sketch follows the link below)!
地址:https://github.com/jason718/awesome-self-supervised-learning
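As a concrete example of an annotation-free pretext task, here is a minimal sketch of rotation prediction, the idea behind "Unsupervised Representation Learning by Predicting Image Rotations" in the list below. The small conv encoder and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def rotate_batch(images: torch.Tensor):
    """Build the pretext task: rotate each image by 0/90/180/270 degrees.
    The rotation index serves as a free label. images: [B, C, H, W]."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                           for img, k in zip(images, labels)])
    return rotated, labels

# Illustrative 4-way rotation classifier on top of a small conv encoder.
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(32, 4)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)

images = torch.randn(8, 3, 32, 32)               # stand-in unlabeled images
rotated, labels = rotate_batch(images)
loss = nn.functional.cross_entropy(head(encoder(rotated)), labels)
opt.zero_grad(); loss.backward(); opt.step()
# After pre-training, `encoder` is kept as a feature extractor; `head` is discarded.
```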
A curated list of awesome Self-Supervised Learning resources. Inspired by awesome-deep-vision, awesome-adversarial-machine-learning, awesome-deep-learning-papers, and awesome-architecture-search
Self-Supervised Learning has become an exciting direction in the AI community.
Please help contribute to this list by contacting me or adding a pull request
Markdown format:
- Paper Name. [[pdf]](link) [[code]](link) - Author 1, Author 2, and Author 3. *Conference Year*
FAIR Self-Supervision Benchmark[repo]: various benchmark (and legacy) tasks for evaluating quality of visual representations learned by various self-supervision approaches.
Unsupervised Visual Representation Learning by Context Prediction.[pdf][code]
Unsupervised Learning of Visual Representations using Videos.[pdf][code]
Learning to See by Moving.[pdf][code]
Learning image representations tied to ego-motion.[pdf][code]
Joint Unsupervised Learning of Deep Representations and Image Clusters.[pdf][code-torch][code-caffe]
Unsupervised Deep Embedding for Clustering Analysis.[pdf][code]
Slow and steady feature analysis: higher order temporal coherence in video.[pdf]
Context Encoders: Feature Learning by Inpainting.[pdf][code]
Colorful Image Colorization.[pdf][code]
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles.[pdf][code]
Ambient Sound Provides Supervision for Visual Learning.[pdf][code]
Learning Representations for Automatic Colorization.[pdf][code]
Unsupervised Visual Representation Learning by Graph-based Consistent Constraints.[pdf][code]
Adversarial Feature Learning.[pdf][code]
Self-supervised learning of visual features through embedding images into text topic spaces.[pdf][code]
Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction.[pdf][code]
Learning Features by Watching Objects Move.[pdf][code]
Colorization as a Proxy Task for Visual Understanding.[pdf][code]
DeepPermNet: Visual Permutation Learning.[pdf][code]
Unsupervised Learning by Predicting Noise.[pdf][code]
Multi-task Self-Supervised Visual Learning.[pdf]
Representation Learning by Learning to Count.[pdf]
Transitive Invariance for Self-supervised Visual Representation Learning.[pdf]
Look, Listen and Learn.[pdf]
Unsupervised Representation Learning by Sorting Sequences.[pdf][code]
Unsupervised Feature Learning via Non-Parametric Instance Discrimination.[pdf][code]
Learning Image Representations by Completing Damaged Jigsaw Puzzles.[pdf]
Unsupervised Representation Learning by Predicting Image Rotations.[pdf][code]
Learning Latent Representations in Neural Networks for Clustering through Pseudo Supervision and Graph-based Activity Regularization.[pdf][code]
Improvements to context based self-supervised learning.[pdf]
Self-Supervised Feature Learning by Learning to Spot Artifacts.[pdf][code]
Boosting Self-Supervised Learning via Knowledge Transfer.[pdf]
Cross-domain Self-supervised Multi-task Feature Learning Using Synthetic Imagery.[pdf][code]
ShapeCodes: Self-Supervised Feature Learning by Lifting Views to Viewgrids.[pdf]
Deep Clustering for Unsupervised Learning of Visual Features[pdf]
Cross Pixel Optical-Flow Similarity for Self-Supervised Learning.[pdf]
Representation Learning with Contrastive Predictive Coding.[pdf]
Self-Supervised Learning via Conditional Motion Propagation.[pdf][code]
Self-Supervised Representation Learning by Rotation Feature Decoupling.[pdf][code]
Revisiting Self-Supervised Visual Representation Learning.[pdf][code]
AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data.[pdf][code]
Unsupervised Deep Learning by Neighbourhood Discovery.[pdf][code]
Contrastive Multiview Coding.[pdf][code]
Large Scale Adversarial Representation Learning.[pdf]
Learning Representations by Maximizing Mutual Information Across Views.[pdf][code]
Selfie: Self-supervised Pretraining for Image Embedding.[pdf]
Data-Efficient Image Recognition with Contrastive Predictive Coding[pdf]
Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty[pdf][code]
Boosting Few-Shot Visual Learning with Self-Supervision[pdf]
Self-Supervised Generalisation with Meta Auxiliary Learning[pdf][code]
Wasserstein Dependency Measure for Representation Learning[pdf][code]
Scaling and Benchmarking Self-Supervised Visual Representation Learning[pdf][code]
A critical analysis of self-supervision, or what we can learn from a single image[pdf][code]
On Mutual Information Maximization for Representation Learning[pdf][code]
Understanding the Limitations of Variational Mutual Information Estimators[pdf][code]
Automatic Shortcut Removal for Self-Supervised Representation Learning[pdf]
Momentum Contrast for Unsupervised Visual Representation Learning[pdf]
A Simple Framework for Contrastive Learning of Visual Representations[pdf]
ClusterFit: Improving Generalization of Visual Representations[pdf]
Self-Supervised Learning of Pretext-Invariant Representations[pdf]
Unsupervised Learning of Video Representations using LSTMs.[pdf][code]
Shuffle and Learn: Unsupervised Learning using Temporal Order Verification.[pdf][code]
LSTM Self-Supervision for Detailed Behavior Analysis[pdf]
Self-Supervised Video Representation Learning With Odd-One-Out Networks.[pdf]
Unsupervised Learning of Long-Term Motion Dynamics for Videos.[pdf]
Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning.[pdf]
Improving Spatiotemporal Self-Supervision by Deep Reinforcement Learning.[pdf]
Self-supervised learning of a facial attribute embedding from video.[pdf]
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles.[pdf]
Self-Supervised Spatio-Temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics.[pdf]
DynamoNet: Dynamic Action and Motion Network.[pdf]
Learning Correspondence from the Cycle-consistency of Time.[pdf][code]
Joint-task Self-supervised Learning for Temporal Correspondence.[pdf][code]
Self-supervised Learning of Motion Capture.[pdf][code][web]
Unsupervised Learning of Depth and Ego-Motion from Video.[pdf][code][web]
Active Stereo Net: End-to-End Self-Supervised Learning for Active Stereo Systems.[project]
Self-Supervised Relative Depth Learning for Urban Scene Understanding.[pdf][project]
Geometry-Aware Learning of Maps for Camera Localization.[pdf][code]
Self-supervised Learning of Geometrically Stable Features Through Probabilistic Introspection.[pdf][web]
Self-Supervised Learning of 3D Human Pose Using Multi-View Geometry.[pdf]
SelFlow: Self-Supervised Learning of Optical Flow.[pdf]
Unsupervised Learning of Landmarks by Descriptor Vector Exchange.[pdf][code][web]
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features.[pdf][code]
Objects that Sound.[pdf]
Learning to Separate Object Sounds by Watching Unlabeled Video.[pdf][project]
The Sound of Pixels.[pdf][project]
Learnable PINs: Cross-Modal Embeddings for Person Identity.[pdf][web]
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization.[pdf]
Self-Supervised Generation of Spatial Audio for 360° Video.[pdf]
TriCycle: Audio Representation Learning from Sensor Network Data Using Self-Supervision[pdf]
Self-taught Learning: Transfer Learning from Unlabeled Data.[pdf]
Representation Learning: A Review and New Perspectives.[pdf]
Curiosity-driven Exploration by Self-supervised Prediction.[pdf][code]
Large-Scale Study of Curiosity-Driven Learning.[pdf]
Playing hard exploration games by watching YouTube.[pdf]
Unsupervised State Representation Learning in Atari.[pdf][code]
Improving Robot Navigation Through Self-Supervised Online Learning[pdf]
Reverse Optical Flow for Self-Supervised Adaptive Autonomous Robot Navigation[pdf]
Online self-supervised learning for dynamic object segmentation[pdf]
Self-Supervised Online Learning of Basic Object Push Affordances[pdf]
Self-supervised learning of grasp dependent tool affordances on the iCub Humanoid robot[pdf]
Persistent self-supervised learning principle: from stereo to monocular vision for obstacle avoidance[pdf]
The Curious Robot: Learning Visual Representations via Physical Interactions.[pdf]
Learning to Poke by Poking: Experiential Learning of Intuitive Physics.[pdf]
Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours.[pdf]
Supervision via Competition: Robot Adversaries for Learning Tasks.[pdf]
Multi-view Self-supervised Deep Learning for 6D Pose Estimation in the Amazon Picking Challenge.[pdf][Project]
Combining Self-Supervised Learning and Imitation for Vision-Based Rope Manipulation.[pdf][Project]
Learning to Fly by Crashing[pdf]
Self-supervised learning as an enabling technology for future space exploration robots: ISS experiments on monocular distance learning[pdf]
Unsupervised Perceptual Rewards for Imitation Learning.[pdf][project]
Self-Supervised Visual Planning with Temporal Skip Connections.[pdf]
CASSL: Curriculum Accelerated Self-Supervised Learning.[pdf]
Time-Contrastive Networks: Self-Supervised Learning from Video.[pdf][Project]
Self-Supervised Deep Reinforcement Learning with Generalized Computation Graphs for Robot Navigation.[pdf]
Learning Actionable Representations from Visual Observations.[pdf][Project]
Learning Synergies between Pushing and Grasping with Self-supervised Deep Reinforcement Learning.[pdf][Project]
Visual Reinforcement Learning with Imagined Goals.[pdf][Project]
Grasp2Vec: Learning Object Representations from Self-Supervised Grasping.[pdf][Project]
Robustness via Retrying: Closed-Loop Robotic Manipulation with Self-Supervised Learning.[pdf][Project]
Learning Long-Range Perception Using Self-Supervision from Short-Range Sensors and Odometry.[pdf]
Learning Latent Plans from Play.[pdf][Project]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.[pdf][link]
Self-Supervised Dialogue Learning[pdf]
Self-Supervised Learning for Contextualized Extractive Summarization[pdf]
A Mutual Information Maximization Perspective of Language Representation Learning[pdf]
VL-BERT: Pre-training of Generic Visual-Linguistic Representations[pdf][code]
Learning Robust and Multilingual Speech Representations[pdf]
Unsupervised pretraining transfers well across languages[pdf][code]
wav2vec: Unsupervised Pre-Training for Speech Recognition[pdf][code]
vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations[pdf]
Effectiveness of self-supervised pre-training for speech recognition[pdf]
Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning[pdf]
Self-Training for End-to-End Speech Recognition[pdf]
Generative Pre-Training for Speech with Autoregressive Predictive Coding[pdf][code]
To the extent possible under law, Zhongzheng Ren has waived all copyright and related or neighboring rights to this work.
Contents
Article title
NLP Transfer Learning In 3 Steps
Article summary
BERT (Devlin et al., 2018) is perhaps the most popular approach to NLP transfer learning. Hugging Face's implementation offers many nice features and abstracts the details behind a beautiful API. PyTorch Lightning is a lightweight framework (really more of a refactoring of PyTorch code) that lets anyone using PyTorch, such as students, researchers, and production teams, easily scale deep learning code while keeping it reproducible. It also provides 42+ advanced research features via trainer flags. Lightning adds no abstraction on top of PyTorch, which means it plays well with other great packages like Hugging Face! In this tutorial, we use their BERT implementation to perform a fine-tuning task in Lightning. We do transfer learning for NLP in 3 steps: import BERT from the huggingface library; create a LightningModule that fine-tunes on features extracted by BERT; train the BertMNLIFinetuner with the Lightning Trainer.
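A condensed sketch of the three steps is given below. The class name `BertMNLIFinetuner` follows the tutorial, but the body is an abbreviated assumption rather than a verbatim copy, and the MNLI data loader is left as a placeholder.

```python
import torch
import pytorch_lightning as pl
from transformers import BertModel

class BertMNLIFinetuner(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Step 1: import a pre-trained BERT from the huggingface library.
        self.bert = BertModel.from_pretrained('bert-base-cased')
        self.classifier = torch.nn.Linear(self.bert.config.hidden_size, 3)  # MNLI has 3 classes

    def forward(self, input_ids, attention_mask):
        # Step 2: fine-tune on features extracted by BERT (pooled [CLS] output).
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(out.pooler_output)

    def training_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        logits = self(input_ids, attention_mask)
        return torch.nn.functional.cross_entropy(logits, labels)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=2e-5)

# Step 3: train with the Lightning Trainer. `train_loader` is assumed to
# yield (input_ids, attention_mask, label) batches from MNLI.
# trainer = pl.Trainer(max_epochs=1)
# trainer.fit(BertMNLIFinetuner(), train_loader)
```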
Article author
William Falcon, PhD student in AI (NYU, Facebook AI Research). He has recently been working on pre-trained models for natural language, the area of the biggest recent breakthroughs. He advocates machine learning that is practice- and application-oriented and aimed at solving current problems, and argues that AI must be commercially driven to develop sustainably over the long term.
Stabilizing Transformers for Reinforcement Learning. E. Parisotto, H. F. Song, J. W. Rae, R. Pascanu, C. Gulcehre, S. M. Jayakumar, M. Jaderberg, R. L. Kaufman, A. Clark, S. Noury, M. M. Botvinick, N. Heess, R. Hadsell [DeepMind] (2019)