[Self-Supervised Series] A First Exploration of Pixel-Level Self-Supervised Tasks

Title: Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning

Authors: Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, Han Hu

Affiliations: Tsinghua University, Xi'an Jiaotong University, Microsoft Research Asia

Published: arXiv, 2020

Keywords: self-supervised learning, pixel-level consistency

One-sentence summary: The paper designs pixel-level self-supervised learning in two parts: (1) spatially close pixels are treated as positive pairs for contrastive learning, which preserves spatial sensitivity; (2) a non-local layer with a consistency loss preserves spatial smoothness.

0. Abstract

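Task: self-supervised learning, contrastive learning

Problem:

Current contrastive learning methods are mainly built on instance-level pretext tasks, which yield features that are suboptimal for dense downstream tasks.

Solution:

The paper proposes pixel-level pretext tasks for learning dense feature representations.

According to the paper, it is the first model to apply contrastive learning at the pixel level.

It proposes a pixel-to-propagation consistency task that, compared with instance-level contrastive learning, performs well in the detection and segmentation experiments. The results indicate that pixel-level self-supervision is effective not only for pre-training a regular backbone network but also for pre-training the head networks used in dense downstream tasks, and that it complements instance-level contrastive methods.

Defining pretext tasks at the pixel level has real potential!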

1. Introduction

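Problem:

Current contrastive learning operates at the instance level and discriminates between different images: "In computer vision, recent advances can largely be ascribed to the use of a pretext task called instance discrimination, which treats each image in a training set as a single class and aims to learn a feature representation that discriminates among all the classes." The authors argue that instance-level pre-training may be useful for image classification, but when the downstream task is fine-grained, such as segmentation, the features it learns lack spatial correlation. That is indeed the case.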

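Solution:

Construct a pixel-level contrastive learning task: treat every pixel in an image as a single class, with the goal of distinguishing each pixel from the other pixels in the same image. How are the features for one pixel extracted? By taking two random image crops that both contain the pixel. The two resulting features are different representations of the same pixel and form a positive pair, while features from different pixels form negative pairs. A contrastive loss is computed on these pairs, which the paper calls the PixContrast loss.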

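Building on pixel-to-propagation consistency, the paper proposes a second method in which positive pairs are obtained by extracting features for the same pixel through two asymmetric pipelines. The first pipeline is a standard backbone plus projection head. The second is similar but adds a pixel propagation module, which filters a pixel's features by propagating the features of similar pixels to it. The module has a smoothing effect: pixels with similar features exchange information, which is really just averaging and looks a lot like a GCN. The paper calls this method PixPro, a consistency-based pretext task.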

Unlike the contrastive PixContrast, this method needs no negative pairs and focuses only on consistency. In the experiments, PixPro outperforms PixContrast!

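The authors argue that pixel-level methods complement instance-level ones: instance-level methods are good at learning categorical information and holistic features, whereas pixel-level methods are good at learning spatially sensitive representations. Combining the two works best and is computationally efficient, because the two tasks share one backbone encoder.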

The related work section also covers Instance Discrimination and Other Pretext Tasks, and is worth consulting.

2. Method

2.1 Pixel-level Contrastive Learning

The earlier contrastive loss is applied at the pixel level and named PixContrast.

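A regular encoder network: a backbone network (ResNet) plus a projection head network (two 1×1 convolution layers).

A momentum encoder network: the same architecture, with parameters kept as an exponential moving average of the regular encoder's parameters.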
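A momentum encoder is usually not trained by gradients; its weights track the regular encoder through an exponential moving average (EMA). The sketch below shows that update on plain Python lists. The function name `momentum_update` and the momentum value 0.99 are illustrative assumptions, not values taken from this post.

```python
def momentum_update(momentum_params, regular_params, m=0.99):
    """Return EMA-updated parameters: m * old + (1 - m) * new.

    Both arguments are flat lists of floats standing in for network weights.
    The momentum value m = 0.99 is an assumed placeholder.
    """
    return [m * p_k + (1.0 - m) * p_q
            for p_k, p_q in zip(momentum_params, regular_params)]


# Usage: the momentum copy drifts slowly toward the regular encoder.
regular = [1.0, -2.0]
momentum = [0.0, 0.0]
for _ in range(3):
    momentum = momentum_update(momentum, regular)
```

Because the momentum copy changes slowly, it provides a relatively stable target for the regular encoder during training.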

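Both networks output a feature map, which makes pixel-level contrastive learning straightforward. Each pixel of the two feature maps is first mapped back to the original image space, and then the distance between every pair of pixels across the two maps is computed.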
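The mapping step can be sketched as follows, assuming each crop is described by a box `(x0, y0, width, height)` in the original image and the encoder outputs a `grid_h` by `grid_w` feature map. The helper names `cell_centers` and `pairwise_distances` are hypothetical, and horizontal flips are ignored to keep the example short.

```python
import math


def cell_centers(crop_box, grid_h, grid_w):
    """Map the center of every feature-map cell back to image coordinates.

    crop_box is (x0, y0, width, height) of the crop in the original image.
    Returns a list of (x, y) tuples in row-major order.
    Horizontal flips are ignored here (a simplifying assumption).
    """
    x0, y0, w, h = crop_box
    bin_w, bin_h = w / grid_w, h / grid_h
    return [(x0 + (c + 0.5) * bin_w, y0 + (r + 0.5) * bin_h)
            for r in range(grid_h) for c in range(grid_w)]


def pairwise_distances(centers_a, centers_b):
    """Euclidean distance between every cell of view A and every cell of view B."""
    return [[math.hypot(ax - bx, ay - by) for bx, by in centers_b]
            for ax, ay in centers_a]


# Usage: two overlapping crops, each encoded to a 2x2 feature map.
view1 = cell_centers((0, 0, 100, 100), 2, 2)
view2 = cell_centers((50, 50, 100, 100), 2, 2)
dist = pairwise_distances(view1, view2)
```

In this toy setup the bottom-right cell of the first crop and the top-left cell of the second crop land on the same image location, so their distance is zero.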

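How is pixel contrast defined?

Distances are defined from spatial positions, positive and negative pairs are defined from those distances, and the loss is defined from the cosine similarity of the feature vectors.

The distances are normalized first: "The distances are normalized to the diagonal length of a feature map bin to account for differences in scale between the augmentation views." That is, each distance is divided by the diagonal length of one feature-map bin, which compensates for scale differences between the augmented views. I did not fully follow how the normalization is carried out. My guess is that the distance is purely spatial: pixels that are spatially close count as positive pairs, and pixels that are far apart count as negative pairs.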
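Under that reading, the pair assignment and the contrastive loss can be sketched together. This is a minimal sketch, not the paper's implementation: the threshold `thresh=0.7`, the temperature `tau=0.3`, the single `bin_diag` argument, and the function names are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pixcontrast_loss(feats_a, feats_b, dist, bin_diag, thresh=0.7, tau=0.3):
    """Average contrastive loss over pixels of view A that have a positive in view B.

    feats_a, feats_b: lists of feature vectors, one per feature-map cell.
    dist[i][j]: image-space distance between cell i of A and cell j of B.
    bin_diag: diagonal length of one feature-map bin, used for normalization.
    thresh and tau are assumed values, not taken from this post.
    """
    losses = []
    for i, x_i in enumerate(feats_a):
        pos, total = 0.0, 0.0
        for j, x_j in enumerate(feats_b):
            e = math.exp(cosine(x_i, x_j) / tau)
            total += e
            if dist[i][j] / bin_diag <= thresh:  # spatially close => positive
                pos += e
        if pos > 0.0:  # pixels without any positive partner are skipped
            losses.append(-math.log(pos / total))
    return sum(losses) / len(losses) if losses else 0.0


# Usage: cell 0 of A coincides with cell 0 of B, and cell 1 with cell 1.
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[1.0, 0.0], [0.0, 1.0]]
dist = [[0.0, 100.0], [100.0, 0.0]]
loss = pixcontrast_loss(feats_a, feats_b, dist, bin_diag=math.hypot(50, 50))
```

The loss is small here because each positive pair has identical features while each negative pair is orthogonal; it grows when features of spatially distant pixels become similar.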

2.2 Pixel-to-Propagation Consistency

The spatial sensitivity and spatial smoothness of a learnt representation may affect transfer performance on downstream tasks requiring dense prediction.

The former measures the ability to discriminate spatially close pixels, needed for accurate prediction in boundary areas where labels change.

Spatial sensitivity is about detail, high-frequency content, and differences.

The latter property encourages spatially close pixels to be similar, which can aid prediction in areas that belong to the same label.

Spatial smoothness is about intra-class consistency, much as in a GCN.

The authors argue that PixContrast only strengthens spatial sensitivity, so they design a pixel-to-propagation consistency (PPC) module to strengthen spatial smoothness.

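PPC has two parts:

1) A pixel propagation module: it effectively performs feature denoising/smoothing, in the same spirit as non-local means or a GCN.

2) An asymmetric architecture: one branch produces a regular feature map, and the other branch incorporates the pixel propagation module.

The task is to keep the feature maps produced by the two branches consistent.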

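Advantages:

1) The regular branch retains a degree of spatial sensitivity.

2) The loss is built on the consistency between the two branches, so no negative pairs are needed.

The so-called pixel propagation module is really just a non-local layer!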
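To make the non-local analogy concrete, here is a minimal sketch of such a propagation step. It assumes, as I understand the paper's formulation, that each output feature is a similarity-weighted sum over all pixels, with the similarity being the clipped cosine raised to a power `gamma`. The default `gamma=2.0`, the identity placeholder for the learned transform, and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pixel_propagation(feats, gamma=2.0, transform=None):
    """Compute y_i = sum_j s(x_i, x_j) * g(x_j), s = max(cos(x_i, x_j), 0) ** gamma.

    feats: list of feature vectors, one per pixel of the feature map.
    transform: stands in for the learned map g; identity when omitted
    (a placeholder assumption, a real module would learn it).
    """
    g = transform if transform is not None else (lambda v: v)
    values = [g(x) for x in feats]
    out = []
    for x_i in feats:
        y_i = [0.0] * len(values[0])
        for x_j, v_j in zip(feats, values):
            s = max(cosine(x_i, x_j), 0.0) ** gamma
            y_i = [acc + s * v for acc, v in zip(y_i, v_j)]
        out.append(y_i)
    return out


# Usage: pixels 0 and 1 are similar; pixel 2 is orthogonal to both.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
smoothed = pixel_propagation(feats)
```

Similar pixels pool their features while dissimilar ones contribute nothing, which is the smoothing behaviour described above.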

Pixel-to-Propagation Consistency Loss:

In the asymmetric architecture design, there are two different encoders: a regular encoder with the pixel propagation module applied afterwards to produce smoothed features, and a momentum encoder without the propagation module. The two augmentation views both pass through the two encoders, and the features from different encoders are encouraged to be consistent.
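The consistency objective can be written compactly. The form below follows my reading of the paper and should be checked against the original equation. Here $i$ and $j$ denote a positive pixel pair drawn from the two views, $y$ denotes features from the regular encoder followed by the propagation module, and $x'$ denotes features from the momentum encoder.

```latex
\mathcal{L}_{\text{PixPro}} = -\cos\left(y_i, x'_j\right) - \cos\left(y_j, x'_i\right)
```

The loss is averaged over all positive pairs, and no negative pairs appear in it.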

Differences between PPC and PixContrast:

1) PPC introduces a pixel propagation module (PPM).

2) PPC replaces the contrastive loss with a consistency loss.

2.3 Aligning Pre-training to Downstream Networks

Previous instance-level contrastive methods pre-train only the encoder. This work additionally pre-trains the FPN layers and the head networks, with a clear benefit.

2.4 Combined with Instance Contrast

By sharing the data loader and backbone encoders, instance contrast can be used at the same time: "the instance-level pretext task is applied on the output of the res5 stage, using projection heads that are independent of the pixel-level task. Here, we use a popular instance-level method, SimCLR [8], with a momentum encoder to be aligned with the pixel-level pretext task."

The two losses are simply added together!

3. Experiments

How is data augmentation done?

In pre-training, the data augmentation strategy follows [17], where two random crops from the image are independently sampled and resized to 224 × 224 with a random horizontal flip, followed by color distortion, Gaussian blur, and a solarization operation.

We skip the loss computation for cropped pairs with no overlaps, which compose only a small fraction of all the cropped pairs.

For crop pairs with no overlap, the loss computation is simply skipped!
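The overlap test itself is a simple rectangle intersection. The sketch below assumes crops are given as `(x0, y0, width, height)` boxes in the original image; the function name `crops_overlap` is hypothetical.

```python
def crops_overlap(box_a, box_b):
    """Return True when two crop boxes (x0, y0, width, height) share any area."""
    ax0, ay0, aw, ah = box_a
    bx0, by0, bw, bh = box_b
    overlap_w = min(ax0 + aw, bx0 + bw) - max(ax0, bx0)
    overlap_h = min(ay0 + ah, by0 + bh) - max(ay0, by0)
    return overlap_w > 0 and overlap_h > 0


# Usage: only overlapping pairs contribute to the pixel-level loss.
pairs = [((0, 0, 100, 100), (50, 50, 100, 100)),   # overlapping
         ((0, 0, 40, 40), (60, 60, 40, 40))]       # disjoint
kept = [pair for pair in pairs if crops_overlap(*pair)]
```

Boxes that only touch along an edge are treated as non-overlapping here, since they share no pixels.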

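Original source:

Most of the original post is in image form; the original version is available from the WeChat official account below.

Official account name: 計算機視覺與數字影象處理 (Computer Vision and Digital Image Processing)

WeChat ID: cv_and_dip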
