Unsupervised Visual Representation Learning by Context Prediction

This work explores the use of spatial context as a source of free and plentiful supervisory signal for training a rich visual representation. Given only a large, unlabeled image collection, we extract random pairs of patches from each image and train a convolutional neural net to predict the position of the second patch relative to the first. We argue that doing well on this task requires the model to learn to recognize objects and their parts. We demonstrate that the feature representation learned using this within-image context indeed captures visual similarity across images. For example, this representation allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset. Furthermore, we show that the learned ConvNet can be used in the R-CNN framework and provides a significant boost over a randomly-initialized ConvNet, resulting in state-of-the-art performance among algorithms which use only Pascal-provided training set annotations.

Comments: Oral paper at ICCV 2015

Similar Publications

Softmax loss is widely used in deep neural networks for multi-class classification, where each class is represented by a weight vector, a sample is represented as a feature vector, and the feature vector has the largest projection on the weight vector of the correct category when the model correctly classifies a sample. To ensure generalization, weight decay that shrinks the weight norm is often used as regularizer. Different from traditional learning algorithms where features are fixed and only weights are tunable, features are also tunable as representation learning in deep learning. Read More

Many model-based Visual Odometry (VO) algorithms have been proposed in the past decade, often restricted to the type of camera optics, or the underlying motion manifold observed. We envision robots to be able to learn and perform these tasks, in a minimally supervised setting, as they gain more experience. To this end, we propose a fully trainable solution to visual ego-motion estimation for varied camera optics. Read More

Person recognition methods that use multiple body regions have shown significant improvements over traditional face-based recognition. One of the primary challenges in full-body person recognition is the extreme variation in pose and view point. In this work, (i) we present an approach that tackles pose variations utilizing multiple models that are trained on specific poses, and combined using pose-aware weights during testing. Read More

For crowded scenes, the accuracy of object-based computer vision methods declines when the images are low-resolution and objects have severe occlusions. Taking counting methods for example, almost all the recent state-of-the-art counting methods bypass explicit detection and adopt regression-based methods to directly count the objects of interest. Among regression-based methods, density map estimation, where the number of objects inside a subregion is the integral of the density map over that subregion, is especially promising because it preserves spatial information, which makes it useful for both counting and localization (detection and tracking). Read More

A routine task for art historians is painting diagnostics, such as dating or attribution. Signal processing of the X-ray image of a canvas provides useful information about its fabric. However, previous methods may fail when very old and deteriorated artworks or simply canvases of small size are studied. Read More

Given the recent successes of deep learning applied to style transfer and texture synthesis, we propose a new theoretical framework to construct visual metamers: \textit{a family of perceptually identical, yet physically different images}. We review work both in neuroscience related to metameric stimuli, as well as computer vision research in style transfer. We propose our NeuroFovea metamer model that is based on a mixture of peripheral representations and style transfer forward-pass algorithms for \emph{any} image from the recent work of Adaptive Instance Normalization (Huang~\&~Belongie). Read More

Part-based representation has been proven to be effective for a variety of visual applications. However, automatic discovery of discriminative parts without object/part-level annotations is challenging. This paper proposes a discriminative mid-level representation paradigm based on the responses of a collection of part detectors, which only requires the image-level labels. Read More

Coded Aperture has been used for recovering all in focus image and a layered depth map simultaneously using a single captured image. The non trivial task of finding an optimal code has driven researchers to make some simplifying assumptions over image distribution which may not hold under all practical scenarios. In this work we propose a data driven approach to find the optimal code for depth recovery. Read More

We propose an effective online background subtraction method, which can be robustly applied to practical videos that have variations in both foreground and background. Different from previous methods which often model the foreground as Gaussian or Laplacian distributions, we model the foreground for each frame with a specific mixture of Gaussians (MoG) distribution, which is updated online frame by frame. Particularly, our MoG model in each frame is regularized by the learned foreground/background knowledge in previous frames. Read More

Many inverse problems involve two or more sets of variables that represent different physical quantities but are tightly coupled with each other. For example, image super-resolution requires joint estimation of image and motion parameters from noisy measurements. Exploiting this structure is key for efficiently solving large-scale problems to avoid, e. Read More