Cognitive Mapping and Planning for Visual Navigation

We introduce a neural architecture for navigation in novel environments. Our proposed architecture learns to map from first-person viewpoints and plans a sequence of actions towards goals in the environment. The Cognitive Mapper and Planner (CMP) is based on two key ideas: a) a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the planner, and b) a spatial memory with the ability to plan given an incomplete set of observations about the world. CMP constructs a top-down belief map of the world and applies a differentiable neural net planner to produce the next action at each time step. The accumulated belief of the world enables the agent to track visited regions of the environment. Our experiments demonstrate that CMP outperforms both reactive strategies and standard memory-based architectures and performs well in novel environments. Furthermore, we show that CMP can also achieve semantically specified goals, such as 'go to a chair'.

Comments: Under review for CVPR 2017. Project webpage: https://sites.google.com/view/cognitive-mapping-and-planning/

Similar Publications

This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 64k movie clips with actions localized in space and time, resulting in 197k action labels with multiple labels per human occurring frequently. The main differences with existing video datasets are: (1) the definition of atomic visual actions, which avoids collecting data for each and every complex action; (2) precise spatio-temporal annotations with possibly multiple annotations for each human; (3) the use of diverse, realistic video material (movies). Read More


Generic text embeddings are successfully used in a variety of tasks. However, they are often learnt by capturing the co-occurrence structure from pure text corpora, resulting in limitations of their ability to generalize. In this paper, we explore models that incorporate visual information into the text representation. Read More


We present a powerful method to extract per-point semantic class labels from aerialphotogrammetry data. Labeling this kind of data is important for tasks such as environmental modelling, object classification and scene understanding. Unlike previous point cloud classification methods that rely exclusively on geometric features, we show that incorporating color information yields a significant increase in accuracy in detecting semantic classes. Read More


Evaluating expression of the Human epidermal growth factor receptor 2 (Her2) by visual examination of immunohistochemistry (IHC) on invasive breast cancer (BCa) is a key part of the diagnostic assessment of BCa due to its recognised importance as a predictive and prognostic marker in clinical practice. However, visual scoring of Her2 is subjective and consequently prone to inter-observer variability. Given the prognostic and therapeutic implications of Her2 scoring, a more objective method is required. Read More


This paper proposes a novel formulation for the multi-object tracking-by-detection paradigm for two (or more) input detectors. Using full-body and heads detections, the fusion helps to recover heavily occluded persons and to reduce false positives. The assignment of the two input features to a person and the extraction of the trajectories is commonly solved from one binary quadratic program (BQP). Read More


We address the problem of estimating image difficulty defined as the human response time for solving a visual search task. We collect human annotations of image difficulty for the PASCAL VOC 2012 data set through a crowd-sourcing platform. We then analyze what human interpretable image properties can have an impact on visual search difficulty, and how accurate are those properties for predicting difficulty. Read More


Shot boundary detection (SBD) is an important component of many video analysis tasks, such as action recognition, video indexing, summarization and editing. Previous work typically used a combination of low-level features like color histograms, in conjunction with simple models such as SVMs. Instead, we propose to learn shot detection end-to-end, from pixels to final shot boundaries. Read More


Salient object detection has increasingly become a popular topic in cognitive and computational sciences, including computer vision and artificial intelligence research. In this paper, we propose integrating \textit{semantic priors} into the salient object detection process. Our algorithm consists of three basic steps. Read More


We propose a novel framework for abnormal event detection in video that requires no training sequences. Our framework is based on unmasking, a technique previously used for authorship verification in text documents, which we adapt to our task. We iteratively train a binary classifier to distinguish between two consecutive video sequences while removing at each step the most discriminant features. Read More


Domain adaptation techniques address the problem of reducing the sensitivity of machine learning methods to the so-called domain shift, namely the difference between source (training) and target (test) data distributions. In particular, unsupervised domain adaptation assumes no labels are available in the target domain. To this end, aligning second order statistics (covariances) of target and source domains have proven to be an effective approach ti fill the gap between the domains. Read More