Computer Science - Computer Vision and Pattern Recognition Publications (50)


Computer Science - Computer Vision and Pattern Recognition Publications

Visual scene decomposition into semantic entities is one of the major challenges when creating a reliable object grasping system. Recently, we introduced a bottom-up hierarchical clustering approach which is able to segment objects and parts in a scene. In this paper, we introduce a transform from such a segmentation into a corresponding, hierarchical saliency function. Read More

The proposed Earth observation (EO) based value adding system (EO VAS), hereafter identified as AutoCloud+, consists of an innovative EO image understanding system (EO IUS) design and implementation capable of automatic spatial context sensitive cloud/cloud shadow detection in multi source multi spectral (MS) EO imagery, whether or not radiometrically calibrated, acquired by multiple platforms, either spaceborne or airborne, including unmanned aerial vehicles (UAVs). It is worth mentioning that the same EO IUS architecture is suitable for a large variety of EO based value adding products and services, including: (i) low level image enhancement applications, such as automatic MS image topographic correction, co registration, mosaicking and compositing, (ii) high level MS image land cover (LC) and LC change (LCC) classification and (iii) content based image storage/retrieval in massive multi source EO image databases (big data mining). Read More

We introduce a library of geometric voxel features for CAD surface recognition/retrieval tasks. Our features include local versions of the intrinsic volumes (the usual 3D volume, surface area, integrated mean and Gaussian curvature) and a few closely related quantities. We also compute Haar wavelet and statistical distribution features by aggregating raw voxel features. Read More

The Aduio-visual Speech Recognition (AVSR) which employs both the video and audio information to do Automatic Speech Recognition (ASR) is one of the application of multimodal leaning making ASR system more robust and accuracy. The traditional models usually treated AVSR as inference or projection but strict prior limits its ability. As the revival of deep learning, Deep Neural Networks (DNN) becomes an important toolkit in many traditional classification tasks including ASR, image classification, natural language processing. Read More

This paper proposes a novel method to optimize bandwidth usage for object detection in critical communication scenarios. We develop two operating models of active information seeking. The first model identifies promising regions in low resolution imagery and progressively requests higher resolution regions on which to perform recognition of higher semantic quality. Read More

The robustness and security of the biometric watermarking approach can be improved by using a multiple watermarking. This multiple watermarking proposed for improving security of biometric features and data. When the imposter tries to create the spoofed biometric feature, the invisible biometric watermark features can provide appropriate protection to multimedia data. Read More

Current self-driving car systems operate well in sunny weather but struggle in adverse conditions. One of the most commonly encountered adverse conditions involves water on the road caused by rain, sleet, melting snow or flooding. While some advances have been made in using conventional RGB camera and LIDAR technology for detecting water hazards, other sources of information such as polarization offer a promising and potentially superior approach to this problem in terms of performance and cost. Read More

We study characteristics of receptive fields of units in deep convolutional networks. The receptive field size is a crucial issue in many visual tasks, as the output must respond to large enough areas in the image to capture information about large objects. We introduce the notion of an effective receptive field, and show that it both has a Gaussian distribution and only occupies a fraction of the full theoretical receptive field. Read More

We present a novel algorithm for light source estimation in scenes reconstructed with a RGB-D camera based on an analytically-derived formulation of path-tracing. Our algorithm traces the reconstructed scene with a custom path-tracer and computes the analytical derivatives of the light transport equation from principles in optics. These derivatives are then used to perform gradient descent, minimizing the photometric error between one or more captured reference images and renders of our current lighting estimation using an environment map parameterization for light sources. Read More

Deep neural networks have recently achieved significant progress. Sharing trained models of these deep neural networks is very important in the rapid progress of researching or developing deep neural network systems. At the same time, it is necessary to protect the rights of shared trained models. Read More

Tensor principal component analysis (TPCA) is a multi-linear extension of principal component analysis which converts a set of correlated measurements into several principal components. In this paper, we propose a new robust TPCA method to extract the princi- pal components of the multi-way data based on tensor singular value decomposition. The tensor is split into a number of blocks of the same size. Read More

In conventional sparse representations based dictionary learning algorithms, initial dictionaries are generally assumed to be proper representatives of the system at hand. However, this may not be the case, especially in some systems restricted to random initializations. Therefore, a supposedly optimal state-update based on such an improper model might lead to undesired effects that will be conveyed to successive iterations. Read More

Breast cancer is becoming increasing pervasive, and its early detection is an important step in saving life of any patient. Mammography is an important tool in breast cancer diagnosis. The most important step here is classification of mammogram patch as normal-abnormal and benign-malignant. Read More

We describe a framework to build distances by measuring the tightness of inequalities, and introduce the notion of proper statistical divergences and improper pseudo-divergences. We then consider the H\"older ordinary and reverse inequalities, and present two novel classes of H\"older divergences and pseudo-divergences that both encapsulate the special case of the Cauchy-Schwarz divergence. We report closed-form formulas for those statistical dissimilarities when considering distributions belonging to the same exponential family provided that the natural parameter space is a cone (e. Read More

In recent years, there has been renewed interest in developing methods for skeleton-based human action recognition. A skeleton sequence can be naturally represented as a high-order tensor time series. In this paper, we model and analyze tensor time series with Linear Dynamical System (LDS) which is the most common for encoding spatio-temporal time-series data in various disciplines dut to its relative simplicity and efficiency. Read More

Recent advances in using quantitative ultrasound (QUS) methods have provided a promising framework to non-invasively and inexpensively monitor or predict the effectiveness of therapeutic cancer responses. One of the earliest steps in using QUS methods is contouring a region of interest (ROI) inside the tumour in ultrasound B-mode images. While manual segmentation is a very time-consuming and tedious task for human experts, auto-contouring is also an extremely difficult task for computers due to the poor quality of ultrasound B-mode images. Read More

This paper describes the development of a novel algorithm to tackle the problem of real-time video stabilization for unmanned aerial vehicles (UAVs). There are two main components in the algorithm: (1) By designing a suitable model for the global motion of UAV, the proposed algorithm avoids the necessity of estimating the most general motion model, projective transformation, and considers simpler motion models, such as rigid transformation and similarity transformation. (2) To achieve a high processing speed, optical-flow based tracking is employed in lieu of conventional tracking and matching methods used by state-of-the-art algorithms. Read More

This paper aims to develop a novel cost-effective framework for face identification, which progressively maintains a batch of classifiers with the increasing face images of different individuals. By naturally combining two recently rising techniques: active learning (AL) and self-paced learning (SPL), our framework is capable of automatically annotating new instances and incorporating them into training under weak expert re-certification. We first initialize the classifier using a few annotated samples for each individual, and extract image features using the convolutional neural nets. Read More

Recent successes in learning-based image classification, however, heavily rely on the large number of annotated training samples, which may require considerable human efforts. In this paper, we propose a novel active learning framework, which is capable of building a competitive classifier with optimal feature representation via a limited amount of labeled training instances in an incremental learning manner. Our approach advances the existing active learning methods in two aspects. Read More

Convolutional neural nets (CNNs) have become a practical means to perform vision tasks, particularly in the area of image classification. FPGAs are well known to be able to perform convolutions efficiently, however, most recent efforts to run CNNs on FPGAs have shown limited advantages over other devices such as GPUs. Previous approaches on FPGAs have often been memory bound due to the limited external memory bandwidth on the FPGA device. Read More

We consider generation and comprehension of natural language referring expression for objects in an image. Unlike generic "image captioning" which lacks natural standard evaluation criteria, quality of a referring expression may be measured by the receiver's ability to correctly infer which object is being described. Following this intuition, we propose two approaches to utilize models trained for comprehension task to generate better expressions. Read More

In this paper, we propose a new joint dictionary learning method for example-based image super-resolution (SR), using sparse representation. The low-resolution (LR) dictionary is trained from a set of LR sample image patches. Using the sparse representation coefficients of these LR patches over the LR dictionary, the high-resolution (HR) dictionary is trained by minimizing the reconstruction error of HR sample patches. Read More

Binarized neural networks (BNNs) are gaining interest in the deep learning community due to their significantly lower computational and memory cost. They are particularly well suited to reconfigurable logic devices, which contain an abundance of fine-grained compute resources and can result in smaller, lower power implementations, or conversely in higher classification rates. Towards this end, the Finn framework was recently proposed for building fast and flexible field programmable gate array (FPGA) accelerators for BNNs. Read More

Edge detection is a classic problem in the field of image processing, which lays foundations for other tasks such as image segmentation. Conventionally, this operation is performed using gradient operators such as the Roberts or Sobel operator, which can discover local changes in intensity levels. These operators, however, perform poorly on low contrast images. Read More

The increasing prevalence of diet-related chronic diseases coupled with the ineffectiveness of traditional diet management methods have resulted in a need for novel tools to accurately and automatically assess meals. Recently, computer vision based systems that use meal images to assess their content have been proposed. Food portion estimation is the most difficult task for individuals assessing their meals and it is also the least studied area. Read More

This paper presents a novel mathematical framework for representing uncertainty in large deformation diffeomorphic image registration. The Bayesian posterior distribution over the deformations aligning a moving and a fixed image is approximated via a variational formulation. A stochastic differential equation (SDE) modeling the deformations as the evolution of a time-varying velocity field leads to a prior density over deformations in the form of a Gaussian process. Read More

Training of Convolutional Neural Networks (CNNs) on long video sequences is computationally expensive due to the substantial memory requirements and the massive number of parameters that deep architectures demand. Early fusion of video frames is thus a standard technique, in which several consecutive frames are first agglomerated into a compact representation, and then fed into the CNN as an input sample. For this purpose, a summarization approach that represents a set of consecutive RGB frames by a single dynamic image to capture pixel dynamics is proposed recently. Read More

Atmosphere light value is a highly critical parameter in defogging algorithms that are based on an atmosphere scattering model. Any error in atmosphere light value will produce a direct impact on the accuracy of scattering computation and thus bring chromatic distortion to restored images. To address this problem, this paper propose a method that relies on clustering statistics to estimate atmosphere light value. Read More

Re-identification is generally carried out by encoding the appearance of a subject in terms of outfit, suggesting scenarios where people do not change their attire. In this paper we overcome this restriction, by proposing a framework based on a deep convolutional neural network, SOMAnet, that additionally models other discriminative aspects, namely, structural attributes of the human figure (e.g. Read More

Structural learning, a method to estimate the parameters for discrete energy minimization, has been proven to be effective in solving computer vision problems, especially in 3D scene parsing. As the complexity of the models increases, structural learning algorithms turn to approximate inference to retain tractability. Unfortunately, such methods often fail because the approximation can be arbitrarily poor. Read More

Currently successful methods for video description are based on encoder-decoder sentence generation using recur-rent neural networks (RNNs). Recent work has shown the advantage of integrating temporal and/or spatial attention mechanisms into these models, in which the decoder net-work predicts each word in the description by selectively giving more weight to encoded features from specific time frames (temporal attention) or to features from specific spatial regions (spatial attention). In this paper, we propose to expand the attention model to selectively attend not just to specific times or spatial regions, but to specific modalities of input such as image features, motion features, and audio features. Read More

We present a loss function which can be viewed as a generalization of many popular loss functions used in robust statistics: the Cauchy/Lorentzian, Welsch, and generalized Charbonnier loss functions (and by transitivity the L2, L1, L1-L2, and pseudo-Huber/Charbonnier loss functions). We describe and visualize this loss, and document several of its useful properties. Read More

Convolutional neural networks have been applied to a wide variety of computer vision tasks. Recent advances in semantic segmentation have enabled their application to medical image segmentation. While most CNNs use two-dimensional kernels, recent CNN-based publications on medical image segmentation featured three-dimensional kernels, allowing full access to the three-dimensional structure of medical images. Read More

Limited annotated data available for the recognition of facial expression and action units embarrasses the training of deep networks, which can learn disentangled invariant features. However, a linear model with just several parameters normally is not demanding in terms of training data. In this paper, we propose an elegant linear model to untangle confounding factors in challenging realistic multichannel signals such as 2D face videos. Read More

We propose an image smoothing approximation and intrinsic image decomposition method based on a modified convolutional neural network architecture applied directly to the original color image. Our network has a very large receptive field equipped with at least 20 convolutional layers and 8 residual units. When training such a deep model however, it is quite difficult to generate edge-preserving images without undesirable color differences. Read More

The retina is a complex nervous system which encodes visual stimuli before higher order processing occurs in the visual cortex. In this study we evaluated whether information about the stimuli received by the retina can be retrieved from the firing rate distribution of Retinal Ganglion Cells (RGCs), exploiting High-Density 64x64 MEA technology. To this end, we modeled the RGC population activity using mean-covariance Restricted Boltzmann Machines, latent variable models capable of learning the joint distribution of a set of continuous observed random variables and a set of binary unobserved random units. Read More

This paper studies the problem of multivariate linear regression where a portion of the observations is grossly corrupted or is missing, and the magnitudes and locations of such occurrences are unknown in priori. To deal with this problem, we propose a new approach by explicitly consider the error source as well as its sparseness nature. An interesting property of our approach lies in its ability of allowing individual regression output elements or tasks to possess their unique noise levels. Read More

We introduce a technique to produce discriminative context-aware image captions (captions that describe differences between images or visual concepts) using only generic context-agnostic training data (captions that describe a concept or an image in isolation). For example, given images and captions of "siamese cat" and "tiger cat", our system generates language that describes the "siamese cat" in a way that distinguishes it from "tiger cat". We start with a generic language model that is context-agnostic and add a listener to discriminate between closely-related concepts. Read More

Despite significant progress, image saliency detection still remains a challenging task in complex scenes and environments. Integrating multiple different but complementary cues, like RGB and Thermal (RGB-T), may be an effective way for boosting saliency detection performance. The current research in this direction, however, is limited by the lack of a comprehensive benchmark. Read More

Learning to hash plays a fundamentally important role in the efficient image and video retrieval and many other computer vision problems. However, due to the binary outputs of the hash functions, the learning of hash functions is very challenging. In this paper, we propose a novel approach to learn stochastic hash functions such that the learned hashing codes can be used to regenerate the inputs. Read More

During the last decades, the number of new full-reference image quality assessment algorithms has been increasing drastically. Yet, despite of the remarkable progress that has been made, the medical ultrasound image similarity measurement remains largely unsolved due to a high level of speckle noise contamination. Potential applications of the ultrasound image similarity measurement seem evident in several aspects. Read More

Humans have rich understanding of liquid containers and their contents; for example, we can effortlessly pour water from a pitcher to a cup. Doing so requires estimating the volume of the cup, approximating the amount of water in the pitcher, and predicting the behavior of water when we tilt the pitcher. Very little attention in computer vision has been made to liquids and their containers. Read More

Understanding what visual features and representations contribute to human object recognition may provide scaffolding for more effective artificial vision systems. While recent advances in Deep Convolutional Networks (DCNs) have led to systems approaching human accuracy, it is unclear if they leverage the same visual features as humans for object recognition. We introduce Clicktionary, a competitive web-based game for discovering features that humans use for object recognition: One participant from a pair sequentially reveals parts of an object in an image until the other correctly identifies its category. Read More

It's useful to automatically transform an image from its original form to some synthetic form (style, partial contents, etc.), while keeping the original structure or semantics. We define this requirement as the "image-to-image translation" problem, and propose a general approach to achieve it, based on deep convolutional and conditional generative adversarial networks (GANs), which has gained a phenomenal success to learn mapping images from noise input since 2014. Read More

This paper reviews the historic of ChaLearn Looking at People (LAP) events. We started in 2011 (with the release of the first Kinect device) to run challenges related to human action/activity and gesture recognition. Since then we have regularly organized events in a series of competitions covering all aspects of visual analysis of humans. Read More

Could we use Computer Vision in the Internet of Things for using pictures as sensors? This is the principal hypothesis that we want to resolve. Currently, in order to create safety areas, cities, or homes, people use IP cameras. Nevertheless, this system needs people who watch the camera images, watch the recording after something occurred, or watch when the camera notifies them of any movement. Read More

In this paper we propose a method for logo recognition using deep learning. Our recognition pipeline is composed of a logo region proposal followed by a Convolutional Neural Network (CNN) specifically trained for logo classification, even if they are not precisely localized. Experiments are carried out on the FlickrLogos-32 database, and we evaluate the effect on recognition performance of synthetic versus real data augmentation, and image pre-processing. Read More

Current methods for statistical analysis of neuroimaging data identify condition related structural alterations in the human brain by detecting group differences. They construct detailed maps showing population-wide changes due to a condition of interest. Although extremely useful, methods do not provide information on the subject-specific structural alterations and they have limited diagnostic value because group assignments for each subject are required for the analysis. Read More

We propose a novel image set classification technique using linear regression models. Downsampled gallery image sets are interpreted as subspaces of a high dimensional space to avoid the computationally expensive training step. We estimate regression models for each test image using the class specific gallery subspaces. Read More

Multi-task learning (MTL) involves the simultaneous training of two or more related tasks over shared representations. In this work, we apply MTL to audio-visual automatic speech recognition(AV-ASR). Our primary task is to learn a mapping between audio-visual fused features and frame labels obtained from acoustic GMM/HMM model. Read More