No one has successfully defined life but the properties we often associate with living things are motility, metabolism and self-replication. According to the Nobel Laureate Richard Feynman: "What I can't create, I don't understand". We thought we'd give it a shot – understanding life – and in the process we've made two different systems, one that exhibits both autonomous motility and metabolism and another which is the first artificial system which can replicate arbitrarily designed motifs.
This talk presents recent work from CVPR that looks at inference for pairwise CRF models in the highly (or fully) connected case rather than simply a sparse set of neighbours used ubiquitously in many computer vision tasks. Recent work has shown that fully-connected CRFs, where each node is connected to every other node, can be solved very efficiently under the restriction that the pairwise term is a Gaussian kernel over a Euclidean feature space. The method presented generalises this model to allow arbitrary, non-parametric models (which can be learnt from training data and conditioned on test data) to be used for the pairwise potentials. This greatly increases the expressive power of such models whilst maintaining efficient inference.
In this talk I will detail methods for simultaneous 2D/3D segmentation, tracking and reconstruction which incorporate high level shape information. I base my work on the assumption that the space of possible 2D object shapes can be either generated by projecting down known rigid 3D shapes or learned from 2D shape examples. I minimise the discrimination between statistical foreground and background appearance models with respect to the parameters governing the shape generative process (the 6 degree-of-freedom 3D pose of the 3D shape or the parameters of the learned space). The foreground region is delineated by the zero level set of a signed distance function, and I define an energy over this region and its immediate background surroundings based on pixel-wise posterior membership probabilities. I obtain the differentials of this energy with respect to the parameters governing shape and conduct searches for the correct shape using standard non-linear minimisation techniques. This methodology first leads to a novel rigid 3D object tracker. For a known 3D shape, the optimisation here aims to find the 3D pose that leads to the 2D projection that best segments a given image. I also extend my approach to track multiple objects from multiple views and show how depth (such as may be available from a Kinect sensor) can be integrated in a straighforward manner. Next, I explore deformable 2D/3D object tracking. I use a non-linear and probabilistic dimensionality reduction, called Gaussian Process Latent Variable Models, to learn spaces of shape. Segmentation becomes a minimisation of an image-driven energy function in the learned space. I can represent both 2D and 3D shapes which I compress with Fourier-based transforms, to keep inference tractable. I extend this method by learning joint shape-parameter spaces, which, novel to the literature, enable simultaneous segmentation and generic parameter recovery. These can describe anything from 3D articulated pose to eye gaze. Finally, I will also be discussing various applications of the proposed techniques, ranging from (limited) articulated hand tracking to semantic SLAM.
Humans are very good at recognizing objects as well as the materials that they are made of. We can easily tell cheese from butter, silk from linen and snow from ice just by looking. Understanding material perception is important for many real-world applications. For instance, a robot cooking in the kitchen will benefit from the knowledge of material perception when deciding if food is cooked or raw. In this talk, I will present studies that are motivated by two important applications of material perception: online shopping and computer graphics (CG) rendering. First, I will discuss the image cues that allow humans to infer tactile and mechanical information about deformable materials. I will present an experiment in which subjects were asked to match their tactile and visual perception of fabrics. I will show that image cues such as 3D folds and color are important for predicting subjects' tactile perception. Not only do these findings have immediate practical implications (e.g., improving online shopping interfaces for fabrics), but they also have theoretical implications: image-based visual cues affect tactile perception. Second, I will present a project on the visual perception of translucent materials (e.g., wax, milk, and jade) using computer-rendered stimuli. Humans are very sensitive to subtle differences in translucency (e.g., baby skin vs. adult skin), however, it is difficult to render translucent materials realistically. I will show how we measured the perceptual dimensions of physical scattering parameter space and used those measurements to produce more realistic renderings of materials like marble and jade. Taken together, my findings highlight the importance of material perception in the real world, and demonstrate how human perception can contribute to applications in computer vision and graphics.
Sensors acquire an increasing amount of diverse information posing two challenges. Firstly, how can we efficiently deal with such a big amount of data and secondly, how can we benefit from this diversity? In this talk I will first present an approach to deal with large graphical models. The presented method distributes and parallelizes the computation and memory requirements while preserving convergence and optimality guarantees of existing inference and learning algorithms. I will demonstrate the effectiveness of the approach on stereo reconstruction from high-resolution imagery. In the second part I will present a unified framework for structured prediction with latent variables which includes hidden conditional random fields and latent structured support vector machines as special cases. This framework allows to linearly combine different sources of information and I will demonstrate its efficacy on the problem of estimating the 3D room layout given a single image. For the latter problem I will in a third part introduce a globally optimal yet efficient inference algorithm based on branch-and-bound.
Consumer level depth cameras such as Kinect have changed the landscape of 3D computer vision. In this talk we will discuss two approaches that both learn to directly infer correspondences between observed depth image pixels and 3D model points. These correspondences can then be used to drive an optimization of a generative model to explain the data. The first approach, the "Vitruvian Manifold", aims to fit an articulated 3D human model to a depth camera image, and extends our original Body Part Recognition algorithm used in Kinect. It applies a per-pixel regression forest to infer direct correspondences between image pixels and points on a human mesh model. This allows an efficient “one-shot” continuous optimization of the model parameters to recover the human pose. The second approach, "Scene Coordinate Regression", addresses the problem of camera pose relocalization. It uses a similar regression forest, but now aims to predict correspondences between observed image pixels and 3D world coordinates in an arbitrary 3D scene. These correspondences are again used to drive an efficient optimization of the camera pose to a highly accurate result from a single input frame.
Developing autonomous systems that are able to assist humans in everyday's tasks is one of the grand challenges in modern computer science. Notable examples are personal robotics for the elderly and people with disabilities, as well as autonomous driving systems which can help decrease fatalities caused by traffic accidents. In order to perform tasks such as navigation, recognition and manipulation of objects, these systems should be able to efficiently extract 3D knowledge of their environment. In this talk, I'll show how Markov random fields provide a great mathematical formalism to extract this knowledge. In particular, I'll focus on a few examples, i.e., 3D reconstruction, 3D layout estimation, 2D holistic parsing and object detection, and show representations and inference strategies that allow us to achieve state-of-the-art performance as well as several orders of magnitude speed-ups.