Imagine a futuristic version of Google Street View that could dial up any possible place in the world, at any possible time. Effectively, such a service would be a recording of the plenoptic function—the hypothetical function described by Adelson and Bergen that captures all light rays passing through space at all times. While the plenoptic function is completely impractical to capture in its totality, every photo ever taken represents a sample of this function. I will present recent methods we've developed to reconstruct the plenoptic function from sparse space-time samples of photos—including Street View itself, as well as tourist photos of famous landmarks. The results of this work include the ability to take a single photo and synthesize a full dawn-to-dusk timelapse video, as well as compelling 4D view synthesis capabilities where a scene can simultaneously be explored in space and time.
One of the most striking characteristics of human behavior in contrast to all other animals is that we show extraordinary variability across populations. Human cultural diversity is a biological oddity. More specifically, we propose that what makes humans unique is the nature of the individual ontogenetic process, which results in this unparalleled cultural diversity. Hence, our central question is: How is human ontogeny adapted to cultural diversity and how does it contribute to it? This question is critical, because cultural diversity not only constitutes our predominant mode of adaptation to local ecologies, but is also key in the construction of our cognitive architecture. The colors we see, the tones that we hear, the memories we form, the norms we adhere to are all the consequence of an interaction between our emerging cognitive system and our lived experiences. While psychologists make careers measuring cognitive systems, we are terrible at measuring experience, as are anthropologists, sociologists, etc. The standard methods all face insurmountable limitations. In our department, we hope to apply Machine Learning, Deep Learning and Computer Vision to automatically extract developmentally important indicators of humans’ daily experience. Similarly to the way that modern sequencing technologies allow us to study the human genotype at scale, applying AI methods to reliably quantify humans’ lived experience would allow us to study the human behavioral phenotype at scale, and fundamentally alter the science of human behavior and its application in education, mental health and medicine: The phenotyping revolution.
Organizers: Timo Bolkart
3D reconstruction from images has been a tremendous success story of computer vision, with city-scale reconstruction now a reality. However, these successes apply almost exclusively in a static world, where the only motion is that of the camera. Even with the advent of real-time depth cameras, full 3D modelling of dynamic scenes lags behind the rigid-scene case, and for many objects of interest (e.g. animals moving in natural environments), depth sensing remains challenging. In this talk, I will discuss a range of recent work in the modelling of nonrigid real-world 3D shape from 2D images, for example building generic animal models from internet photo collections. While the state of the art depends heavily on dense point tracks from textured surfaces, it is rare to find suitably textured surfaces: most animals are limited in texture (think of dogs, cats, cows, horses, …). I will show how this assumption can be relaxed by incorporating the strong constraints given by the object’s silhouette.
Interface Science. The last decade has witnessed explosive growth of research on various oxide heterostructures, and discoveries of exciting new interface phenomena. We may be witnessing the emergence of a new scientific discipline – Interface Science, delineated by a distinct new set of problems, techniques, phenomena, and theoretical concepts.
Significant progress has been made in recent years in estimating people's shape and motion from video; nonetheless, the problem remains unsolved. This is especially true in uncontrolled environments, such as streets or offices, where background clutter and occlusions make the problem even more challenging.
The goal of our research is to develop computational methods that enable human pose estimation from video and inertial sensors in indoor and outdoor environments. Specifically, I will focus on one of our past projects in which we introduce a hybrid Human Motion Capture system that combines video input with sparse inertial sensor input. Employing a particle-based optimization scheme, our idea is to use orientation cues derived from the inertial input to sample particles from the manifold of valid poses. Additionally, we introduce a novel sensor noise model, based on the von Mises-Fisher distribution, to account for uncertainties. In doing so, orientation constraints are naturally fulfilled and the number of needed particles can be kept very small. More generally, our method can be used to sample poses that fulfill arbitrary orientation or positional kinematic constraints. In the experiments, we show that our system can track even highly dynamic motions in an outdoor environment with changing illumination, background clutter, and shadows.
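As a rough illustration of sampling around a measured orientation with von Mises-Fisher noise, the sketch below draws unit direction vectors concentrated around a sensor-derived direction. It is only a minimal sketch under stated assumptions: the function name `sample_vmf_s2`, the concentration value, and the use of a single 3D direction in place of a full limb orientation are hypothetical and not taken from the system described in the talk.

```python
import numpy as np

def sample_vmf_s2(mu, kappa, n, rng):
    """Draw n unit vectors from a von Mises-Fisher distribution on the 2-sphere.

    mu is the mean direction, kappa > 0 the concentration (hypothetical interface).
    """
    mu = np.asarray(mu, dtype=float)
    mu = mu / np.linalg.norm(mu)
    # Cosine of the angle to mu via the closed-form inverse CDF (valid in 3D).
    u = rng.uniform(size=n)
    w = 1.0 + np.log(u + (1.0 - u) * np.exp(-2.0 * kappa)) / kappa
    # Uniform angle in the tangent plane orthogonal to mu.
    phi = rng.uniform(0.0, 2.0 * np.pi, size=n)
    r = np.sqrt(np.clip(1.0 - w**2, 0.0, None))
    # Orthonormal basis (e1, e2, mu); the helper axis is the one least aligned with mu.
    helper = np.eye(3)[np.argmin(np.abs(mu))]
    e1 = np.cross(mu, helper)
    e1 /= np.linalg.norm(e1)
    e2 = np.cross(mu, e1)
    return (r * np.cos(phi))[:, None] * e1 + (r * np.sin(phi))[:, None] * e2 + w[:, None] * mu

# Assumed example: a direction measured by an inertial sensor, with made-up concentration.
rng = np.random.default_rng(0)
measured = np.array([0.0, 1.0, 0.0])
particles = sample_vmf_s2(measured, kappa=50.0, n=200, rng=rng)
```

A larger `kappa` keeps the particles closer to the measured direction, which is one way to read the claim that few particles are needed when the orientation cue is reliable.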
There are an estimated 3.5 trillion photographs in the world, of which 10% have been taken in the past 12 months. Facebook alone reports 6 billion photo uploads per month. Every minute, 72 hours of video are uploaded to YouTube. Cisco estimates that in the next few years, visual data (photos and video) will account for over 85% of total internet traffic. Yet, we currently lack effective computational methods for making sense of this mass of visual data. Unlike easily indexed content, such as text, visual content is not routinely searched or mined; it's not even hyperlinked. Visual data is the Internet's "digital dark matter" [Perona, 2010] -- it's just sitting there!
In this talk, I will first discuss some of the unique challenges that make Big Visual Data difficult compared to other types of content. In particular, I will argue that the central problem is the lack of a good measure of similarity for visual data. I will then present some of our recent work that aims to address this challenge in the context of visual matching, image retrieval and visual data mining. As an application of the latter, we used Google Street View data for an entire city in an attempt to answer that age-old question which has been vexing poets (and poets-turned-geeks): "What makes Paris look like Paris?"
The interface between artificial and biological vision has long been an actively promoted area of research. It seems promising that cognitive science can provide new ideas to interface computer vision and human perception, yet no established design principles exist. In the first part of my talk I am going to introduce the novel concept of 'object detectability'. Object detectability refers to a measure of how likely a human observer is to be visually aware of the location and presence of specific object types in a complex, dynamic, urban scene.
We have shown a proof of concept of how to maximize human observers' scene awareness in a dynamic driving context. Nonlinear functions are learnt from experimental samples of a combined feature vector of human gaze and visual features mapping to object detectabilities. We obtain object detectabilities through a detection experiment, simulating a proxy task of distracted real-world driving. In order to specifically enhance overall pedestrian detectability in a dynamic scene, the sum of individual detectability predictors defines a complex cost function that we seek to optimize with respect to human gaze. Results show significantly increased human scene awareness in hazardous test situations when comparing optimized gaze with random fixation. Thus, our approach can potentially help a driver to save reaction time and resolve a risky maneuver. In our framework, the remarkable ability of the human visual system to detect specific objects in the periphery has been implicitly characterized by our perceptual detectability task and has thus been taken into account.
The framework may provide a foundation for future work to determine what kind of information a Computer Vision system should process reliably, e.g. certain pose or motion features, in order to optimally alert a driver in time-critical situations. Dynamic image data was taken from the Caltech Pedestrian database. I will conclude with a brief overview of recent work, including a new circular output random regression forest for continuous object viewpoint estimation and a novel learning-based, monocular odometry approach based on robust LVMs and sensorimotor learning, offering stable 3D information integration. Last but not least, I present results of a perception experiment to quantify emotion in estimated facial movement synergy components that can be exploited to control emotional content of 3D avatars in a perceptually meaningful way.
This work was done in particular with David Engel (now a Post-Doc at M.I.T.), Christian Herdtweck (a PhD student at MPI Biol. Cybernetics), and in collaboration with Prof. Martin A. Giese and Dr. Enrico Chiovetto, Center for Integrated Neuroscience, Tübingen.
We present a supervised-learning-based method to estimate a per-pixel confidence for optical flow vectors. Regions of low texture and pixels close to occlusion boundaries are known to be difficult for optical flow algorithms. Using a spatiotemporal feature vector, we estimate whether a flow algorithm is likely to fail in a given region.
Our method is not restricted to any specific class of flow algorithm, and does not make any scene-specific assumptions. By automatically learning this confidence we can combine the output of several computed flow fields from different algorithms to select the best performing algorithm per pixel. Our optical flow confidence measure allows one to achieve better overall results by discarding the most troublesome pixels. We illustrate the effectiveness of our method on four different optical flow algorithms over a variety of real and synthetic sequences. For algorithm selection, we achieve the top overall results on a large test set, and at times even surpass the results of the best algorithm among the candidates.
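The per-pixel selection step can be pictured with the following minimal sketch. It assumes the confidences have already been predicted by some learned model, which is not shown; the function name `select_flow_per_pixel`, the array layout, and the threshold `min_conf` are hypothetical choices for illustration rather than details from the talk.

```python
import numpy as np

def select_flow_per_pixel(flows, confidences, min_conf=0.0):
    """Fuse K candidate flow fields by picking the most confident one at each pixel.

    flows:       array of shape (K, H, W, 2), one flow field per algorithm
    confidences: array of shape (K, H, W), predicted confidence per algorithm and pixel
    Returns the fused flow (H, W, 2), the chosen algorithm index (H, W),
    and a boolean mask (H, W) of pixels whose best confidence reaches min_conf.
    """
    flows = np.asarray(flows, dtype=float)
    confidences = np.asarray(confidences, dtype=float)
    best = np.argmax(confidences, axis=0)
    # Gather the winning flow vector for every pixel.
    fused = np.take_along_axis(flows, best[None, :, :, None], axis=0)[0]
    keep = confidences.max(axis=0) >= min_conf
    return fused, best, keep

# Toy example with two algorithms on a 1x2 image (made-up numbers).
flows = np.stack([np.full((1, 2, 2), 1.0), np.full((1, 2, 2), 2.0)])
confidences = np.array([[[0.9, 0.2]], [[0.4, 0.7]]])
fused, best, keep = select_flow_per_pixel(flows, confidences, min_conf=0.8)
```

In this toy case the first pixel takes its flow from algorithm 0 and the second from algorithm 1, while the mask flags the second pixel as a candidate for discarding.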
Semantic image segmentation is the task of assigning semantic labels to the pixels of a natural image. It is an important step towards general scene understanding and has lately received much attention in the computer vision community. It was found that detailed annotations of images are helpful for solving this task, but obtaining accurate and consistent annotations still proves to be difficult on a large scale. One possible way forward is to work with partial supervision and latent variable models to infer semantic annotations from the data during training.
The talk will present two approaches working with partial supervision for image segmentation. The first uses an efficient multi-instance formulation to obtain object class segmentations when trained on class labels alone. The second uses a latent CRF formulation to extract object parts based on object class segmentation.
In this talk I will present two lines of research which are both applied to the problem of stereo matching. The first line of research tries to make progress on the very traditional problem of stereo matching. In BMVC 11 we presented the PatchmatchStereo work which achieves surprisingly good results with a simple energy function consisting of unary terms only. As the optimization engine we used the PatchMatch method, which was designed for image editing purposes. In BMVC 12 we extended this work by adding the standard pairwise smoothness terms to the energy function. The main contribution of this work is the optimization technique, which we call PatchMatch-BeliefPropagation (PMBP). It is a special case of max-product Particle Belief Propagation, with a new sampling schema motivated by PatchMatch.
The method may be suitable for many energy minimization problems in computer vision that have a non-convex, continuous and potentially high-dimensional label space. The second line of research combines the problem of stereo matching with the problem of object extraction in the scene. We show that both tasks can be solved jointly and that doing so boosts the performance of each individual task. In particular, stereo matching improves since objects have to obey physical properties, e.g. they are not allowed to fly in the air. Object extraction improves, as expected, since we have additional information about depth in the scene.
Three-dimensional object shape is commonly represented in terms of deformations of a triangular mesh from an exemplar shape. In particular, statistical generative models of human shape deformation are widely used in computer vision, graphics, ergonomics, and anthropometry. Existing statistical models, however, are based on a Euclidean representation of shape deformations. In contrast, we argue that shape has a manifold structure: For example, averaging the shape deformations for two people does not necessarily yield a meaningful shape deformation, nor does the Euclidean difference of these two deformations provide a meaningful measure of shape dissimilarity. Consequently, we define a novel manifold for shape representation, with emphasis on body shapes, using a new Lie group of deformations. This has several advantages.
First, we define triangle deformations exactly, removing non-physical deformations and redundant degrees of freedom common to previous methods. Second, the Riemannian structure of Lie Bodies enables a more meaningful definition of body shape similarity by measuring distance between bodies on the manifold of body shape deformations. Third, the group structure allows the valid composition of deformations.
This is important for models that factor body shape deformations into multiple causes or represent shape as a linear combination of basis shapes. Similarly, interpolation between two mesh deformations results in a meaningful third deformation. Finally, body shape variation is modeled using statistics on manifolds. Instead of modeling Euclidean shape variation with Principal Component Analysis, we capture shape variation on the manifold using Principal Geodesic Analysis. Our experiments show consistent visual and quantitative advantages of Lie Bodies over traditional Euclidean models of shape deformation, and our representation can be easily incorporated into existing methods. This project is part of a larger effort that brings together statistics and geometry to model statistics on manifolds.
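The difference between Euclidean and manifold interpolation can be seen in a small toy case. The sketch below uses planar rotations as a stand-in Lie group; this is an assumption made purely for illustration and is not the group of triangle deformations defined in the talk. The helper names `rot2`, `log_so2`, and `geodesic_interp` are hypothetical.

```python
import numpy as np

def rot2(theta):
    """2D rotation matrix for angle theta (the exponential map of the planar rotation group)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def log_so2(R):
    """Logarithm map: recover the rotation angle from a 2D rotation matrix."""
    return np.arctan2(R[1, 0], R[0, 0])

def geodesic_interp(Ra, Rb, t):
    """Interpolate along the group geodesic: Ra * exp(t * log(Ra^T Rb))."""
    return Ra @ rot2(t * log_so2(Ra.T @ Rb))

A, B = rot2(0.0), rot2(np.pi / 2)

# Euclidean average of the matrix entries: no longer a rotation (it shrinks lengths).
euclid_mid = 0.5 * (A + B)
# Geodesic midpoint: stays on the manifold and is the 45-degree rotation.
geo_mid = geodesic_interp(A, B, 0.5)

euclid_det = np.linalg.det(euclid_mid)
geo_det = np.linalg.det(geo_mid)
```

Here the Euclidean midpoint has determinant 0.5, so it is not a valid group element, whereas the geodesic midpoint has determinant 1. The same reasoning motivates replacing linear averages and PCA with their manifold counterparts.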
Our research on manifold-valued statistics addresses the problem of modeling statistics in curved feature spaces. We try to find the geometrically most natural representations that respect the constraints; e.g. by modeling the data as belonging to a Lie group or a Riemannian manifold. We take a geometric approach as this keeps the focus on good distance measures, which are essential for good statistics. I will also present some recent unpublished results related to statistics on manifolds with broad application.
We first address the problem of large-scale image classification. We present and evaluate different ways of aggregating local image descriptors into a vector and show that the Fisher kernel achieves better performance than the reference bag-of-visual-words approach for any given vector dimension. We show and interpret the importance of an appropriate vector normalization.
Furthermore, we discuss how to learn from a large number of classes and images with stochastic gradient descent and show results on ImageNet10k. We then present a weakly supervised approach for learning human actions modeled as interactions between humans and objects.
Our approach is human-centric: we first localize a human in the image and then determine the object relevant for the action and its spatial relation with the human. The model is learned automatically from a set of still images annotated (only) with the action label.
Finally, we present work on learning object detectors from real-world web videos known only to contain objects of a target class. We propose a fully automatic pipeline that localizes objects in a set of videos of the class and learns a detector for it. The approach extracts candidate spatio-temporal tubes based on motion segmentation and then selects one tube per video jointly over all videos.