Growth of the internet and social media has spurred the sharing and dissemination of personal data at large scale. At the same time, recent developments in computer vision has enabled unseen effectiveness and efficiency in automated recognition. It is clear that visual data contains private information that can be mined, yet the privacy implications of sharing such data have been less studied in computer vision community. In the talk, I will present some key results from our study of the implications of the development of computer vision on the identifiability in social media, and an analysis of existing and new anonymisation techniques. In particular, we show that adversarial image perturbations (AIP) introduce human invisible perturbations on the input image that effectively misleads a recogniser. They are far more aesthetic and effective compared to e.g. face blurring. The core limitation, however, is that AIPs are usually generated against specific target recogniser(s), and it is hard to guarantee the performance against uncertain, potentially adaptive recognisers. As a first step towards dealing with the uncertainty, we have introduced a game theoretical framework to obtain the user’s privacy guarantee independent of the randomly chosen recogniser (within some fixed set).
Organizers: Siyu Tang
In the recent years, commodity 3D sensors have become easily and widely available. These advances in sensing technology have spawned significant interest in using captured 3D data for mapping and semantic understanding of 3D environments. In this talk, I will give an overview of our latest research in the context of 3D reconstruction of indoor environments. I will further talk about the use of 3D data in the context of modern machine learning techniques. Specifically, I will highlight the importance of training data, and how can we efficiently obtain labeled and self-supervised ground truth training datasets from captured 3D content. Finally, I will show a selection of state-of-the-art deep learning approaches, including discriminative semantic labeling of 3D scenes and generative reconstruction techniques.
Organizers: Despoina Paschalidou
Autonomous systems rely on learning from experience to automatically refine their strategy and adapt to their environment, and thereby have huge advantages over traditional hand engineered systems. At PROWLER.io we use reinforcement learning (RL) for sequential decision making under uncertainty to develop intelligent agents capable of acting in dynamic and unknown environments. In this talk we first give a general overview of the goals and the research conducted at PROWLER.io. Then, we will talk about two specific research topics. The first is Information-Theoretic Model Uncertainty which deals with the problem of making robust decisions that take into account unspecified models of the environment. The second is Deep Model-Based Reinforcement Learning which deals with the problem of learning the transition and the reward function of a Markov Decision Process in order to use it for data-efficient learning.
Organizers: Michel Besserve
The emergent field of probabilistic numerics has thus far lacked rigorous statistical foundations. We establish that a class of Bayesian probabilistic numerical methods can be cast as the solution to certain non-standard Bayesian inverse problems. This allows us to establish general conditions under which Bayesian probabilistic numerical methods are well-defined, encompassing both non-linear models and non-Gaussian prior distributions. For general computation, a numerical approximation scheme is developed and its asymptotic convergence is established. The theoretical development is then extended to pipelines of numerical computation, wherein several probabilistic numerical methods are composed to perform more challenging numerical tasks. The contribution highlights an important research frontier at the interface of numerical analysis and uncertainty quantification, with some illustrative applications presented.
Organizers: Michael Schober
Our world is dynamic and three-dimensional. Understanding the 3D layout of scenes and the motion of objects is crucial for successfully operating in such an environment. I will talk about two lines of recent research in this direction. One is on end-to-end learning of motion and 3D structure: optical flow estimation, binocular and monocular stereo, direct generation of large volumes with convolutional networks. The other is on sensorimotor control in immersive three-dimensional environments, learned from experience or from demonstration.
We transfer a monocular motion stereo 3D reconstruction algorithm from a mobile device (Google Project Tango Tablet) to a rigidly mounted external camera of higher image resolution. A reliable camera synchronization is crucial for the usability of the tablets IMU data and thus a time synchronization method developed. It is based on the joint movement of the cameras. In a second project, we move from outdoor video scenes to aerial images and strive to segment them into polygonal shapes. While most existing approaches address the problem of automated generation of online maps as a pixel-wise segmentation task, we instead frame this problem as constructing polygons representing objects. An approach based on Faster R-CNN, a successful object detection algorithm, is presented.
Organizers: Siyu Tang
We propose a new architecture for the learning of predictive spatio-temporal motion models from data alone. Our approach, dubbed the Dropout Autoencoder LSTM, is capable of synthesizing natural looking motion sequences over long time horizons without catastrophic drift or mo- tion degradation. The model consists of two components, a 3-layer recurrent neural network to model temporal aspects and a novel auto-encoder that is trained to implicitly recover the spatial structure of the human skeleton via randomly removing information about joints during train- ing time. This Dropout Autoencoder (D-AE) is then used to filter each predicted pose of the LSTM, reducing accumulation of error and hence drift over time. Furthermore, we propose new evaluation protocols to assess the quality of synthetic motion sequences even for which no groundtruth data exists. The proposed protocols can be used to assess generated sequences of arbitrary length. Finally, we evaluate our proposed method on two of the largest motion- capture datasets available to date and show that our model outperforms the state-of-the-art on a variety of actions, including cyclic and acyclic motion, and that it can produce natural looking sequences over longer time horizons than previous methods.
Organizers: Gerard Pons-Moll
Estimating 3D shape from monocular 2D images is a challenging and ill-posed problem. Some of these challenges can be alleviated if 3D shape priors are taken into account. In the field of human body shape estimation, research has shown that accurate 3D body estimations can be achieved through optimization, by minimizing error functions on image cues, such as e.g. the silhouette. These methods though, tend to be slow and typically require manual interactions (e.g. for pose estimation). In this talk, we present some recent works that try to overcome such limitations, achieving interactive rates, by learning mappings from 2D image to 3D shape spaces, utilizing data-driven priors, generated from statistically learned parametric shape models. We demonstrate this, either by extracting handcrafted features or directly utilizing CNN-s. Furthermore, we introduce the notion and application of cross-modal or multi-view learning, where abundance of data coming from various views representing the same object at training time, can be leveraged in a semi-supervised setting to boost estimations at test time. Additionally, we show similar applications of the above techniques for the task of 3D garment estimation from a single image.
Organizers: Gerard Pons-Moll
Human observers can classify photographs of real-world scenes after only a very brief exposure to the image (Potter & Levy, 1969; Thorpe, Fize, Marlot, et al., 1996; VanRullen & Thorpe, 2001). Line drawings of natural scenes have been shown to capture essential structural information required for successful scene categorization (Walther et al., 2011). Here, we investigate how the spatial relationships between lines and line segments in the line drawings affect scene classification. In one experiment, we tested the effect of removing either the junctions or the middle segments between junctions. Surprisingly, participants performed better when shown the middle segments (47.5%) than when shown the junctions (42.2%). It appeared as if the images with middle segments tended to maintain the most parallel/locally symmetric portions of the contours. In order to test this hypothesis, in a second experiment, we either removed the most symmetric half of the contour pixels or the least symmetric half of the contour pixels using a novel method of measuring the local symmetry of each contour pixel in the image. Participants were much better at categorizing images containing the most symmetric contour pixels (49.7%) than the least symmetric (38.2%). Thus, results from both experiments demonstrate that local contour symmetry is a crucial organizing principle in complex real-world scenes. Joint work with John Wilder (UofT CS, Psych), Morteza Rezanejad (McGill CS), Kaleem Siddiqi (McGill CS), Allan Jepson (UofT CS), and Dirk Bernhardt-Walther (UofT Psych), to be presented at VSS 2017.
Organizers: Ahmed Osman
Probabilistic deep learning methods have recently made great progress for generative and discriminative modeling. I will give a brief overview of recent developments and then present two contributions. The first is on a generalization of generative adversarial networks (GAN), extending their use considerably. GANs can be shown to approximately minimize the Jensen-Shannon divergence between two distributions, the true sampling distribution and the model distribution. We extend GANs to the class of f-divergences which include popular divergences such as the Kullback-Leibler divergence. This enables applications to variational inference and likelihood-free maximum likelihood, as well as enables GAN models to become basic building blocks in larger models. The second contribution is to consider representation learning using variational autoencoder models. To make learned representations of data useful we need ground them in semantic concepts. We propose a generative model that can decompose an observation into multiple separate latent factors, each of which represents a separate concept. Such disentangled representation is useful for recognition and for precise control in generative modeling. We learn our representations using weak supervision in the form of groups of observations where all samples within a group share the same value in a given latent factor. To make such learning feasible we generalize recent methods for amortized probabilistic inference to the dependent case. Joint work with: Ryota Tomioka (MSR Cambridge), Botond Cseke (MSR Cambridge), Diane Bouchacourt (Oxford)
Organizers: Lars Mescheder
Dynamic events such as family gatherings, concerts or sports events are often photographed by a group of people. The set of still images obtained this way is rich in dynamic content. We consider the question of whether such a set of still images, rather the traditional video sequences, can be used for analyzing the dynamic content of the scene. This talk will describe several instances of this problem, their solutions and directions for future studies. In particular, we will present a method to extend epipolar geometry to predict location of a moving feature in CrowdCam images. The method assumes that the temporal order of the set of images, namely photo-sequencing, is given. We will briefly describe our method to compute photo-sequencing using geometric considerations and rank aggregation. We will also present a method for identifying the moving regions in a scene, which is a basic component in dynamic scene analysis. Finally, we will consider a new vision of developing collaborative CrowdCam, and a first step toward this goal.
Organizers: Jonas Wulff