In this talk I will present an overview and the latest results of the project Aerial Outdoor Motion Capture (AirCap), running at the Perceiving Systems department. AirCap's goal is to achieve markerless and unconstrained human motion capture (MoCap) in unknown and unstructured outdoor environments. To this end, we have developed a flying MoCap system using a team of autonomous aerial robots with on-board, monocular RGB cameras. Our system is endowed with a range of novel functionalities, which were developed by our group over the last 3 years. These include (i) cooperative detection and tracking that enables DNN-based detectors on board flying robots, (ii) active cooperative perception in aerial robot teams to minimize joint tracking uncertainty, and (iii) markerless human pose and shape estimation using images acquired from multiple views and approximately calibrated cameras. We have conducted several real experiments along with ground truth comparisons to validate our system. Overall, for outdoor scenarios we have demonstrated the first fully autonomous flying MoCap system involving multiple aerial robots.
Organizers: Katherine J. Kuchenbecker
In this talk I will consider the problem of scene-level inverse rendering to recover shape, reflectance and lighting from a single, uncontrolled, outdoor image. This task is highly ill-posed, but we show that multiview self-supervision, a natural lighting prior and implicit lighting estimation allow an image-to-image CNN to solve the task, seemingly learning some general principles of shape-from-shading along the way. By adding a neural renderer and a sky generator GAN, our approach allows us to synthesise photorealistic relit images under widely varying illumination. I will finish by briefly describing recent work in which some of these ideas have been combined with deep face model fitting, replacing parameter regression with correspondence prediction and thereby enabling fully unsupervised training.
Organizers: Timo Bolkart
Licklider and Taylor (1968) envisioned computational machinery that could enable better communication between humans than face-to-face interaction. In the last fifty years, we have used computing to develop various means of communication, such as mail, messaging, phone calls, video conversation, and virtual reality. These are, however, proxies for face-to-face communication that aim at encoding words, expressions, emotions, and body language at the source and decoding them reliably at the destination. The true revolution of personal computing has not begun yet because we have not been able to tap the real potential of computing for social communication. Computational machinery that can understand and create a four-dimensional audio-visual world can enable humans to describe their imagination and share it with others. In this talk, I will introduce the Computational Studio: an environment that allows non-specialists to construct and creatively edit the 4D audio-visual world from sparse audio and video samples. The Computational Studio aims to enable everyone to relive old memories through a form of virtual time travel, to automatically create new experiences, and to share them with others using everyday computational devices. There are three essential components of the Computational Studio: (1) how can we capture the 4D audio-visual world?; (2) how can we synthesize the audio-visual world using examples?; and (3) how can we interactively create and edit the audio-visual world? The first part of this talk introduces the work on capturing and browsing the in-the-wild 4D audio-visual world in a self-supervised manner and efforts on building a multi-agent capture system. This work has applications in social communication, digitizing intangible cultural heritage, capturing tribal dances and wildlife in the natural environment, and understanding the social behavior of human beings.
In the second part, I will talk about example-based audio-visual synthesis in an unsupervised manner. Example-based audio-visual synthesis allows us to express ourselves easily. Finally, I will talk about interactive visual synthesis that allows us to manually create and edit visual experiences. Here I will also stress the importance of thinking about the human user and computational devices when designing content creation applications. The Computational Studio is a first step towards unlocking the full degree of creative imagination, which is currently confined to the human mind by the limits of the individual's expressivity and skill. It has the potential to change the way we audio-visually communicate with others.
Accurate 3D human pose estimation has been a longstanding goal in computer vision. However, until now, it has had only limited success, and only in easy scenarios such as studios, which have little occlusion. In this talk, I will present two of our works that aim to address the occlusion problem in realistic scenarios. In the first work, we present an approach to recover the absolute 3D pose of a single person from multi-view images by incorporating multi-view geometric priors in our model. It consists of two separate steps: (1) estimating the 2D poses in multi-view images and (2) recovering the 3D poses from the multi-view 2D poses. First, we introduce a cross-view fusion scheme into a CNN to jointly estimate 2D poses for multiple views. Consequently, the 2D pose estimation for each view already benefits from the other views. Second, we present a recursive Pictorial Structure Model to recover the 3D pose from the multi-view 2D poses. It gradually improves the accuracy of the 3D pose at affordable computational cost. In the second work, we present a 3D pose estimator which allows us to reliably estimate and track people in crowded scenes. In contrast to previous efforts, which require establishing cross-view correspondence based on noisy and incomplete 2D pose estimates, we present an end-to-end solution which operates directly in 3D space and therefore avoids making incorrect hard decisions in 2D space. To achieve this goal, the features in all camera views are warped and aggregated in a common 3D space, and fed to a Cuboid Proposal Network (CPN) to coarsely localize all people. Then we propose a Pose Regression Network (PRN) to estimate a detailed 3D pose for each proposal. The approach is robust to occlusion, which occurs frequently in practice. Without bells and whistles, it significantly outperforms the state of the art on the benchmark datasets.
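To make step (2) above concrete, the following is a minimal sketch of how a single 3D joint can be recovered from its 2D detections in several calibrated views. It deliberately uses plain linear least-squares triangulation, not the recursive Pictorial Structure Model described in the talk; the two camera matrices, the joint position, and the helper names (`project`, `solve3`, `triangulate`) are assumptions made up for illustration.

```python
def project(P, point):
    """Project a 3D point with a 3x4 camera matrix P and return pixel (u, v)."""
    hom = list(point) + [1.0]
    x, y, w = (sum(P[r][c] * hom[c] for c in range(4)) for r in range(3))
    return x / w, y / w


def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, 3):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, 4):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        rest = sum(aug[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (aug[r][3] - rest) / aug[r][r]
    return x


def triangulate(projections, pixels):
    """Least-squares 3D point from two or more (camera matrix, pixel) pairs."""
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for P, (u, v) in zip(projections, pixels):
        for coord, row in ((u, P[0]), (v, P[1])):
            # Each pixel coordinate gives one linear constraint:
            # (coord * P[2] - row) . [X, Y, Z, 1] = 0
            a = [coord * P[2][j] - row[j] for j in range(4)]
            for i in range(3):
                atb[i] -= a[i] * a[3]
                for j in range(3):
                    ata[i][j] += a[i] * a[j]
    return solve3(ata, atb)


# Hypothetical setup: two identical cameras, the second shifted 1 unit along x.
cam_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
cam_b = [[1.0, 0.0, 0.0, -1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
true_joint = (0.2, -0.1, 4.0)

detections = [project(cam_a, true_joint), project(cam_b, true_joint)]
estimate = triangulate([cam_a, cam_b], detections)
print(estimate)
```

With noise-free detections the estimate coincides with the true joint up to floating-point error. With noisy or occluded 2D detections, this kind of independent per-joint triangulation degrades, which is the situation that structured models and direct 3D reasoning are meant to handle.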
Organizers: Chun-Hao Paul Huang
Traditional voice conversion methods rely on parallel recordings of multiple speakers pronouncing the same sentences. For real-world applications, however, parallel data is rarely available. We propose MelGAN-VC, a voice conversion method that relies on non-parallel speech data and is able to convert audio signals of arbitrary length from a source voice to a target voice. We first compute spectrograms from waveform data and then perform a domain translation using a Generative Adversarial Network (GAN) architecture. An additional siamese network helps preserve speech information in the translation process, without sacrificing the ability to flexibly model the style of the target speaker. We test our framework with a dataset of clean speech recordings, as well as with a collection of noisy real-world speech examples. Finally, we apply the same method to perform music style transfer, translating arbitrarily long music samples from one genre to another, and showing that our framework is flexible and can be used for audio manipulation applications other than voice conversion.
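As an illustration of the first step mentioned above, computing a spectrogram from waveform data, here is a minimal sketch of a magnitude short-time Fourier transform in pure Python. It is not the MelGAN-VC preprocessing itself: the frame length, hop size, Hann window, test tone, and the function name `magnitude_spectrogram` are all assumptions chosen for the example, and no mel filterbank is applied.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def magnitude_spectrogram(signal, frame_len=64, hop=32):
    """Return a list of frames, each a list of DFT magnitudes for bins 0..frame_len/2."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT: fine for a short demo, far too slow for real audio.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Hypothetical test signal: a 1 kHz tone sampled at 8 kHz, 256 samples long.
sample_rate = 8000
tone_hz = 1000.0
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(256)]

spec = magnitude_spectrogram(signal)
# Index of the strongest frequency bin in every frame.
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
print(len(spec), len(spec[0]), peak_bins)
```

Each frequency bin spans `sample_rate / frame_len` = 125 Hz here, so the 1 kHz tone falls in bin 8 of every frame. A practical pipeline would use an FFT routine from a numerical library rather than the naive DFT above.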
In this talk I will present an overview of our recent works that learn deep geometric models for the 3D face from large datasets of scans. Priors for the 3D face are crucial for many applications: to constrain ill-posed problems such as 3D reconstruction from monocular input, for efficient generation and animation of 3D virtual avatars, or even in medical domains such as recognition of craniofacial disorders. Generative models of the face have been widely used for this task, as well as deep learning approaches that have recently emerged as a robust alternative. Barring a few exceptions, most of these data-driven approaches were built either from a relatively limited number of samples (in the case of linear models of the shape) or by synthetic data augmentation (for deep-learning-based approaches), mainly due to the difficulty in obtaining large-scale and accurate 3D scans of the face. Yet, there is a substantial amount of 3D information that can be gathered when considering publicly available datasets that have been captured over the last decade. I will discuss here our works that tackle the challenges of building rich geometric models out of these large and varied datasets, with the goal of modeling the facial shape, expression (i.e. motion) or geometric details. Concretely, I will talk about (1) an efficient and fully automatic approach for registration of large datasets of 3D faces in motion; (2) deep learning methods for modeling the facial geometry that can disentangle the shape and expression aspects of the face; and (3) a multi-modal learning approach for capturing geometric details from images in-the-wild, by simultaneously encoding both facial surface normal and natural image information.
Organizers: Jinlong Yang
Motivated by the low-voltage-driven actuation of ionic Electroactive Polymers (iEAPs), we recently began investigating ionic elastomers. In this talk I will discuss the preparation, physical characterization and electric bending actuation properties of two novel ionic elastomers: ionic polymer electrolyte membranes (iPEM) and ionic liquid crystal elastomers (iLCE). Both materials can be actuated by low-frequency AC or DC voltages of less than 1 V. The bending actuation properties of the iPEMs outperform those of most well-developed iEAPs, and the first, not yet optimized iLCEs are already comparable to them. Ionic liquid crystal elastomers also exhibit superior features, such as alignment-dependent actuation, which offers the possibility of pre-programming the actuation pattern during the cross-linking process. Additionally, multiple (thermal, optical and electric) actuations are also possible. I will also discuss issues with compliant electrodes and possible soft robotic applications. References: Y. Bar-Cohen, Electroactive Polymer Actuators as Artificial Muscles: Reality, Potential and Challenges, SPIE Press, Bellingham, 2004. O. Kim, S. J. Kim, M. J. Park, Chem. Commun. 2018, 54, 4895. C. P. H. Rajapaksha, C. Feng, C. Piedrahita, J. Cao, V. Kaphle, B. Lüssem, T. Kyu, A. Jákli, Macromol. Rapid Commun. 2020, in print. C. Feng, C. P. H. Rajapaksha, J. M. Cedillo, C. Piedrahita, J. Cao, V. Kaphle, B. Lüssem, T. Kyu, A. I. Jákli, Macromol. Rapid Commun. 2019, 1900299.
In this talk I will discuss the development of functional materials and their application in modulating the biological microenvironment during cellular sensing and signal transduction. First, I’ll briefly summarize the mechanical, biochemical and physicochemical material properties that influence cellular sensing and subsequent integration with the tissues at the macroscale. Controlling signal transduction at the submicron scale, however, requires careful materials engineering to address the need for minimally invasive targeting of single proteins and for providing sufficient physical stimuli for cellular signaling. I will discuss an approach to fabricate anisotropic magnetite nanodiscs (MNDs) which can be used as torque transducers for mechanosensory cells under weak, slowly varying magnetic fields (MFs). When MNDs are coupled to MFs, their magnetization transitions between a vortex and an in-plane state, leading to torques on the pN scale, sufficient to activate mechanosensitive ion channels in neuronal cell membranes. This approach opens new avenues for studies of biological mechanoreception and provides new tools for minimally invasive neuromodulation technology.
Optoacoustic imaging is increasingly attracting the attention of the biomedical research community due to its excellent spatial and temporal resolution, centimeter-scale penetration into living tissues, and versatile endogenous and exogenous optical absorption contrast. State-of-the-art implementations of multi-spectral optoacoustic tomography (MSOT) are based on multi-wavelength excitation of tissues to visualize specific molecules within opaque tissues. As a result, the technology can noninvasively deliver structural, functional, metabolic, and molecular information from living tissues. The talk covers the most recent advances pertaining to ultrafast imaging instrumentation, multi-modal combinations with optical and ultrasound methods, intelligent reconstruction algorithms, as well as smart optoacoustic contrast and sensing approaches. Our current efforts are also geared toward exploring the potential of the technique in studying multi-scale dynamics of the brain and heart, monitoring of therapies, fast tracking of cells and targeted molecular imaging applications. MSOT further allows for handheld operation and thus offers a new level of precision for clinical diagnostics of patients in a number of indications, such as breast and skin lesions, inflammatory diseases and cardiovascular diagnostics.
Organizers: Metin Sitti
Machine learning allows automated systems to identify structures and physical laws based on measured data, which is particularly useful in areas where an analytic derivation of a model is too tedious or not possible. Research in reinforcement learning has led to impressive results and superhuman performance in well-structured tasks and games. However, to this day, data-driven models are rarely employed in the control of safety-critical systems, because the success of a controller based on these models cannot be guaranteed. Therefore, the research presented in this talk analyzes the closed-loop behavior of learning control laws by means of rigorous proofs. More specifically, we propose a control law based on Gaussian process (GP) models, which actively avoids uncertainties in the state space and favors trajectories along the training data, where the system is well known. We show that this behavior is optimal as it maximizes the probability of asymptotic stability. Additionally, we consider an event-triggered online learning control law, which safely explores an initially unknown system. It only takes new training data whenever the uncertainty in the system becomes too large. As the control law only requires a locally precise model, this novel learning strategy has a high data efficiency and provides safety guarantees.
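The event-triggering idea described above, taking a new training point only when the model uncertainty becomes too large, can be sketched with the posterior variance of a GP. The example below is only an illustration of the triggering rule and does not include a control law or a stability proof; the squared-exponential kernel, its length scale, the noise level, the variance threshold, and the sequence of visited states are all assumptions invented for the sketch.

```python
import math


def rbf(a, b, length=0.5):
    """Squared-exponential kernel with unit signal variance (assumed)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(matrix, rhs):
    """Solve a dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        rest = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - rest) / aug[r][r]
    return x


def posterior_variance(x, train_x, noise=1e-3):
    """GP posterior variance at x; it depends only on where data was taken."""
    if not train_x:
        return rbf(x, x)
    gram = [
        [rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(train_x)]
        for i, a in enumerate(train_x)
    ]
    k_star = [rbf(x, a) for a in train_x]
    weights = solve(gram, k_star)
    return rbf(x, x) - sum(w * k for w, k in zip(weights, k_star))


def event_triggered_collection(states, threshold=0.05):
    """Store a state as a training input only when the model is too uncertain there."""
    train_x = []
    for x in states:
        if posterior_variance(x, train_x) > threshold:
            train_x.append(x)  # trigger event: a measurement would be taken here
    return train_x


# Hypothetical trajectory: the system sweeps through states 0.00, 0.02, ..., 2.00.
states = [0.02 * i for i in range(101)]
train_x = event_triggered_collection(states)
print(len(states), len(train_x))
```

In this sketch the number of stored points is much smaller than the number of visited states, which mirrors the data-efficiency argument in the abstract: samples are added only where the model is not yet locally precise.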
Organizers: Sebastian Trimpe