Matteo Poggi | Tutorials

2024

Stereo Matching in the Twenties. Poggi Matteo and Tosi Fabio (CVPR)
For decades, stereo matching was approached by developing hand-crafted algorithms that measure the visual similarity between local patterns in the two images and propagate this information globally. Since 2015, deep learning has driven a paradigm shift in this field, leading the community to design end-to-end deep networks capable of matching pixels. This revolution brought stereo matching to a whole new level of accuracy, yet not without drawbacks. Indeed, some hard challenges remained unsolved by the first generation of deep stereo models, which often could not generalize properly across domains (e.g., from synthetic to real, or from indoor to outdoor) or deal with high-resolution images. This was, however, three years ago. The research community has tackled these and other challenges in the Twenties, making deep stereo matching more mature and better suited as a practical solution for everyday applications. For instance, we now have networks that generalize much better from synthetic to real images, handle high-resolution images, and even estimate disparity correctly in the presence of non-Lambertian surfaces, which are known to be among the ill-posed challenges for stereo. Accordingly, in this tutorial, we aim to give a comprehensive overview of the state of the art in deep stereo matching, of the architectural designs that have been crucial to reaching this level of maturity, and of how to select the best solution for estimating depth from stereo in real applications.

2020

Facing depth estimation in-the-wild with deep networks. Poggi Matteo, Tosi Fabio, Aleotti Filippo, Batsos Konstantinos, Mordohai Philippos, and Mattoccia Stefano (ECCV)
Obtaining dense and accurate depth measurements from images is of paramount importance for many 3D computer vision applications. In recent years, stereo and monocular depth estimation have become the most popular techniques for this purpose, with deep neural networks consistently improving the state of the art. However, in contrast to hand-crafted algorithms, deep learning solutions are particularly data dependent. Indeed, better performance is achieved by training on large and diverse datasets, either real or generated through computer graphics. Nevertheless, deep networks always suffer non-negligible drops in performance when moving to different domains. These drops can be catastrophic, and they are likely to occur when a deep network is deployed in the wild. Therefore, in this tutorial, we aim to highlight the limitations of deep neural networks for stereo and monocular depth estimation and how far we are from their unconstrained deployment in the wild. Then, we will introduce very recent practices aimed at shrinking the gap in performance between academic datasets and the real world.
Learning and understanding single image depth estimation in the wild. Poggi Matteo, Tosi Fabio, Aleotti Filippo, Mattoccia Stefano, Godard Clément, Watson Jamie, Firman Michael, and Brostow Gabriel J. (CVPR)
Depth estimation from a single still image, often referred to as depth-from-mono, was long considered barely feasible. With the advent of deep learning, it has emerged as a viable and effective alternative to more complicated setups, as witnessed by the compelling results achieved in very recent years. In this context, particularly appealing is the possibility of learning monocular depth estimation through geometry in a self-supervised manner, replacing the need for ground truth depth annotation (hard to source and limiting practical deployment) with multiple images of the same scene acquired from different viewpoints. This research topic has been extraordinarily active in the last three years, yielding a vast and ever-increasing number of papers and novel contributions published in top-level computer vision venues. For these reasons, in this tutorial, we aim to give a comprehensive overview of advances in self-supervised depth-from-mono research, highlighting the rapid evolution of this topic, how fast its popularity is growing and the challenges that nonetheless remain open. Throughout the tutorial, we will introduce the audience to monocular depth estimation, pointing out the main strategies to source self-supervision from images in order to replace traditional supervision from ground truth labels, then exploring additional forms of weak supervision obtained by means of hand-crafted algorithms or multi-task frameworks. We plan a hands-on session dedicated to deployment on smartphones, discussing tools and practices to run monocular depth estimation on the most popular devices. To conclude, we will give an overview of open challenges by discussing recent works on uncertainty estimation and on the interpretability of what neural networks learn in order to estimate depth from single images.

2019

Learning-based depth estimation from stereo and monocular images: successes, limitations and future challenges. Poggi Matteo, Tosi Fabio, Batsos Konstantinos, Mordohai Philippos, and Mattoccia Stefano (CVPR)
Obtaining dense and accurate depth measurements is of paramount importance for many 3D computer vision applications. Stereo matching has undergone a paradigm shift in the last few years due to the introduction of learning-based methods that replaced heuristics and hand-crafted rules. While in early 2012 the KITTI dataset highlighted how stereo matching was still an open problem, the recent success of Convolutional Neural Networks has led to tremendous progress and has established these methods as the undisputed state of the art. Similar observations can be made on all recent benchmarks, such as KITTI 2012 and 2015, Middlebury 2014 and ETH3D, whose leaderboards are dominated by learning-based methods. The tutorial will cover conventional and deep learning methods that have replaced the components of the conventional stereo matching pipeline, end-to-end stereo systems and confidence estimation. The second part will focus on related problems, specifically single-view depth estimation and multi-view stereo, which have also benefited from the availability of ground truth datasets and learning algorithms. The tutorial will conclude with open problems, including generalization as well as unsupervised and weakly supervised training.

2018

Learning-based depth estimation from stereo and monocular images: successes, limitations and future challenges. Poggi Matteo, Tosi Fabio, Batsos Konstantinos, Mordohai Philippos, and Mattoccia Stefano (3DV)
Obtaining dense and accurate depth measurements is of paramount importance for many 3D computer vision applications. Stereo matching has undergone a paradigm shift in the last few years due to the introduction of learning-based methods that replaced heuristics and hand-crafted rules. While in early 2012 the KITTI dataset highlighted how stereo matching was still an open problem, the recent success of Convolutional Neural Networks has led to tremendous progress and has established these methods as the undisputed state of the art. Similar observations can be made on all recent benchmarks, such as KITTI 2012 and 2015, Middlebury 2014 and ETH3D, whose leaderboards are dominated by learning-based methods. The tutorial will cover conventional and deep learning methods that have replaced the components of the conventional stereo matching pipeline, end-to-end stereo systems and confidence estimation. The second part will focus on related problems, specifically single-view depth estimation and multi-view stereo, which have also benefited from the availability of ground truth datasets and learning algorithms. The tutorial will conclude with open problems, including generalization as well as unsupervised and weakly supervised training.