Publications
Journals [22]
2024
- Neural Disparity Refinement. Tosi Fabio, Aleotti Filippo, Zama Ramirez Pierluigi, Poggi Matteo, Salti Samuele, Mattoccia Stefano, and Di Stefano Luigi. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
We propose a framework that combines traditional, hand-crafted algorithms and recent advances in deep learning to obtain high-quality, high-resolution disparity maps from stereo images. By casting the refinement process as a continuous feature sampling strategy, our neural disparity refinement network can estimate an enhanced disparity map at any output resolution. Our solution can process any disparity map produced by classical stereo algorithms, as well as those predicted by modern stereo networks or even different depth-from-images approaches, such as the COLMAP structure-from-motion pipeline. Nonetheless, when deployed in the former configuration, our framework performs at its best in terms of zero-shot generalization from synthetic to real images. Moreover, its continuous formulation allows for easily handling the unbalanced stereo setups widespread in mobile phones.
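A minimal sketch of the continuous formulation described above: features are sampled at real-valued coordinates and a small MLP decodes one disparity per query point, so the map can be evaluated at any output resolution. The module name, feature size, and MLP shape are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousDisparityHead(nn.Module):
    """Decode disparity at arbitrary (x, y) query points from a feature map."""

    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.ReLU(),  # disparities are non-negative
        )

    def forward(self, feat, coords):
        # feat: (B, C, H, W) encoder features; coords: (B, N, 2), (x, y) in [-1, 1].
        grid = coords.unsqueeze(1)                     # (B, 1, N, 2)
        sampled = F.grid_sample(feat, grid, mode='bilinear',
                                align_corners=True)    # (B, C, 1, N)
        sampled = sampled.squeeze(2).permute(0, 2, 1)  # (B, N, C)
        return self.mlp(torch.cat([sampled, coords], dim=-1))  # (B, N, 1)

# Querying a denser grid than the input yields a higher-resolution disparity map.
head = ContinuousDisparityHead()
feat = torch.randn(1, 64, 60, 80)
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 480),
                        torch.linspace(-1, 1, 640), indexing='ij')
coords = torch.stack([xs, ys], dim=-1).reshape(1, -1, 2)
disp = head(feat, coords).reshape(1, 480, 640)
```

Because the head is queried per coordinate rather than per pixel of a fixed grid, the same model can serve balanced and unbalanced stereo setups alike.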
- RGB Guided ToF Imaging System: A Survey of Deep Learning-based Methods. Qiao Xin, Poggi Matteo, Deng Pengchao, Wei Hao, Ge Chenyang, and Mattoccia Stefano. International Journal of Computer Vision (IJCV).
Integrating an RGB camera into a ToF imaging system has become a significant technique for perceiving the real world. The RGB guided ToF imaging system is crucial to several applications, including face anti-spoofing, saliency detection, and trajectory prediction. Depending on the distance of the working range, the implementation schemes of the RGB guided ToF imaging systems are different. Specifically, ToF sensors with a uniform field of illumination, which can output dense depth but have low resolution, are typically used for close-range measurements. In contrast, LiDARs, which emit laser pulses and can only capture sparse depth, are usually employed for long-range detection. In the two cases, depth quality improvement for RGB guided ToF imaging corresponds to two sub-tasks: guided depth super-resolution and guided depth completion. In light of the recent significant boost to the field provided by deep learning, this paper comprehensively reviews the works related to RGB guided ToF imaging, including network structures, learning strategies, evaluation metrics, benchmark datasets, and objective functions. Besides, we present quantitative comparisons of state-of-the-art methods on widely used benchmark datasets. Finally, we discuss future trends and the challenges in real applications for further research.
2023
- Booster: a Benchmark for Depth from Images of Specular and Transparent Surfaces. Zama Ramirez Pierluigi, Costanzino Alex, Tosi Fabio, Poggi Matteo, Salti Samuele, Mattoccia Stefano, and Di Stefano Luigi. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
Estimating depth from images nowadays yields outstanding results, both in terms of in-domain accuracy and generalization. However, we identify two main challenges that remain open in this field: dealing with non-Lambertian materials and effectively processing high-resolution images. Purposely, we propose a novel dataset that includes accurate and dense ground-truth labels at high resolution, featuring scenes containing several specular and transparent surfaces. Our acquisition pipeline leverages a novel deep space-time stereo framework, enabling easy and accurate labeling with sub-pixel precision. The dataset is composed of 606 samples collected in 85 different scenes; each sample includes both a high-resolution pair (12 Mpx) and an unbalanced stereo pair (Left: 12 Mpx, Right: 1.1 Mpx), typical of modern mobile devices that mount sensors with different resolutions. Additionally, we provide manually annotated material segmentation masks and 15K unlabeled samples. The dataset is composed of a train set and two test sets, the latter devoted to the evaluation of stereo and monocular depth estimation networks. Our experiments highlight the open challenges and future research directions in this field.
- Depth Super-Resolution from Explicit and Implicit High-Frequency Features. Qiao Xin, Ge Chenyang, Zhang Youmin, Zhou Yanhui, Tosi Fabio, Poggi Matteo, and Mattoccia Stefano. Computer Vision and Image Understanding (CVIU).
Guided depth super-resolution aims at using a low-resolution depth map and an associated high-resolution RGB image to recover a high-resolution depth map. However, restoring precise and sharp edges near depth discontinuities and fine structures is still challenging for state-of-the-art methods. To alleviate this issue, we propose a novel multi-stage depth super-resolution network, which progressively reconstructs HR depth maps from explicit and implicit high-frequency information. We introduce an efficient transformer to obtain explicit high-frequency information. The shape bias and global context of the transformer allow our model to focus on high-frequency details between objects, i.e., depth discontinuities, rather than texture within objects. Furthermore, we project the input color images into the frequency domain for additional implicit high-frequency cues extraction. Finally, to incorporate the structural details, we develop a fusion strategy that combines depth features and high-frequency information in the multi-stage-scale framework. Exhaustive experiments on the main benchmarks show that our approach establishes a new state-of-the-art.
- Self-supervised Depth Super-resolution with Contrastive Multiview Pre-training. Qiao Xin, Ge Chenyang, Zhao Chaoqiang, Tosi Fabio, Poggi Matteo, and Mattoccia Stefano. Neural Networks (NN).
Many low-level vision tasks, including guided depth super-resolution (GDSR), struggle with the issue of insufficient paired training data. Self-supervised learning is a promising solution, but it remains challenging to upsample depth maps without the explicit supervision of high-resolution target images. To alleviate this problem, we propose a self-supervised depth super-resolution method with contrastive multiview pre-training. Unlike existing contrastive learning methods for classification or segmentation tasks, our strategy can be applied to regression tasks even when trained on a small-scale dataset and can reduce information redundancy by extracting unique features from the guide. Furthermore, we propose a novel mutual modulation scheme that can effectively compute the local spatial correlation between cross-modal features. Exhaustive experiments demonstrate that our method attains superior performance with respect to state-of-the-art GDSR methods and exhibits good generalization to other modalities.
2022
- Depth Restoration in Under-Display Time-of-Flight Imaging. Qiao Xin, Ge Chenyang, Deng Pengchao, Wei Hao, Poggi Matteo, and Mattoccia Stefano. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
Under-display imaging has recently received considerable attention in both academia and industry. As a variation of this technique, under-display ToF (UD-ToF) cameras enable depth sensing for full-screen devices. However, it also brings image blurring and reduces both the signal-to-noise ratio and the ranging accuracy. To address these issues, we propose a cascaded deep network to improve the quality of UD-ToF depth maps. The network comprises two subnets, with the first using a complex-valued network in the raw domain to perform denoising, deblurring and raw measurement enhancement jointly, while the second refines depth maps in the depth domain based on the proposed multi-scale depth enhancement block (MSDEB). To enable training, we establish a data acquisition device and construct a real UD-ToF dataset by collecting real paired ToF raw data. Besides, we also build a large-scale synthetic UD-ToF dataset through noise analysis. The quantitative and qualitative evaluation results on public datasets and ours demonstrate that the presented network outperforms state-of-the-art algorithms and can further promote full-screen devices in practical applications.
- Real-time Self-Supervised Monocular Depth Estimation Without GPU. Poggi Matteo, Tosi Fabio, Aleotti Filippo, and Mattoccia Stefano. IEEE Transactions on Intelligent Transportation Systems (T-ITS).
Single-image depth estimation represents a longstanding challenge in computer vision and although it is an ill-posed problem, deep learning enabled astonishing results leveraging both supervised and self-supervised training paradigms. State-of-the-art solutions achieve remarkably accurate depth estimation from a single image deploying huge deep architectures, requiring powerful dedicated hardware to run in a reasonable amount of time. This overly demanding complexity makes them unsuited for a broad category of applications requiring devices with constrained resources or memory consumption. To tackle this issue, in this paper a family of compact, yet effective CNNs for monocular depth estimation is proposed, by leveraging self-supervision from a binocular stereo rig. Compared to complex state-of-the-art models, our lightweight architectures, namely PyD-Net and PyD-Net2, trade a small drop in accuracy to drastically reduce the runtime and memory requirements by a factor ranging from 2x to 100x. Moreover, our networks can run real-time monocular depth estimation on a broad set of embedded or consumer devices, even not equipped with a GPU, by early stopping the inference with negligible (or no) loss in accuracy, making them ideally suited for real applications with strict constraints on hardware resources or power consumption.
- Monitoring social distancing with single image depth estimation. Mingozzi Alessio, Conti Andrea, Aleotti Filippo, Poggi Matteo, and Mattoccia Stefano. IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI).
The recent pandemic emergency raised many challenges regarding the countermeasures aimed at containing the virus spread, and constraining the minimum distance between people resulted in one of the most effective strategies. Thus, the implementation of autonomous systems capable of monitoring the so-called social distance gained much interest. In this paper, we aim to address this task leveraging a single RGB frame without additional depth sensors. In contrast to existing single-image alternatives failing when ground localization is not available, we rely on single image depth estimation to perceive the 3D structure of the observed scene and estimate the distance between people. During the setup phase, a straightforward calibration procedure, leveraging a scale-aware SLAM algorithm available even on consumer smartphones, allows us to address the scale ambiguity affecting single image depth estimation. We validate our approach through indoor and outdoor images employing a calibrated LiDAR + RGB camera asset. Experimental results highlight that our proposal enables sufficiently reliable estimation of the inter-personal distance to monitor social distancing effectively. This fact confirms that despite its intrinsic ambiguity, if appropriately driven, single image depth estimation can be a viable alternative to other depth perception techniques, more expensive and not always feasible in practical applications. Our evaluation also highlights that our framework can run reasonably fast and comparably to competitors, even on pure CPU systems. Moreover, its practical deployment on low-power systems is around the corner.
2021
- Continual Adaptation for Deep Stereo. Poggi Matteo, Tonioni Alessio, Tosi Fabio, Mattoccia Stefano, and Di Stefano Luigi. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
Depth estimation from stereo images is carried out with unmatched results by convolutional neural networks trained end-to-end to regress dense disparities. Like for most tasks, this is possible if large amounts of labelled samples are available for training, possibly covering the whole data distribution encountered at deployment time. Being such an assumption systematically unmet in real applications, the capacity of adapting to any unseen setting becomes of paramount importance. Purposely, we propose a continual adaptation paradigm for deep stereo networks designed to deal with challenging and ever-changing environments. We design a lightweight and modular architecture, Modularly ADaptive Network (MADNet), and formulate Modular ADaptation algorithms (MAD, MAD++) which permit efficient optimization of independent sub-portions of the entire network. In our paradigm, the learning signals needed to continuously adapt models online can be sourced from self-supervision via right-to-left image warping or from traditional stereo algorithms. With both sources, no other data than the input images being gathered at deployment time are needed. Thus, our network architecture and adaptation algorithms realize the first real-time self-adaptive deep stereo system and pave the way for a new paradigm that can facilitate practical deployment of end-to-end architectures for dense disparity regression.
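To give a flavor of the modular adaptation idea (update only one sampled sub-portion of the network per frame, favoring modules whose updates helped most), here is a heavily simplified, hypothetical sketch; the sampling and reward rules in the paper (MAD/MAD++) are different and more principled, and all names below are made up for the example.

```python
import random
import torch

def modular_adaptation_step(modules, optimizer, loss_fn, batch, scores, eps=0.1):
    """One online step: sample one block, update only it, track its reward.

    modules: list of nn.Module sub-portions of the stereo network.
    scores:  mutable list of running rewards, one per module (illustrative).
    """
    probs = torch.tensor(scores, dtype=torch.float).softmax(0).tolist()
    k = random.choices(range(len(modules)), weights=probs)[0]
    for i, m in enumerate(modules):      # freeze everything but module k
        m.requires_grad_(i == k)
    optimizer.zero_grad()
    loss = loss_fn(batch)                # e.g., right-to-left photometric loss
    loss.backward()
    optimizer.step()
    with torch.no_grad():                # reward: improvement after the update
        gain = (loss - loss_fn(batch)).item()
    scores[k] = (1 - eps) * scores[k] + eps * gain
    return loss.item()
```

The point of updating a single block per frame is that the backward pass touches only a fraction of the parameters, which is what makes online adaptation affordable in real time.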
- On the Synergies between Machine Learning and Binocular Stereo for Depth Estimation from Images: a Survey. Poggi Matteo, Tosi Fabio, Batsos Konstantinos, Mordohai Philippos, and Mattoccia Stefano. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
Stereo matching is one of the longest-standing problems in computer vision with close to 40 years of studies and research. Throughout the years the paradigm has shifted from local, pixel-level decisions to various forms of discrete and continuous optimization to data-driven, learning-based methods. Recently, the rise of machine learning and the rapid proliferation of deep learning enhanced stereo matching with new exciting trends and applications unthinkable until a few years ago. Interestingly, the relationship between these two worlds is two-way. While machine, and especially deep, learning advanced the state-of-the-art in stereo matching, stereo itself enabled new ground-breaking methodologies such as self-supervised monocular depth estimation based on deep networks. In this paper, we review recent research in the field of learning-based depth estimation from single and binocular images highlighting the synergies, the successes achieved so far and the open challenges the community is going to face in the immediate future.
- On the confidence of stereo matching in a deep-learning era: a quantitative evaluation. Poggi Matteo, Kim Seungryong, Tosi Fabio, Kim Sunok, Aleotti Filippo, Min Dongbo, Sohn Kwanghoon, and Mattoccia Stefano. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
Stereo matching is one of the most popular techniques to estimate dense depth maps by finding the disparity between matching pixels on two, synchronized and rectified images. Alongside the development of more accurate algorithms, the research community focused on finding good strategies to estimate the reliability, i.e. the confidence, of estimated disparity maps. This information proves to be a powerful cue to naively find wrong matches as well as to improve the overall effectiveness of a variety of stereo algorithms according to different strategies. In this paper, we review more than ten years of developments in the field of confidence estimation for stereo matching. We extensively discuss and evaluate existing confidence measures and their variants, from hand-crafted ones to the most recent, state-of-the-art learning based methods. We study the different behaviors of each measure when applied to a pool of different stereo algorithms and, for the first time in the literature, when paired with a state-of-the-art deep stereo network. Our experiments, carried out on five different standard datasets, provide a comprehensive overview of the field, highlighting in particular both strengths and limitations of learning-based strategies.
- Energy-Quality Scalable Monocular Depth Estimation on Low-Power CPUs. Cipolletta Antonio, Peluso Valentino, Calimera Andrea, Poggi Matteo, Tosi Fabio, Aleotti Filippo, and Mattoccia Stefano. IEEE IoT Journal (IoT-J).
The recent advancements in deep learning have demonstrated that inferring high-quality depth maps from a single image has become feasible and accurate thanks to Convolutional Neural Networks (CNNs), but how to process such compute- and memory-intensive models on portable and low-power devices remains a concern. Dynamic energy-quality scaling is an interesting yet less explored option in this field. It can improve efficiency through opportunistic computing policies where performances are boosted only when needed, achieving on average substantial energy savings. Implementing such a computing paradigm encompasses the availability of a scalable inference model, which is the target of this work. Specifically, we describe and characterize the design of an Energy-Quality scalable Pyramidal Network (EQPyD-Net), a lightweight CNN capable of modulating at run time the computational effort with minimal memory resources. We describe the architecture of the network and the optimization flow, covering the important aspects that enable the dynamic scaling, namely, the optimized training procedures, the compression stage via fixed-point quantization, and the code optimization for the deployment on commercial low-power CPUs adopted in the edge segment. To assess the effect of the proposed design knobs, we evaluated the prediction quality on the standard KITTI dataset and the energy and memory resources on the ARM Cortex-A53 CPU. The collected results demonstrate the flexibility of the proposed network and its energy efficiency. EQPyD-Net can be shifted across five operating points, ranging from a maximum accuracy of 82.2% at 0.4 Frame/J up to 92.6% energy savings at the cost of a 6.1% accuracy loss, while keeping a compact memory footprint of 5.2 MB for the weights and 38.3 MB (in the worst case) for the processing.
- Monocular Depth Perception on Microcontrollers for Edge Applications. Peluso Valentino, Cipolletta Antonio, Calimera Andrea, Poggi Matteo, Tosi Fabio, Aleotti Filippo, and Mattoccia Stefano. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT).
Depth estimation is crucial in several computer vision applications, and a recent trend in this field aims at inferring such a cue from a single camera. Unfortunately, despite the compelling results achieved, state-of-the-art monocular depth estimation methods are computationally demanding, thus precluding their practical deployment in several application contexts characterized by low-power constraints. Therefore, in this paper, we propose a lightweight Convolutional Neural Network based on a shallow pyramidal architecture, referred to as μPyD-Net, enabling monocular depth estimation on microcontrollers. The network is trained in a peculiar self-supervised manner leveraging proxy labels obtained through a traditional stereo algorithm. Moreover, we propose optimization strategies aimed at performing computations with quantized 8-bit data and map the high-level description of the network to low-level layers optimized for the target microcontroller architecture. Exhaustive experimental results on standard datasets and an in-depth evaluation with a device belonging to the popular Arm Cortex-M family confirm that obtaining sufficiently accurate monocular depth estimation on microcontrollers is feasible. To the best of our knowledge, our proposal is the first one enabling such remarkable achievement, paving the way for the deployment of monocular depth cues onto the tiny end-nodes of distributed sensor networks.
- Beyond the Baseline: 3D Reconstruction of Tiny Objects with Single Camera Stereo Robot. De Gregorio Daniele, Poggi Matteo, Zama Ramirez Pierluigi, Palli Gianluca, Mattoccia Stefano, and Di Stefano Luigi. IEEE Access.
Self-aware robots rely on depth sensing to interact with the surrounding environment, e.g. to pursue object grasping. Yet, dealing with tiny items, often occurring in industrial robotics scenarios, may represent a challenge due to lack of sensors yielding sufficiently accurate depth measurements. Existing active sensors fail at measuring details of small objects (<1cm) because of limitations in the working range, e.g. usually beyond 50 cm away, while off-the-shelf stereo cameras are not suited to close-range acquisitions due to the need for extremely short baselines. Therefore, we propose a framework designed for accurate depth sensing and particularly amenable to reconstruction of miniature objects. By leveraging a single camera mounted in eye-on-hand configuration and the high repeatability of a robot, we acquire multiple images and process them through a stereo algorithm revised to fully exploit multiple vantage points. Using a novel dataset addressing performance evaluation in industrial applications, our Single camera Stereo Robot (SiSteR) delivers high accuracy even when dealing with miniature objects. We will provide a public dataset and an open-source implementation of our proposal to foster further development in this field.
- Real-time single image depth perception in the wild with handheld devices. Aleotti Filippo, Zaccaroni Giulio, Bartolomei Luca, Poggi Matteo, Tosi Fabio, and Mattoccia Stefano. MDPI Sensors.
Depth perception is paramount to tackle real-world problems, ranging from autonomous driving to consumer applications. For the latter, depth estimation from a single image would represent the most versatile solution since a standard camera is available on almost any handheld device. Nonetheless, two main issues limit the practical deployment of monocular depth estimation methods on such devices: i) the low reliability when deployed in the wild and ii) the demanding resources needed to achieve real-time performance, often not compatible with low-power embedded systems. Therefore, in this paper, we deeply investigate all these issues showing how they are both addressable by adopting appropriate network design and training strategies. Moreover, we also outline how to map the resulting networks on handheld devices to achieve real-time performance. Our thorough evaluation highlights the ability of such fast networks to generalize well to new environments, a crucial feature required to tackle the extremely varied contexts faced in real applications. Indeed, to further support this evidence, we report experimental results concerning real-time depth-aware augmented reality and image blurring with smartphones in the wild.
- On the Deployment of Out-of-the-Box Embedded Devices for Self-Powered River Surface Flow Velocity Monitoring at the Edge. Livoroi Arsal-Hanif, Conti Andrea, Foianesi Luca, Tosi Fabio, Aleotti Filippo, Poggi Matteo, Tauro Flavia, Toth Elena, Grimaldi Salvatore, and Mattoccia Stefano. Applied Sciences.
As reported in the recent image velocimetry literature, tracking the motion of sparse feature points floating on the river surface as done by the Optical Tracking Velocimetry (OTV) algorithm is a promising strategy to address surface flow monitoring. Moreover, the lightweight nature of OTV coupled with computational optimizations makes it suited even for its deployment in situ to perform measurements at the edge with cheap embedded devices without the need to perform offload processing. Despite these notable achievements, the actual practical deployment of OTV in remote environments would require cheap and self-powered systems enabling continuous measurements without the need for cumbersome and expensive infrastructures rarely found in situ. Purposely, in this paper, we propose an additional simplification to the OTV algorithm to reduce even further its computational requirements, and we analyze self-powered off-the-shelf setups for in situ deployment. We assess the performance of such setups from different perspectives to determine the optimal solution to design a cost-effective self-powered measurement node.
- A computer vision approach based on deep learning for the detection of dairy cows in free stall barn. Tassinari Patrizia, Bovo Marco, Benni Stefano, Franzoni Simone, Poggi Matteo, Mammi Maria Ludovica Eugenia, Mattoccia Stefano, Di Stefano Luigi, Bonora Filippo, Barbaresi Alberto, Santolini Enrica, and Torreggiani Daniele. Computers and Electronics in Agriculture.
2020
- Unsupervised Domain Adaptation for Depth Prediction from Images. Tonioni Alessio, Poggi Matteo, Mattoccia Stefano, and Di Stefano Luigi. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
State-of-the-art methods to infer dense and accurate depth measurements from images rely on deep CNN models trained in an end-to-end fashion on a significant amount of data. However, despite the outstanding performance achieved, these frameworks suffer a drastic drop in accuracy when dealing with unseen environments much different, concerning appearance (e.g., synthetic vs. real) or context (e.g., indoor vs. outdoor), from those observed during the training phase. Such domain shift issue is usually softened by fine-tuning on smaller sets of images with depth labels acquired in the target domain with active sensors (e.g., LiDAR). However, relying on such supervised labeled data is seldom feasible in practical applications. Therefore, we propose an effective unsupervised domain adaptation technique enabling to overcome the domain shift problem without requiring any groundtruth label. Our method, relying on stereo pairs that are much more accessible to obtain, leverages traditional, non-learning-based stereo algorithms to produce disparity/depth labels and confidence measures to assess their degree of reliability. With these cues, we can fine-tune deep models through a novel confidence-guided loss function, neglecting the effect of outliers gathered from the output of conventional stereo algorithms.
- Learning a confidence measure in the disparity domain from O(1) features. Poggi Matteo, Tosi Fabio, and Mattoccia Stefano. Computer Vision and Image Understanding (CVIU).
Depth sensing is of paramount importance for countless applications and stereo represents a popular, effective and cheap solution for this purpose. As highlighted by recent works concerned with stereo, uncertainty estimation can be a powerful cue to improve accuracy in stereo. Most confidence measures rely on features, mainly extracted from the cost volume, fed to a random forest or a convolutional neural network trained to estimate match uncertainty. In contrast, we propose a novel strategy for confidence estimation based on features computed in the disparity domain, making our proposal suited for any stereo system including COTS devices, and in constant time. We exhaustively assess the performance of our proposals, referred to as O1 and O2, on KITTI and Middlebury datasets with three popular and different stereo algorithms (CENSUS, MC-CNN and SGM), as well as a deep stereo network (PSM-Net). We also evaluate how well confidence measures generalize to different environments/datasets.
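For intuition, the toy snippet below computes confidence cues from a disparity map alone using box filters, whose cost per pixel is independent of the window size (hence O(1) features). The specific cues and names here are illustrative assumptions and do not reproduce the O1/O2 measures defined in the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def disparity_domain_cues(disp: np.ndarray, win: int = 9):
    """Per-pixel cues from the disparity map only, in O(1) time per pixel."""
    mean = uniform_filter(disp, size=win)             # local mean disparity
    sq_mean = uniform_filter(disp * disp, size=win)   # local mean of squares
    var = np.maximum(sq_mean - mean * mean, 0.0)      # local variance
    smoothness = 1.0 / (1.0 + var)                    # flat areas -> higher score
    agreement = 1.0 / (1.0 + np.abs(disp - mean))     # consensus with neighbors
    return smoothness, agreement

disp = (np.random.rand(240, 320) * 64).astype(np.float32)  # fake disparity map
smooth, agree = disparity_domain_cues(disp)
```

In the paper such disparity-domain features are fed to a learned model that predicts the confidence; the point of the sketch is only that no cost volume is required, which is what makes the approach applicable to COTS devices.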
- Confidence Estimation for ToF and Stereo Sensors and its Application to Depth Data Fusion. Poggi Matteo, Agresti Gianluca, Tosi Fabio, Zanuttigh Pietro, and Mattoccia Stefano. IEEE Sensors Journal.
Time-of-Flight (ToF) sensors and stereo vision systems are two widely used technologies for depth estimation. Due to their rather complementary strengths and limitations, the two sensors are often combined to infer more accurate depth maps. A key research issue in this field is how to estimate the reliability of the sensed depth data. While this problem has been widely studied for stereo systems, it has been seldom considered for ToF sensors. Therefore, starting from the work done for stereo data, in this paper, we firstly introduce novel confidence estimation techniques for ToF data. Moreover, we also show how learning-based confidence metrics jointly trained on the two sensors yield better performance. Finally, deploying different fusion frameworks, we show how confidence estimation can be exploited in order to guide the fusion of depth data from the two sensors. Experimental results show how accurate confidence cues allow outperforming state-of-the-art data fusion schemes even with the simplest fusion strategies known in the literature.
- Good cues to learn from scratch a confidence measure for passive depth sensors. Poggi Matteo, Tosi Fabio, and Mattoccia Stefano. IEEE Sensors Journal.
As reported in the stereo literature, confidence estimation represents a powerful cue to detect outliers as well as to improve depth accuracy. Purposely, we proposed a strategy enabling us to achieve state-of-the-art results by learning a confidence measure in the disparity domain only with a CNN. Since this method does not require the cost volume, it is very appealing because potentially suited for any depth-sensing technologies, including, for instance, those based on deep networks. By following this intuition, in this paper, we deeply investigate the performance of confidence estimation methods, known in the literature and new ones proposed in this paper, neglecting the use of the cost volume. Specifically, we estimate from scratch confidence measures feeding deep networks with raw depth estimates and optionally images and assess their performance deploying three datasets and three stereo algorithms. We also investigate, for the first time, their performance with disparity maps inferred by deep stereo end-to-end architectures. Moreover, we move beyond the stereo matching context, estimating confidence from depth maps generated by a monocular network. Our extensive experiments with different architectures highlight that inferring confidence prediction from the raw reference disparity only, as proposed in our previous work, is not only the most versatile solution but also the most effective one in most cases.
- Enabling image-based streamflow monitoring at the edge. Tosi Fabio, Rocca Matteo, Aleotti Filippo, Poggi Matteo, Mattoccia Stefano, Tauro Flavia, Toth Elena, and Grimaldi Salvatore. MDPI Remote Sensing.
Monitoring streamflow velocity is of paramount importance for water resources management and in engineering practice. To this aim, image-based approaches have proved to be reliable systems to non-intrusively monitor water bodies in remote places at variable flow regimes. Nonetheless, to tackle their computational and energy requirements, offload processing and high-speed internet connections in the monitored environments, which are often difficult to access, are mandatory, hence limiting the effective deployment of such techniques in several relevant circumstances. In this paper, we advance and simplify streamflow velocity monitoring by directly processing the image stream in situ with a low-power embedded system. By leveraging its standard parallel processing capability and exploiting functional simplifications, we achieve an accuracy comparable to state-of-the-art algorithms that typically require expensive computing devices and infrastructures. The advantage of monitoring streamflow velocity in situ with a lightweight and cost-effective embedded processing device is threefold. First, it circumvents the need for wideband internet connections, which are expensive and impractical in remote environments. Second, it massively reduces the overall energy consumption, bandwidth and deployment cost. Third, when monitoring more than one river section, processing “at the very edge” of the system improves scalability by a large margin, compared to offload solutions based on remote or cloud processing. Therefore, enabling streamflow velocity monitoring in situ with low-cost embedded devices would foster the widespread diffusion of gauge cameras even in developing countries where appropriate infrastructure might be not available or too expensive.
Proceedings [75]
2024
- Self-Evolving Depth-Supervised 3D Gaussian Splatting from Rendered Stereo Pairs. Safadoust Sadra, Tosi Fabio, Güney Fatma, and Poggi Matteo. In British Machine Vision Conference (BMVC).
3D Gaussian Splatting (GS) significantly struggles to faithfully represent the underlying 3D scene geometry, resulting in inaccuracies and floating artifacts when rendering depth maps. In this paper, we address this limitation, undertaking a comprehensive analysis of the integration of depth priors throughout the optimization process of Gaussian primitives, and present a novel strategy for this purpose. The latter dynamically exploits depth cues from a readily available stereo network, processing virtual stereo pairs rendered by the GS model itself during training and achieving consistent self-improvement of the scene representation. Experimental results on three popular datasets, breaking ground as the first to assess depth accuracy for these models, validate our findings.
- LiDAR-Event Stereo Fusion with Hallucinations. Bartolomei Luca, Poggi Matteo, Conti Andrea, and Mattoccia Stefano. In European Conference on Computer Vision (ECCV).
Event stereo matching is an emerging technique to estimate depth from neuromorphic cameras; however, events are unlikely to trigger in the absence of motion or the presence of large, untextured regions, making the correspondence problem extremely challenging. Purposely, we propose integrating a stereo event camera with a fixed-frequency active sensor – e.g., a LiDAR – collecting sparse depth measurements, overcoming the aforementioned limitations. Such depth hints are used by hallucinating – i.e., inserting fictitious events – the stacks or raw input streams, compensating for the lack of information in the absence of brightness changes. Our techniques are general, can be adapted to any structured representation to stack events and outperform state-of-the-art fusion methods applied to event-based stereo.
- Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions. Tosi Fabio, Zama Ramirez Pierluigi, and Poggi Matteo. In European Conference on Computer Vision (ECCV).
We present a novel approach designed to address the complexities posed by challenging, out-of-distribution data in the single-image depth estimation task. By starting with images that facilitate depth prediction due to the absence of unfavorable factors, we systematically generate new, user-defined scenes with a comprehensive set of challenges and associated depth information. This is achieved by leveraging cutting-edge conditioned diffusion models with depth-aware controls, known for their ability to synthesize high-quality image content from textual prompts while preserving the coherence of the 3D structure between generated and source imagery. Subsequent fine-tuning of any monocular depth network is carried out through a self-distillation protocol that takes into account images generated using our strategy and its own depth predictions on simple, unchallenging scenes. Experiments on benchmarks tailored for our purposes demonstrate the effectiveness and versatility of our proposal.
- Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor. Conti Andrea, Poggi Matteo, Cambareri Valerio, and Mattoccia Stefano. In European Conference on Computer Vision (ECCV).
High frame rate and accurate depth estimation plays an important role in several tasks crucial to robotics and automotive perception. To date, this can be achieved through ToF and LiDAR devices for indoor and outdoor applications, respectively. However, their applicability is limited by low frame rate, energy consumption, and spatial sparsity. Depth on Demand (DoD) allows for accurate temporal and spatial depth densification achieved by exploiting a high frame rate RGB sensor coupled with a potentially lower frame rate and sparse active depth sensor. Our proposal jointly enables lower energy consumption and denser shape reconstruction, by significantly reducing the streaming requirements on the depth sensor thanks to its three core stages: i) multi-modal encoding, ii) iterative multi-modal integration, and iii) depth decoding. We present extended evidence assessing the effectiveness of DoD on indoor and outdoor video datasets, covering both environment scanning and automotive perception use cases.
- TRICKY 2024 Challenge on Monocular Depth from Images of Specular and Transparent Surfaces. Zama Ramirez Pierluigi, Costanzino Alex, Tosi Fabio, Poggi Matteo, Di Stefano Luigi, Weibel Jean-Baptiste, Bauer Dominik, Antensteiner Doris, Vincze Markus, Li Jiaqi, Huang Yachuan, Zhang Junrui, Wang Yiran, Zheng Jinghong, Shen Liao, Cao Zhiguo, Song Ziyang, Wang Zerong, Zhu Ruijie, Zhang Hao, Li Rui, Wu Jiang, Li Xian, Zhu Yu, Sun Jinqiu, Zhang Yanning, Sun Pihai, Yao Yuanqi, Zhao Wenbo, Jiang Kui, Jiang Junjun, Lavreniuk Mykola, and Wang Jui-Lin. In European Conference on Computer Vision Workshops (ECCVW).
- Exploring Few-Beam LiDAR Assistance in Self-Supervised Multi-Frame Depth Estimation. Fan Rizhao, Poggi Matteo, and Mattoccia Stefano. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
Self-supervised multi-frame depth estimation methods only require unlabeled monocular videos for training. However, most existing methods face challenges, including accuracy degradation caused by moving objects in dynamic scenes and scale ambiguity due to the absence of real-world references. In this field, the emergence of low-cost LiDAR sensors highlights the potential to improve the robustness of multi-frame depth estimation by exploiting accurate sparse measurements at the correct scale. Moreover, the LiDAR ranging points often intersect moving objects, providing more precise depth cues for them. This paper explores the impact of few-beam LiDAR data on self-supervised multi-frame depth estimation, proposing a method that fuses multi-frame matching and sparse depth features. It significantly enhances depth estimation robustness, particularly in scenarios involving moving objects and textureless backgrounds. We demonstrate the effectiveness of our approach through comprehensive experiments, showcasing its potential to address the limitations of existing methods and paving the way for more robust and reliable depth estimation based on this paradigm.
- MaskingDepth: Masked Consistency Regularization for Semi-Supervised Monocular Depth Estimation. Baek Jongbeom, Kim Gyeongnyeon, Park Seonghoon, An Honggyu, Poggi Matteo, and Kim Seungryong. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
We propose MaskingDepth, a semi-supervised learning framework for monocular depth estimation. MaskingDepth is designed to enforce consistency between the depths obtained from strongly-augmented images and the pseudo-depths derived from weakly-augmented images, which enables mitigating the reliance on large ground-truth depth quantities. In this framework, we leverage uncertainty estimation to only retain high-confident depth predictions from the weakly-augmented branch as pseudo-depths. We also present a novel data augmentation, dubbed K-way disjoint masking, that takes advantage of a naïve token masking strategy as an augmentation, while avoiding its scale ambiguity problem between depths from weakly- and strongly-augmented branches and risk of missing small-scale objects. Experiments on KITTI and NYU-Depth-v2 datasets demonstrate the effectiveness of each component, its robustness to the use of fewer depth-annotated images, and superior performance compared to other state-of-the-art semi-supervised learning methods for monocular depth estimation.
- Federated Online Adaptation for Deep Stereo. Poggi Matteo, and Tosi Fabio. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
We introduce a novel approach for adapting deep stereo networks in a collaborative manner. By building over principles of federated learning, we develop a distributed framework allowing for demanding the optimization process to a number of clients deployed in different environments. This makes it possible, for a deep stereo network running on resource-constrained devices, to capitalize on the adaptation process carried out by other instances of the same architecture, and thus improve its accuracy in challenging environments even when it cannot carry out adaptation on its own. Experimental results show how federated adaptation performs equivalently to on-device adaptation, and even better when dealing with challenging environments.
- Revisiting Depth Completion from a Stereo Matching Perspective for Cross-domain Generalization. Bartolomei Luca, Poggi Matteo, Conti Andrea, Tosi Fabio, and Mattoccia Stefano. In International Conference on 3D Vision (3DV).
This paper proposes a new framework for depth completion robust against domain-shifting issues. It exploits the generalization capability of modern stereo networks to face depth completion, by processing fictitious stereo pairs obtained through a virtual pattern projection paradigm. Any stereo network or traditional stereo matcher can be seamlessly plugged into our framework, allowing for the deployment of a virtual stereo setup that is future-proof against advancement in the stereo field. Exhaustive experiments on cross-domain generalization support our claims. Hence, we argue that our framework can help depth completion to reach new deployment scenarios.
- Range-Agnostic Multi-View Depth Estimation with Keyframe Selection. Conti Andrea, Poggi Matteo, Cambareri Valerio, and Mattoccia Stefano. In International Conference on 3D Vision (3DV).
Methods for 3D reconstruction from posed frames require prior knowledge about the scene metric range, usually to recover matching cues along the epipolar lines and narrow the search range. However, such prior might not be directly available or estimated inaccurately in real scenarios – e.g., outdoor 3D reconstruction from video sequences – therefore heavily hampering performance. In this paper, we focus on multi-view depth estimation without requiring prior knowledge about the metric range of the scene by proposing an efficient and purely 2D framework that reverses the order of the depth estimation and matching steps. Moreover, we demonstrate the capability of our framework to provide rich insights about the quality of the views used for prediction. Additional material can be found on our project page: https://andreaconti.github.io/projects/range_agnostic_multi_view_depth.
- The Third Monocular Depth Estimation Challenge. Spencer Jaime, Tosi Fabio, Poggi Matteo, Arora Ripudaman Singh, Russell Chris, Hadfield Simon, Bowden Richard, Zhou GuangYuan, Li ZhengXin, Rao Qiang, Bao YiPing, Liu Xiao, Kim Dohyeong, Kim Jinseong, Kim Myunghyun, Lavreniuk Mykola, Li Rui, Mao Qing, Wu Jiang, Zhu Yu, Sun Jinqiu, Zhang Yanning, Patni Suraj, Agarwal Aradhye, Arora Chetan, Sun Pihai, Jiang Kui, Wu Gang, Liu Jian, Liu Xianming, Jiang Junjun, Zhang Xidan, Wei Jianing, Wang Fangjun, Tan Zhiming, Wang Jiabao, Luginov Albert, Shahzad Muhammad, Hosseini Seyed, Trajcevski Aleksander, and Elder James H. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, featuring complex scenes in natural and indoor settings. As with the previous edition, methods can use any form of supervision, i.e. supervised or self-supervised. The challenge received a total of 19 submissions outperforming the baseline on the test set: 10 among them submitted a report describing their approach, highlighting the widespread use of foundation models such as Depth Anything at the core of their method. The challenge winners drastically improved 3D F-Score performance, from 17.51% to 23.72%.
- NTIRE 2024 Challenge on HR Depth from Images of Specular and Transparent Surfaces. Zama Ramirez Pierluigi, Tosi Fabio, Di Stefano Luigi, Timofte Radu, Costanzino Alex, Poggi Matteo, and others. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
This paper reports on the NTIRE 2024 challenge on HR Depth From images of Specular and Transparent surfaces, held in conjunction with the New Trends in Image Restoration and Enhancement (NTIRE) workshop at CVPR 2024. This challenge aims to advance the research on depth estimation, specifically to address two of the main open issues in the field: high-resolution and non-Lambertian surfaces. The challenge proposes two tracks on stereo and single-image depth estimation, attracting about 120 registered participants. In the final testing stage, 2 and 8 participating teams submitted their models and fact sheets for the two tracks.
2023
- To Adapt or Not to Adapt? Real-Time Adaptation for Semantic Segmentation. Botet Colomer Marc, Dovesi Pier Luigi, Panagiotakopoulos Theodoros, Carvalho Joao Frederico, Härenstam-Nielsen Linus, Azizpour Hossein, Kjellström Hedvig, Cremers Daniel, and Poggi Matteo. In IEEE International Conference on Computer Vision (ICCV).
The goal of Online Domain Adaptation for semantic segmentation is to handle unforeseeable domain changes that occur during deployment, like sudden weather events. However, the high computational costs associated with brute-force adaptation make this paradigm unfeasible for real-world applications. In this paper, we propose HAMLET, a Hardware-Aware Modular Least Expensive Training framework for real-time domain adaptation. Our approach includes a hardware-aware back-propagation orchestration agent (HAMT) and a dedicated domain-shift detector that enables active control over when and how the model is adapted (LT). Thanks to these advancements, our approach is capable of performing semantic segmentation while simultaneously adapting at more than 29 FPS on a single consumer-grade GPU. Our framework's encouraging accuracy and speed trade-off is demonstrated on the OnDA and SHIFT benchmarks through experimental results.
- Learning Depth Estimation for Transparent and Mirror Surfaces. Costanzino Alex, Zama Ramirez Pierluigi, Poggi Matteo, Tosi Fabio, Mattoccia Stefano, and Di Stefano Luigi. In IEEE International Conference on Computer Vision (ICCV).
Inferring the depth of transparent or mirror (ToM) surfaces represents a hard challenge for either sensors, algorithms, or deep networks. We propose a simple pipeline for learning to estimate depth properly for such surfaces with neural networks, without requiring any ground-truth annotation. We unveil how to obtain reliable pseudo labels by in-painting ToM objects in images and processing them with a monocular depth estimation model. These labels can be used to fine-tune existing monocular or stereo networks, to let them learn how to deal with ToM surfaces. Experimental results on the Booster dataset show the dramatic improvements enabled by our remarkably simple proposal.
- Active Stereo Without Pattern Projector. Bartolomei Luca, Poggi Matteo, Tosi Fabio, Conti Andrea, and Mattoccia Stefano. In IEEE International Conference on Computer Vision (ICCV).
This paper proposes a novel framework integrating the principles of active stereo in standard passive camera systems without a physical pattern projector. We virtually project a pattern over the left and right images according to the sparse measurements obtained from a depth sensor. Any such devices can be seamlessly plugged into our framework, allowing for the deployment of a virtual active stereo setup in any possible environment, overcoming the limitation of pattern projectors, such as limited working range or environmental conditions. Experiments on indoor/outdoor datasets, featuring both long and close-range, support the seamless effectiveness of our approach, boosting the accuracy of both stereo algorithms and deep networks.
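The core intuition (project a virtual pattern consistently onto both rectified images at the disparities implied by the sparse depth measurements) can be sketched as follows. The function name, the random-patch pattern, and the lack of occlusion handling are simplifying assumptions of this example, not the paper's actual design.

```python
import numpy as np

def virtually_project(left, right, sparse_xyz, f, B, patch=3, seed=0):
    """Paint a shared random patch at corresponding pixels of a rectified
    grayscale stereo pair, one patch per sparse depth sample (x, y, z)."""
    rng = np.random.default_rng(seed)
    h, w = left.shape
    r = patch // 2
    for x, y, z in sparse_xyz:
        x, y = int(round(x)), int(round(y))
        d = f * B / z                    # depth (m) -> disparity (px)
        xr = int(round(x - d))           # matching column in the right view
        if not (r <= x < w - r and r <= xr < w - r and r <= y < h - r):
            continue                     # patch would fall outside an image
        pat = rng.integers(0, 256, (patch, patch), dtype=np.uint8)
        left[y - r:y + r + 1, x - r:x + r + 1] = pat
        right[y - r:y + r + 1, xr - r:xr + r + 1] = pat
    return left, right
```

Painting the same patch at (x, y) in the left image and (x - d, y) in the right one gives untextured regions a distinctive correspondence cue that any stereo matcher can exploit, which is why no physical projector is needed.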
- GO-SLAM: Global Optimization for Consistent 3D Instant Reconstruction. Zhang Youmin, Tosi Fabio, Mattoccia Stefano, and Poggi Matteo. In IEEE International Conference on Computer Vision (ICCV).
Neural implicit representations have recently demonstrated compelling results on dense Simultaneous Localization And Mapping (SLAM) but suffer from the accumulation of errors in camera tracking and distortion in the reconstruction. Purposely, we present GO-SLAM, a deep-learning-based dense visual SLAM framework globally optimizing poses and 3D reconstruction in real-time. Robust pose estimation is at its core, supported by efficient loop closing and online full bundle adjustment, which optimize per frame by utilizing the learned global geometry of the complete history of input frames. Simultaneously, we update the implicit and continuous surface representation on-the-fly to ensure global consistency of 3D reconstruction. Results on various synthetic and real-world datasets demonstrate that GO-SLAM outperforms state-of-the-art approaches at tracking robustness and reconstruction accuracy. Furthermore, GO-SLAM is versatile and can run with monocular, stereo, and RGB-D input.
- GasMono: Geometry-Aided Self-Supervised Monocular Depth Estimation for Indoor Scenes. Zhao Chaoqiang, Poggi Matteo, Tosi Fabio, Zhou Lei, Sun Qiyu, Tang Yang, and Mattoccia Stefano. In IEEE International Conference on Computer Vision (ICCV).
This paper tackles the challenges of self-supervised monocular depth estimation in indoor scenes caused by large rotation between frames and low texture. We ease the learning process by obtaining coarse camera poses from monocular sequences through multi-view geometry to deal with the former. However, we found that limited by the scale ambiguity across different scenes in the training dataset, a naïve introduction of geometric coarse poses cannot play a positive role in performance improvement, which is counter-intuitive. To address this problem, we propose to refine those poses during training through rotation and translation/scale optimization. To soften the effect of the low texture, we combine the global reasoning of vision transformers with an overfitting-aware, iterative self-distillation mechanism, providing more accurate depth guidance coming from the network itself. Experiments on NYUv2, ScanNet, 7scenes, and KITTI datasets support the effectiveness of each component in our framework, which sets a new state-of-the-art for indoor self-supervised monocular depth estimation, as well as outstanding generalization ability. Code and models are available at https://github.com/zxcqlf/GasMono
- On-Site Adaptation for Monocular Depth Estimation with a Static Camera. Li Huan, Poggi Matteo, Tosi Fabio, and Mattoccia Stefano. In British Machine Vision Conference (BMVC).
We introduce a novel technique for easing the deployment of an off-the-shelf monocular depth estimation network in unseen environments. Specifically, we target a widespread setting with a fixed camera mounted high above the ground to monitor an environment, and highlight the limitations of state-of-the-art monocular networks deployed in such a setup. Purposely, we develop an on-site adaptation technique capable of 1) improving the accuracy of estimated depth maps in the presence of moving subjects, such as pedestrians, cars, and others; 2) refining the overall structure of the predicted depth map, to make it more consistent with the real 3D structure of the scene; 3) recovering absolute metric depth, usually lost by state-of-the-art solutions. Experiments on synthetic and real datasets confirm the effectiveness of our proposal.
- Lightweight Self-Supervised Depth Estimation with few-beams LiDAR Data. Fan Rizhao, Tosi Fabio, Poggi Matteo, and Mattoccia Stefano. In British Machine Vision Conference (BMVC).
This paper proposes a lightweight yet effective self-supervised depth completion network trained on monocular videos and sparse raw LiDAR measurements only. Specifically, we utilize a multi-stage network architecture, which depends on cheap CNN layers. We introduce a novel guided sparse convolution operator combining sparse and dense data to extract depth features. To mitigate the impact of outliers commonly present in the sparse raw LiDAR data, we adopt a distance-dependent outlier mask that incorporates an elastic threshold mechanism to selectively discard such points. Our experimental results on the KITTI dataset show the favorable trade-off between accuracy and efficiency achieved by our model, reaching state-of-the-art performance on self-supervised depth estimation from few-beams LiDAR (4-beams), depth completion (64-beams) and a few hundred depth points, using a fraction of the parameters. Our code will be available at https://github.com/franky-ciomp/GSCNN/.
- TemporalStereo: Efficient Spatial-Temporal Stereo Matching Network. Zhang Youmin, Poggi Matteo, and Mattoccia Stefano. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
We present TemporalStereo, a coarse-to-fine stereo matching network that is highly efficient, and able to effectively exploit the past geometry and context information to boost matching accuracy. Our network leverages sparse cost volume and proves to be effective when a single stereo pair is given. However, its peculiar ability to use spatio-temporal information across stereo sequences allows TemporalStereo to alleviate problems such as occlusions and reflective regions while enjoying high efficiency also in this latter case. Notably, our model – trained once with stereo videos – can run in both single-pair and temporal modes seamlessly. Experiments show that our network relying on camera motion is robust even to dynamic objects when running on videos. We validate TemporalStereo through extensive experiments on synthetic (SceneFlow, TartanAir) and real (KITTI 2012, KITTI 2015) datasets. Our model achieves state-of-the-art performance on any of these datasets.
- Depth self-supervision for single image novel view synthesis. Minelli Giovanni, Poggi Matteo, and Salti Samuele. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
In this paper, we tackle the problem of generating a novel image from an arbitrary viewpoint given a single frame as input. While existing methods operating in this setup aim at predicting the target view depth map to guide the synthesis, without explicit supervision over such a task, we jointly optimize our framework for both novel view synthesis and depth estimation to unleash the synergy between the two at its best. Specifically, a shared depth decoder is trained in a self-supervised manner to predict depth maps that are consistent across the source and target views. Our results demonstrate the effectiveness of our approach in addressing the challenges of both tasks allowing for higher-quality generated images, as well as more accurate depth for the target viewpoint.
- NeRF-Supervised Deep Stereo. Tosi Fabio, Tonioni Alessio, De Gregorio Daniele, and Poggi Matteo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
We introduce a novel framework for training deep stereo networks effortlessly and without any ground-truth. By leveraging state-of-the-art neural rendering solutions, we generate stereo training data from image sequences collected with a single handheld camera. On top of them, a NeRF-supervised training procedure is carried out, from which we exploit rendered stereo triplets to compensate for occlusions and depth maps as proxy labels. This results in stereo networks capable of predicting sharp and detailed disparity maps. Experimental results show that models trained under this regime yield a 30-40% improvement over existing self-supervised methods on the challenging Middlebury dataset, filling the gap to supervised models and, most times, outperforming them at zero-shot generalization.
- CompletionFormer: Depth Completion with Convolutions and Vision Transformers. Zhang Youmin, Guo Xianda, Poggi Matteo, Zhu Zheng, Huang Guan, and Mattoccia Stefano. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Given sparse depths and the corresponding RGB images, depth completion aims at spatially propagating the sparse measurements throughout the whole image to get a dense depth prediction. Despite the tremendous progress of deep-learning-based depth completion methods, the locality of the convolutional layer or graph model makes it hard for the network to model the long-range relationship between pixels. While recent fully Transformer-based architecture has reported encouraging results with the global receptive field, the performance and efficiency gaps to the well-developed CNN models still exist because of its deteriorative local feature details. This paper proposes a Joint Convolutional Attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure. This hybrid architecture naturally benefits both the local connectivity of convolutions and the global context of the Transformer in one single model. As a result, our CompletionFormer outperforms state-of-the-art CNNs-based methods on the outdoor KITTI Depth Completion benchmark and indoor NYUv2 dataset, achieving significantly higher efficiency (nearly 1/3 FLOPs) compared to pure Transformer-based methods. Code is available at https://github.com/youmi-zym/CompletionFormer.
-
Contrastive Learning for Depth Prediction Fan Rizhao, Poggi Matteo, and Mattoccia Stefano In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) [Abs]
Depth prediction is at the core of several computer vision applications, such as autonomous driving and robotics. It is often formulated as a regression task in which depth values are estimated through network layers. Unfortunately, the distribution of depth maps is seldom explored. Therefore, this paper proposes a novel framework combining contrastive learning and depth prediction, allowing us to pay more attention to the depth distribution and consequently enabling improvements to the overall estimation process. Purposely, we propose a window-based contrastive learning approach, which partitions the feature maps into non-overlapping windows and constructs a contrastive loss within each one. Forming and sorting positive and negative pairs, then enlarging the gap between the two in the representation space, constrains the depth distribution to fit the features of the depth map. Experiments on the KITTI and NYU datasets demonstrate the effectiveness of our framework.
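A minimal sketch of a window-based contrastive loss in PyTorch, assuming per-pixel features and a depth map whose spatial size is divisible by the window size; the pair-selection rule (thresholding pairwise depth gaps) and the margin are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def window_contrastive_loss(feat, depth, win=8, thresh=0.05, margin=0.2):
    # feat: (B,C,H,W) features, depth: (B,1,H,W) normalized depth;
    # H and W are assumed divisible by `win`.
    c = feat.shape[1]
    f = feat.unfold(2, win, win).unfold(3, win, win)            # B,C,nh,nw,win,win
    f = f.permute(0, 2, 3, 4, 5, 1).reshape(-1, win * win, c)   # one row per window
    d = depth.unfold(2, win, win).unfold(3, win, win).reshape(-1, win * win)
    f = F.normalize(f, dim=-1)
    sim = f @ f.transpose(1, 2)                                 # cosine similarities
    pos = ((d.unsqueeze(2) - d.unsqueeze(1)).abs() < thresh).float()
    # pull together pairs with similar depth, push apart the others
    return (pos * (1 - sim) + (1 - pos) * F.relu(sim - margin)).mean()

# loss = window_contrastive_loss(torch.randn(2, 32, 64, 64), torch.rand(2, 1, 64, 64))
```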
-
NTIRE 2023 Challenge on HR Depth From Images of Specular and Transparent Surfaces Ramirez Pierluigi Zama, Tosi Fabio, Di Stefano Luigi, Timofte Radu, Costanzino Alex, Poggi Matteo, Salti Samuele, Mattoccia Stefano, Shi Jun, Zhang Dafeng, A Yong, Jin Yixiang, Li Dingzhe, Li Chao, Liu Zhiwen, Zhang Qi, Wang Yixing, and Yin Shi In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (CVPRW) [Abs]
This paper reports on the NTIRE 2023 challenge on HR Depth From Images of Specular and Transparent Surfaces, held in conjunction with the New Trends in Image Restoration and Enhancement (NTIRE) workshop at CVPR 2023. The challenge was held to boost research on depth estimation, mainly to deal with two of the open issues in the field: high-resolution images and the non-Lambertian surfaces characterizing specular and transparent materials. The challenge is divided into two tracks: a stereo track focusing on disparity estimation from rectified pairs and a mono track dealing with single-image depth estimation. The challenge attracted about 100 registered participants across the two tracks. In the final testing stage, 5 participating teams submitted their models and fact sheets, 2 and 3 for the Stereo and Mono tracks, respectively.
-
The Second Monocular Depth Estimation Challenge Spencer Jaime, Qian C. Stella, Trescakova Michaela, Russell Chris, Hadfield Simon, Graf Erich W., Adams Wendy J., Schofield Andrew J., Elder James, Bowden Richard, Anwar Ali, Chen Hao, Chen Xiaozhi, Cheng Kai, Dai Yuchao, Hoa Huynh Thai, Hossain Sadat, Huang Jianmian, Jing Mohan, Li Bo, Li Chao, Li Baojun, Liu Zhiwen, Mattoccia Stefano, Mercelis Siegfried, Nam Myungwoo, Poggi Matteo, Qi Xiaohua, Ren Jiahui, Tang Yang, Tosi Fabio, Trinh Linh, Uddin S. M. Nadim, Umair Khan Muhammad, Wang Kaixuan, Wang Yufei, Wang Yixing, Xiang Mochu, Xu Guangkai, Yin Wei, Yu Jun, Zhang Qi, and Zhao Chaoqiang In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) [Abs]
This paper discusses the results for the second edition of the Monocular Depth Estimation Challenge (MDEC). This edition was open to methods using any form of supervision, including fully-supervised, self-supervised, multi-task or proxy depth. The challenge was based around the SYNS-Patches dataset, which features a wide diversity of environments with high-quality dense ground-truth. This includes complex natural environments, e.g. forests or fields, which are greatly underrepresented in current benchmarks. The challenge received eight unique submissions that outperformed the provided SotA baseline on any of the pointcloud- or image-based metrics. The top supervised submission improved relative F-Score by 27.62%, while the top self-supervised improved it by 16.61%. Supervised submissions generally leveraged large collections of datasets to improve data diversity. Self-supervised submissions instead updated the network architecture and pretrained backbones. These results represent significant progress in the field, while highlighting avenues for future research, such as reducing interpolation artifacts at depth boundaries, improving self-supervised indoor performance and overall natural image accuracy.
-
The Monocular Depth Estimation Challenge Spencer Jaime, Qian C. Stella, Russell Chris, Hadfield Simon, Graf Erich, Adams Wendy, Schofield Andrew J., Elder James H., Bowden Richard, Cong Heng, Mattoccia Stefano, Poggi Matteo, Suri Zeeshan Khan, Tang Yang, Tosi Fabio, Wang Hao, Zhang Youmin, Zhang Yusheng, and Zhao Chaoqiang In Winter Conference on Applications of Computer Vision Workshops (WACVW) [Abs]
This paper summarizes the results of the first Monocular Depth Estimation Challenge (MDEC) organized at WACV2023. This challenge evaluated the progress of self-supervised monocular depth estimation on the challenging SYNS-Patches dataset. The challenge was organized on CodaLab and received 6 submissions over the course of 40 days. Participants were provided a devkit containing updated reference implementations for 16 State-of-the-Art algorithms and 4 novel techniques. The threshold for acceptance for novel techniques was to outperform every one of the 16 SotA baselines. All participants outperformed the baseline in traditional metrics, such as MAE or AbsRel. However, pointcloud reconstruction metrics were challenging to improve upon. We found predictions were characterized by interpolation artefacts at object boundaries and errors in relative object positioning. We hope this challenge is a valuable contribution to the community and encourage authors to participate in future editions.
-
ScanNeRF: a Scalable Benchmark for Neural Radiance Fields De Luigi Luca, Bolognini Damiano, Domeniconi Federico, De Gregorio Daniele, Poggi Matteo, and Di Stefano Luigi In Winter Conference on Applications of Computer Vision (WACV) [Abs]
In this paper, we propose the first-ever real benchmark conceived for evaluating Neural Radiance Fields (NeRFs) and, in general, Neural Rendering (NR) frameworks. We design and implement an effective pipeline for scanning real objects in quantity and effortlessly. Our scan station is built on a hardware budget of less than $500 and can collect roughly 4000 images of a scanned object in just 5 minutes. Such a platform is used to build ScanNeRF, a dataset characterized by several train/val/test splits aimed at benchmarking the performance of modern NeRF methods under different conditions. Accordingly, we evaluate three cutting-edge NeRF variants on it to highlight their strengths and weaknesses. The dataset is available on our project page, together with an online benchmark to foster the development of better and better NeRFs.
-
Sparsity Agnostic Depth Completion Conti Andrea, Poggi Matteo, and Mattoccia Stefano In Winter Conference on Applications of Computer Vision (WACV) [Abs]
We present a novel depth completion approach agnostic to the sparsity of depth points. State-of-the-art approaches yield accurate results only when processing a specific density and distribution of input points, i.e. the one observed during training, limiting their deployment in real use cases. In contrast, our solution is robust to uneven distributions and extremely low densities never witnessed during training. Experimental results on standard indoor and outdoor benchmarks highlight the robustness of our framework, achieving accuracy comparable to state-of-the-art methods when tested with density and distribution equal to the training one, while being much more accurate in the other cases.
2022
-
A Cascade Dense Connection Fusion Network for Depth Completion Fan Rizhao, Li Zhigen, Poggi Matteo, and Mattoccia Stefano In British Machine Vision Conference (BMVC) [Abs]
This paper proposes a lightweight yet effective network architecture for depth completion. It fuses multi-modal and multi-level features through a Cascade Dense Connection Fusion Network, implemented by means of a dense connection fusion block, multi-scale features, and a modality-aware aggregation mechanism. Our model is evaluated on the KITTI benchmark and achieves competitive results compared with the state of the art while requiring far fewer parameters.
-
Cross-Spectral Neural Radiance Fields Poggi Matteo, Zama Ramirez Pierluigi, Tosi Fabio, Salti Samuele, Di Stefano Luigi, and Mattoccia Stefano In International Conference on 3D Vision (3DV) [Abs]
We propose X-NeRF, a novel method to learn a Cross-Spectral scene representation given images captured from cameras with different light spectrum sensitivity, based on the Neural Radiance Fields formulation. X-NeRF optimizes camera poses across spectra during training and exploits Normalized Cross-Device Coordinates (NXDC) to render images of different modalities from arbitrary viewpoints, which are aligned and at the same resolution. Experiments on 16 forward-facing scenes, featuring color, multi-spectral and infrared images, confirm the effectiveness of X-NeRF at modeling Cross-Spectral scene representations.
-
MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer Zhao Chaoqiang, Zhang Youmin, Poggi Matteo, Tosi Fabio, Guo Xianda, Zhu Zheng, Huang Guan, Tang Yang, and Mattoccia Stefano In International Conference on 3D Vision (3DV) [Abs]
Self-supervised monocular depth estimation is an attractive solution that does not require hard-to-source depth labels for training. Convolutional neural networks (CNNs) have recently achieved great success in this task. However, their limited receptive field constrains existing network architectures to reason only locally, dampening the effectiveness of the self-supervised paradigm. In the light of the recent successes achieved by Vision Transformers (ViTs), we propose MonoViT, a brand-new framework combining the global reasoning enabled by ViT models with the flexibility of self-supervised monocular depth estimation. By combining plain convolutions with Transformer blocks, our model can reason locally and globally, yielding depth prediction at a higher level of detail and accuracy, allowing MonoViT to achieve state-of-the-art performance on the established KITTI dataset. Moreover, MonoViT proves its superior generalization capacities on other datasets such as Make3D and DrivingStereo.
-
Online Domain Adaptation for Semantic Segmentation in Ever-Changing Conditions Panagiotakopoulos Theodoros, Dovesi Pier Luigi, Härenstam-Nielsen Linus, and Poggi Matteo In European Conference on Computer Vision (ECCV) [Abs]
Unsupervised domain adaptation (UDA) aims at reducing the domain gap between training and testing data and is, in most cases, carried out in an offline manner. However, domain changes may occur continuously and unpredictably during deployment (e.g. sudden weather changes). In such conditions, deep neural networks witness dramatic drops in accuracy, and offline adaptation may not be enough to counteract them. In this paper, we tackle online UDA (OUDA) for semantic segmentation. We design a pipeline that is robust to continuous domain shifts, either gradual or sudden, and we evaluate it in the case of rainy and foggy scenarios. Our experiments show that our framework can effectively adapt to new domains during deployment, while not being affected by catastrophic forgetting of the previous domains.
-
Multi-View Guided Multi-View Stereo Poggi Matteo, Conti Andrea, and Mattoccia Stefano In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) [Abs]
This paper introduces a novel deep framework for dense 3D reconstruction from multiple image frames, leveraging a sparse set of depth measurements gathered jointly with image acquisition. Given a deep multi-view stereo network, our framework uses sparse depth hints to guide the neural network by modulating the plane-sweep cost volume built during the forward step, enabling it to infer consistently more accurate depth maps. Moreover, since multiple viewpoints can provide additional depth measurements, we propose a multi-view guidance strategy that increases the density of the sparse points used to guide the network, thus leading to even more accurate results. We evaluate our Multi-View Guided framework within a variety of state-of-the-art deep multi-view stereo networks, demonstrating its effectiveness at improving the results achieved by each of them on the BlendedMVG and DTU datasets.
-
Unsupervised confidence for LiDAR depth maps and applications Conti Andrea, Poggi Matteo, Aleotti Filippo, and Mattoccia Stefano In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) [Abs]
Depth perception is pivotal in many fields, such as robotics and autonomous driving, to name a few. Consequently, depth sensors such as LiDARs have rapidly spread across many applications. The 3D point clouds generated by these sensors must often be coupled with an RGB camera to understand the framed scene semantically. Usually, the former is projected over the camera image plane, leading to a sparse depth map. Unfortunately, this process, coupled with the intrinsic issues affecting all depth sensors, yields noise and gross outliers in the final output. Purposely, in this paper, we propose an effective unsupervised framework aimed at explicitly addressing this issue by learning to estimate the confidence of the LiDAR sparse depth map, thus allowing for filtering out the outliers. Experimental results on the KITTI dataset highlight that our framework excels at this purpose. Moreover, we demonstrate how this achievement can improve a wide range of tasks.
-
Open Challenges in Deep Stereo: the Booster Dataset Zama Ramirez Pierluigi, Tosi Fabio, Poggi Matteo, Salti Samuele, Mattoccia Stefano, and Di Stefano Luigi In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) [Abs]
We present a novel high-resolution and challenging stereo dataset framing indoor scenes annotated with dense and accurate ground-truth disparities. Peculiar to our dataset is the presence of several specular and transparent surfaces, i.e. the main causes of failures for state-of-the-art stereo networks. Our acquisition pipeline leverages a novel deep space-time stereo framework which allows for easy and accurate labeling with sub-pixel precision. We release a total of 419 samples collected in 64 different scenes and annotated with dense ground-truth disparities. Each sample includes a high-resolution pair (12 Mpx) as well as an unbalanced pair (Left: 12 Mpx, Right: 1.1 Mpx). Additionally, we provide manually annotated material segmentation masks and 15K unlabeled samples. We evaluate state-of-the-art deep networks based on our dataset, highlighting their limitations in addressing the open challenges in stereo and drawing hints for future research.
-
RGB-Multispectral Matching: Dataset, Learning Methodology, Evaluation Tosi Fabio, Zama Ramirez Pierluigi, Poggi Matteo, Salti Samuele, Mattoccia Stefano, and Di Stefano Luigi In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) [Abs]
We address the problem of registering synchronized color (RGB) and multi-spectral (MS) images featuring very different resolutions by solving stereo matching correspondences. Purposely, we introduce a novel RGB-MS dataset framing 13 different scenes in indoor environments and providing a total of 34 image pairs annotated with semi-dense, high-resolution ground-truth labels in the form of disparity maps. To tackle the task, we propose a deep learning architecture trained in a self-supervised manner by exploiting a further RGB camera, required only during training data acquisition. In this setup, we can conveniently learn cross-modal matching in the absence of ground-truth labels by distilling knowledge from an easier RGB-RGB matching task based on a collection of about 11K unlabeled image triplets. Experiments show that the proposed pipeline sets a good performance bar (1.16 pixels average registration error) for future research on this novel, challenging task.
-
Meta-confidence estimation for stereo matching Kim Seungryong, Poggi Matteo, Kim Sunok, Sohn Kwanghoon, and Mattoccia Stefano In International Conference on Robotics and Automation (ICRA) [Abs]
We propose a novel framework to estimate the confidence of a disparity map taking into account, for the first time, the uncertainty affecting the confidence estimation process itself. Conversely to other tasks such as disparity estimation, the uncertainty of confidence directly suggests a correction: a confidence that is initially low, but highly uncertain, should be increased, and decreased otherwise. By modelling such a cue in the form of a second-level confidence, or meta-confidence, our solution allows for finding incorrect predictions inferred by the confidence estimator and for learning a correction for them. Our strategy is suited for any state-of-the-art method known in the literature, whether implemented using random forest classifiers or deep neural networks. In particular, for models based on deep neural networks, we present a multi-headed confidence estimator followed by an uncertainty network, so as to predict mean confidence and meta-confidence within a single network without the cost of lower accuracy, a known limitation in the literature for uncertainty estimation. Experimental results on a variety of stereo algorithms and confidence estimation models prove that the modeled meta-confidence is indicative of the reliability of the estimated confidence and allows for refining it.
2021
-
Neural Disparity Refinement for Arbitrary Resolution Stereo Aleotti Filippo, Tosi Fabio, Zama Ramirez Pierluigi, Poggi Matteo, Salti Samuele, Di Stefano Luigi, and Mattoccia Stefano In International Conference on 3D Vision (3DV, Oral, Best Paper Honorable Mention) [Abs]
We introduce a novel architecture for neural disparity refinement aimed at facilitating deployment of 3D computer vision on cheap and widespread consumer devices, such as mobile phones. Our approach relies on a continuous formulation that enables estimating a refined disparity map at any arbitrary output resolution. Thereby, it can effectively handle the unbalanced camera setup typical of nowadays mobile phones, which feature both high- and low-resolution RGB sensors within the same device. Moreover, our neural network can seamlessly process the output of a variety of stereo methods and, by refining the disparity maps computed by a traditional matching algorithm like SGM, it can achieve unrivaled zero-shot generalization performance compared to state-of-the-art end-to-end stereo models.
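The continuous formulation can be pictured as bilinearly sampling latent features at arbitrary (x, y) locations and decoding each sample with an MLP, so the output grid can take any resolution. The PyTorch sketch below, with illustrative layer sizes, conveys this idea rather than the actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousRefiner(nn.Module):
    # Minimal sketch of a continuous disparity decoder: low-resolution
    # features are sampled at arbitrary coordinates and decoded pointwise.
    def __init__(self, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        self.mlp = nn.Sequential(nn.Linear(feat_dim + 2, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, rgb, disp, out_h, out_w):
        feat = self.encoder(torch.cat((rgb, disp), dim=1))   # image + raw disparity
        b = rgb.shape[0]
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, out_h, device=rgb.device),
            torch.linspace(-1, 1, out_w, device=rgb.device), indexing="ij")
        grid = torch.stack((xs, ys), -1).unsqueeze(0).expand(b, -1, -1, -1)
        sampled = F.grid_sample(feat, grid, align_corners=True)  # B,C,out_h,out_w
        coords = grid.permute(0, 3, 1, 2)                         # B,2,out_h,out_w
        pts = torch.cat((sampled, coords), 1).flatten(2).transpose(1, 2)
        return self.mlp(pts).transpose(1, 2).view(b, 1, out_h, out_w)

# refined = ContinuousRefiner()(torch.rand(1, 3, 120, 160),
#                               torch.rand(1, 1, 120, 160), 480, 640)
```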
-
Sensor-Guided Optical Flow Poggi Matteo, Aleotti Filippo, and Mattoccia Stefano In IEEE International Conference on Computer Vision (ICCV) [Abs]
This paper proposes a framework to guide an optical flow network with external cues to achieve superior accuracy on known and unseen domains alike. Given the availability of sparse yet accurate optical flow hints from an external source, these are injected to modulate the correlation scores computed by a state-of-the-art optical flow network and guide it towards more accurate predictions. Although no real sensor can provide sparse flow hints, we show how these can be obtained by combining depth measurements from active sensors with geometry and hand-crafted optical flow algorithms, leading to accurate enough hints for our purpose. Experimental results with a state-of-the-art flow network on standard benchmarks support the effectiveness of our framework, both in simulated and real conditions.
-
Learning optical flow from still images Aleotti Filippo, Poggi Matteo, and Mattoccia Stefano In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) [Abs]
This paper deals with the scarcity of data for training optical flow networks, highlighting the limitations of existing sources such as labeled synthetic datasets or unlabeled real videos. Specifically, we introduce a framework to generate accurate ground-truth optical flow annotations quickly and in large amounts from any readily available single real picture. Given an image, we use an off-the-shelf monocular depth estimation network to build a plausible point cloud for the observed scene. Then, we virtually move the camera in the reconstructed environment with known motion vectors and rotation angles, allowing us to synthesize both a novel view and the corresponding optical flow field connecting each of its pixels to the ones in the actual input image. When trained with our data, state-of-the-art optical flow networks achieve superior generalization to unseen real data compared to the same models trained either on annotated synthetic datasets or unlabeled videos, and better specialization if combined with synthetic images.
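The core geometric step, deriving the flow field that a virtual camera motion induces on a single image given its monocular depth, fits in a few lines of NumPy under a simple pinhole model; names and values are illustrative.

```python
import numpy as np

def flow_from_depth(depth, K, R, t):
    # depth: (H,W) metric depth, K: 3x3 intrinsics, (R, t): virtual camera motion.
    h, w = depth.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack((xs, ys, np.ones_like(xs)), 0).reshape(3, -1).astype(np.float64)
    rays = np.linalg.inv(K) @ pix                    # back-project pixels to rays
    pts = rays * depth.reshape(1, -1)                # 3D points in camera frame
    proj = K @ (R @ pts + t.reshape(3, 1))           # move the camera, re-project
    proj = proj[:2] / proj[2:3]
    return (proj - pix[:2]).reshape(2, h, w)         # per-pixel displacement field

# Example: small forward translation of a virtual camera.
# K = np.array([[500., 0, 320], [0, 500., 240], [0, 0, 1]])
# flow = flow_from_depth(np.full((480, 640), 10.0), K, np.eye(3), np.array([0, 0, 0.1]))
```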
2020
-
Matching-space Stereo Networks for Cross-domain Generalization Cai Changjiang, Poggi Matteo, Mattoccia Stefano, and Mordohai Philippos In International Conference on 3D Vision (3DV) [Abs]
End-to-end deep networks represent the state of the art for stereo matching. While excelling on images framing environments similar to the training set, major drops in accuracy occur in unseen domains (e.g., when moving from synthetic to real scenes). In this paper we introduce a novel family of architectures, namely Matching-Space Networks (MS-Nets), with improved generalization properties. By replacing learning-based feature extraction from image RGB values with matching functions and confidence measures from conventional wisdom, we move the learning process from the color space to the Matching Space, avoiding over-specialization to domain-specific features. Extensive experimental results on four real datasets highlight that our proposal leads to superior generalization to unseen environments over conventional deep architectures, keeping accuracy on the source domain almost unaltered.
-
Self-adapting confidence estimation for stereo Poggi Matteo, Aleotti Filippo, Tosi Fabio, Zaccaroni Giulio, and Mattoccia Stefano In European Conference on Computer Vision (ECCV) [Abs]
Estimating the confidence of disparity maps inferred by a stereo algorithm has become a very relevant task over the years, due to the increasing number of applications leveraging such a cue. Although self-supervised learning has recently spread across many computer vision tasks, it has been barely considered in the field of confidence estimation. In this paper, we propose a flexible and lightweight solution enabling self-adapting confidence estimation agnostic to the stereo algorithm or network. Our approach relies on the minimum information available in any stereo setup (i.e., the input stereo pair and the output disparity map) to learn an effective confidence measure. This strategy allows not only for seamless integration with any stereo system, including consumer and industrial devices equipped with undisclosed stereo perception methods, but also, thanks to its self-adapting capability, for out-of-the-box deployment in the field. Exhaustive experimental results with different standard datasets support our claims, showing how our solution is the first ever to enable online learning of accurate confidence estimation for any stereo system, without any requirement for the end-user.
-
Reversing the cycle: self-supervised deep stereo through enhanced monocular distillation Aleotti Filippo, Tosi Fabio, Zhang Li, Poggi Matteo, and Mattoccia Stefano In European Conference on Computer Vision (ECCV) [Abs]
In many fields, self-supervised learning solutions are rapidly evolving and filling the gap with supervised approaches. This is the case for depth estimation based on either monocular or stereo images, with the latter often providing a valid source of self-supervision for the former. In contrast, to soften typical stereo artefacts, we propose a novel self-supervised paradigm reversing the link between the two. Purposely, in order to train deep stereo networks, we distill knowledge through a monocular completion network. This architecture exploits single-image clues and a few sparse points, sourced by traditional stereo algorithms, to estimate dense yet accurate disparity maps by means of a consensus mechanism over multiple estimations. We thoroughly evaluate with popular stereo datasets the impact of different supervisory signals, showing how stereo networks trained with our paradigm outperform existing self-supervised frameworks. Finally, our proposal achieves notable generalization capabilities when dealing with domain shift issues.
-
On the uncertainty of self-supervised monocular depth estimation Poggi Matteo, Aleotti Filippo, Tosi Fabio, and Mattoccia Stefano In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) [Abs]
Self-supervised paradigms for monocular depth estimation are very appealing since they do not require ground truth annotations at all. Despite the astonishing results yielded by such methodologies, learning to reason about the uncertainty of the estimated depth maps is of paramount importance for practical applications, yet uncharted in the literature. Purposely, we explore for the first time how to estimate the uncertainty for this task and how this affects depth accuracy, proposing a novel technique specifically designed for self-supervised approaches. On the standard KITTI dataset, we exhaustively assess the performance of each method with different self-supervised paradigms. Such evaluation highlights that our proposal i) always improves depth accuracy significantly and ii) yields state-of-the-art results concerning uncertainty estimation when training on sequences, and competitive results when uniquely deploying stereo pairs.
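Among the families of techniques the paper evaluates, one standard way to model per-pixel aleatoric uncertainty is negative log-likelihood minimization, where the network predicts a scale alongside depth. A minimal sketch, written in supervised form for brevity, whereas the paper works with self-supervised objectives:

```python
import torch

def laplacian_nll(pred, logvar, target):
    # Heteroscedastic uncertainty: the network outputs a per-pixel log-scale
    # (logvar) alongside depth, so uncertain pixels down-weight their own
    # residuals while paying a penalty for being uncertain.
    return ((pred - target).abs() * torch.exp(-logvar) + logvar).mean()
```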
-
Distilled semantics for comprehensive scene understanding from videos Tosi Fabio, Aleotti Filippo, Zama Ramirez Pierluigi, Poggi Matteo, Salti Samuele, Di Stefano Luigi, and Mattoccia Stefano In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) [Abs]
Whole understanding of the surroundings is paramount to autonomous systems. Recent works have shown that deep neural networks can learn geometry (depth) and motion (optical flow) from a monocular video without any explicit supervision from ground truth annotations, particularly hard to source for these two tasks. In this paper, we take an additional step toward holistic scene understanding with monocular cameras by learning depth and motion alongside semantics, with supervision for the latter provided by proxy labels distilled from a pre-trained network. We address the three tasks jointly by a) a novel training protocol based on knowledge distillation and self-supervision and b) a compact network architecture which enables efficient scene understanding on both power-hungry GPUs and low-power embedded platforms. We thoroughly assess the performance of our framework and show that it yields state-of-the-art results for monocular depth estimation, optical flow and motion segmentation.
-
Leveraging a weakly adversarial paradigm for joint learning of disparity and confidence estimation Poggi Matteo, Tosi Fabio, Aleotti Filippo, and Mattoccia Stefano In International Conference on Pattern Recognition (ICPR) [Abs]
Deep architectures represent the state-of-the-art for perceiving depth from stereo images. Although these methods are highly accurate, it is crucial to effectively detect any outlier through confidence measures, since a wrong perception of even small portions of the sensed scene might lead to catastrophic consequences, for instance, in autonomous driving. Purposely, state-of-the-art confidence estimation methods rely on deep networks as well. In this paper, arguing that these tasks are two sides of the same coin, we propose a novel paradigm for their joint training. Specifically, inspired by the successful deployment of GANs in other fields, we design two deep architectures: a generator for disparity estimation and a discriminator for distinguishing correct assignments from outliers. The two networks are jointly trained in a new, peculiar weakly adversarial manner, pushing the former to fix the errors detected by the discriminator while keeping the correct predictions unchanged. Experimental results on standard stereo datasets prove that such a joint training paradigm is beneficial. Moreover, an additional outcome of our proposal is the ability to detect outliers with better accuracy compared to the state-of-the-art.
-
Enabling monocular depth perception at the very edge Peluso Valentino, Cipolletta Antonio, Calimera Andrea, Poggi Matteo, Tosi Fabio, Aleotti Filippo, and Mattoccia Stefano In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) [Abs]
Depth estimation is crucial in several computer vision applications, and a recent trend aims at inferring such a cue from a single camera through computationally demanding CNNs, precluding their practical deployment in several application contexts characterized by low-power constraints. Purposely, we develop a tiny network tailored to microcontrollers, processing low-resolution images to obtain a coarse depth map of the observed scene. Our solution enables depth perception with minimal power requirements (a few hundred milliwatts), accurate enough to pave the way for several high-level applications at the edge.
-
Learning End-To-End Scene Flow by Distilling Single Tasks Knowledge Aleotti Filippo, Poggi Matteo, Tosi Fabio, and Mattoccia Stefano In AAAI Conference on Artificial Intelligence (AAAI) [Abs]
Scene flow is a challenging task aimed at jointly estimating the 3D structure and motion of the sensed environment. Although deep learning solutions achieve outstanding performance in terms of accuracy, these approaches divide the whole problem into standalone tasks (stereo and optical flow), addressing them with independent networks. Such a strategy dramatically increases the complexity of the training procedure and requires power-hungry GPUs to infer scene flow barely at 1 FPS. Conversely, we propose DWARF, a novel and lightweight architecture able to infer full scene flow jointly reasoning about depth and optical flow, easily and elegantly trainable end-to-end from scratch. Moreover, since ground truth images for full scene flow are scarce, we propose to leverage the knowledge learned by networks specialized in stereo or flow, for which much more data are available, to distill proxy annotations. Exhaustive experiments show that i) DWARF runs at about 10 FPS on a single high-end GPU and about 1 FPS on an embedded NVIDIA Jetson TX2 at KITTI resolution, with a moderate drop in accuracy compared to 10x deeper models, and ii) learning from many distilled samples is more effective than learning from the few annotated ones available. Code available at: https://github.com/FilippoAleotti/Dwarf-Tensorflow
-
Real-time semantic stereo matching Dovesi Pier Luigi, Poggi Matteo, Andraghetti Lorenzo, Martı́ Miquel, Kjellström Hedvig, Pieropan Alessandro, and Mattoccia Stefano In International Conference on Robotics and Automation (ICRA) [Abs]
Scene understanding is paramount in robotics, self-navigation, augmented reality, and many other fields. To fully accomplish this task, an autonomous agent has to infer the 3D structure of the sensed scene (to know where it is looking) and its content (to know what it sees). To tackle the two tasks, deep neural networks trained to infer semantic segmentation and depth from stereo images are often the preferred choices. Specifically, Semantic Stereo Matching can be tackled by either standalone models trained for the two tasks independently or joint end-to-end architectures. Nonetheless, as proposed so far, both solutions are inefficient, requiring two forward passes in the former case or a complex single network in the latter, although jointly tackling both tasks is usually beneficial in terms of accuracy. In this paper, we propose a single compact and lightweight architecture for real-time semantic stereo matching. Our framework relies on coarse-to-fine estimations in a multi-stage fashion, allowing: i) very fast inference even on embedded devices, with marginal drops in accuracy, compared to state-of-the-art networks, and ii) trading accuracy for speed, according to the specific application requirements. Experimental results on high-end GPUs as well as on an embedded Jetson TX2 confirm the superiority of semantic stereo matching compared to standalone tasks and highlight the versatility of our framework on any hardware and for any application.
2019
-
Guided stereo matching Poggi Matteo, Pallotti Davide, Tosi Fabio, and Mattoccia Stefano In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) [Abs]
Stereo is a prominent technique to infer dense depth maps from images, and deep learning further pushed forward the state-of-the-art, making end-to-end architectures unrivaled when enough data is available for training. However, deep networks suffer from significant drops in accuracy when dealing with new environments. Therefore, in this paper, we introduce Guided Stereo Matching, a novel paradigm leveraging a small amount of sparse, yet reliable depth measurements retrieved from an external source to ameliorate this weakness. The additional sparse cues required by our method can be obtained with any strategy (e.g., a LiDAR) and are used to enhance features linked to the corresponding disparity hypotheses. Our formulation is general and fully differentiable, thus making it possible to exploit the additional sparse inputs in pre-trained deep stereo networks as well as to train a new instance from scratch. Extensive experiments on three standard datasets and two state-of-the-art deep architectures show that, even with a small set of sparse input cues, i) the proposed paradigm enables significant improvements to pre-trained networks; moreover, ii) training from scratch notably increases accuracy and robustness to domain shifts; finally, iii) it is suited and effective even with traditional stereo algorithms such as SGM.
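The guiding principle, peaking the features associated with disparity hypotheses around each sparse hint, can be pictured as a Gaussian modulation of a (B, D, H, W) volume; the tensor layout and constants below are assumptions for illustration.

```python
import torch

def guided_modulation(volume, hints, valid, k=10.0, c=1.0):
    # volume: (B,D,H,W) cost/feature volume over D disparity hypotheses;
    # hints/valid: (B,1,H,W) sparse disparity measurements and their mask.
    d = volume.shape[1]
    disp = torch.arange(d, device=volume.device, dtype=volume.dtype).view(1, d, 1, 1)
    gauss = 1.0 + k * torch.exp(-((disp - hints) ** 2) / (2 * c ** 2))
    # Only modulate where a sparse measurement is available.
    return torch.where(valid.bool(), volume * gauss, volume)

# out = guided_modulation(torch.rand(1, 192, 64, 128),
#                         torch.randint(0, 192, (1, 1, 64, 128)).float(),
#                         torch.rand(1, 1, 64, 128) > 0.95)
```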
-
Real-time self-adaptive deep stereo Tonioni Alessio, Tosi Fabio, Poggi Matteo, Mattoccia Stefano, and Di Stefano Luigi In IEEE Conference on Computer Vision and Pattern Recognition (CVPR, Oral) [Abs]
Deep convolutional neural networks trained end-to-end are the state-of-the-art methods to regress dense disparity maps from stereo pairs. These models, however, suffer from a notable decrease in accuracy when exposed to scenarios significantly different from the training set (e.g., real vs synthetic images, etc.). We argue that it is extremely unlikely to gather enough samples to achieve effective training/tuning in any target domain, thus making this setup impractical for many applications. Instead, we propose to perform unsupervised and continuous online adaptation of a deep stereo network, which allows for preserving its accuracy in any environment. However, this strategy is extremely computationally demanding and thus prevents real-time inference. We address this issue introducing a new lightweight, yet effective, deep stereo architecture, Modularly ADaptive Network (MADNet), and developing a Modular ADaptation (MAD) algorithm, which independently trains sub-portions of the network. By deploying MADNet together with MAD we introduce the first real-time self-adaptive deep stereo system enabling competitive performance on heterogeneous datasets. Our code is publicly available at https://github.com/CVLAB-Unibo/Real-time-self-adaptive-deep-stereo.
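The modular adaptation idea, updating only one sub-portion of the network per online step to keep adaptation real-time, can be sketched as follows; uniform module sampling stands in for MAD's actual selection strategy, so this is an illustration rather than the released implementation.

```python
import random
import torch

def mad_step(modules, loss_fn, batch, optimizers):
    # modules: list of network sub-portions; optimizers: one optimizer per module.
    idx = random.randrange(len(modules))
    for i, m in enumerate(modules):          # freeze everything but one module
        for p in m.parameters():
            p.requires_grad_(i == idx)
    loss = loss_fn(batch)                    # e.g. a self-supervised photometric loss
    optimizers[idx].zero_grad()
    loss.backward()                          # gradients flow only into module idx
    optimizers[idx].step()
    return loss.item()
```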
-
Learning monocular depth estimation infusing traditional stereo knowledge Tosi Fabio, Aleotti Filippo, Poggi Matteo, and Mattoccia Stefano In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) [Abs]
Depth estimation from a single image represents a fascinating, yet challenging problem with countless applications. Recent works proved that this task could be learned without direct supervision from ground truth labels, leveraging image synthesis on sequences or stereo pairs. Focusing on this second case, in this paper we leverage stereo matching in order to improve monocular depth estimation. To this aim, we propose monoResMatch, a novel deep architecture designed to infer depth from a single input image by synthesizing features from a different point of view, horizontally aligned with the input image, and performing stereo matching between the two cues. In contrast to previous works sharing this rationale, our network is the first trained end-to-end from scratch. Moreover, we show how obtaining proxy ground truth annotations through traditional stereo algorithms, such as Semi-Global Matching, enables more accurate monocular depth estimation while avoiding the need for expensive depth labels, keeping the approach self-supervised. Exhaustive experimental results prove how the synergy between i) the proposed monoResMatch architecture and ii) proxy-supervision attains state-of-the-art for self-supervised monocular depth estimation. The code is publicly available at https://github.com/fabiotosi92/monoResMatch-Tensorflow.
-
Leveraging confident points for accurate depth refinement on embedded systems Tosi Fabio, Poggi Matteo, and Mattoccia Stefano In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) [Abs]
Despite the notable progress in stereo disparity estimation, algorithms are still prone to errors in challenging conditions. Thus, heuristic disparity refinement techniques are usually deployed to improve accuracy. Moreover, state-of-the-art methods rely on complex CNNs requiring power-hungry GPUs, not suited for many practical applications constrained by limited computing resources. In this paper, we propose a novel technique for disparity refinement leveraging confidence measures and a novel, automatic learning-based selection method to discard outliers. Then, a non-local strategy infers missing disparities by analyzing the closest reliable points. This framework is very fast and does not require any hand-tuned thresholding. We assess the performance of our Non-Local Anchoring (NLA) against standalone refinement techniques and against methods leveraging confidence measures inside the stereo algorithm. Our evaluation with two popular stereo algorithms shows that our proposal significantly ameliorates their accuracy on the Middlebury v3 and KITTI 2015 datasets. Moreover, since our method relies only on cues computed in the disparity domain, it is suited even for COTS stereo cameras coupled with embedded systems, e.g. the NVIDIA Jetson TX2.
-
Enabling energy-efficient unsupervised monocular depth estimation on armv7-based platforms Peluso Valentino, Cipolletta Antonio, Calimera Andrea, Poggi Matteo, Tosi Fabio, and Mattoccia Stefano In Design, Automation & Test in Europe Conference & Exhibition (DATE) [Abs]
This work deals with the implementation of energy-efficient monocular depth estimation using a low-cost CPU for low-power embedded systems. It first describes the PyD-Net depth estimation network, which consists of a lightweight CNN able to approach state-of-the-art accuracy with ultra-low resource usage. Then it proposes an accuracy-driven complexity reduction strategy based on a hardware-friendly fixed-point quantization. Finally, it introduces the low-level optimization enabling effective use of integer neural kernels. The objective is threefold: (i) prove the efficiency of the new quantization flow on a depth estimation network, that is, the capability to retain the accuracy reached by floating-point arithmetic using 16- and 8-bit integers, (ii) demonstrate the portability of the quantized model onto a general-purpose 32-bit RISC architecture of the ARM Cortex family, (iii) quantify the accuracy-energy tradeoff of unsupervised monocular estimation to establish its use in the embedded domain. The experiments have been run on a Raspberry Pi board powered by a Broadcom BCM2837 chipset. A parametric analysis conducted over the KITTI dataset shows marginal accuracy loss with 16-bit (8-bit) integers and energy savings up to 6.55× (9.23×) w.r.t. floating-point. Compared to high-end CPUs and GPUs, the proposed solution improves scalability.
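The basic rounding step of a symmetric fixed-point quantizer, the starting point of such a flow, can be sketched in a few lines; the single per-tensor scale and clipping rule are illustrative simplifications of the paper's accuracy-driven pipeline.

```python
import numpy as np

def quantize_symmetric(w, bits=8):
    # Map a float tensor to integers in [-(2^(b-1)-1), 2^(b-1)-1]
    # with one scale per tensor; dequantize with q * scale.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

# w = np.random.randn(64, 32).astype(np.float32)
# q, s = quantize_symmetric(w, bits=8); w_hat = q * s  # reconstructed weights
```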
-
Enhancing self-supervised monocular depth estimation with traditional visual odometry Andraghetti Lorenzo, Myriokefalitakis Panteleimon, Dovesi Pier Luigi, Luque Belen, Poggi Matteo, Pieropan Alessandro, and Mattoccia Stefano In International Conference on 3D Vision (3DV) [Abs]
Estimating depth from a single image represents an attractive alternative to more traditional approaches leveraging multiple cameras. In this field, deep learning yielded outstanding results at the cost of needing large amounts of data labeled with precise depth measurements for training, an issue softened by self-supervised approaches leveraging monocular sequences or stereo pairs in place of expensive ground truth depth annotations. This paper further improves monocular depth estimation by integrating a geometrical prior into existing self-supervised networks. Specifically, we propose a sparsity-invariant autoencoder able to process the output of conventional visual odometry algorithms, working in synergy with depth-from-mono networks. Experimental results on the KITTI dataset show that, by exploiting the geometrical prior, our proposal: i) outperforms existing approaches in the literature and ii) couples well with both compact and complex depth-from-mono architectures, allowing for its deployment on high-end GPUs as well as on embedded devices (e.g., NVIDIA Jetson TX2).
2018
-
Beyond local reasoning for stereo confidence estimation with deep learning Tosi Fabio, Poggi Matteo, Benincasa Antonio, and Mattoccia Stefano In European Conference on Computer Vision (ECCV) [Abs]
Confidence measures for stereo gained popularity in recent years due to their improved capability to detect outliers and the increasing number of applications exploiting these cues. In this field, convolutional neural networks achieved top performance compared to other known techniques in the literature by processing local information to tell disparity assignments from outliers. Despite these outstanding achievements, all approaches rely on clues extracted with small receptive fields, thus ignoring most of the overall image content. Therefore, in this paper, we propose to exploit nearby and farther clues available from the image and disparity domains to obtain a more accurate confidence estimation. While local information is very effective for detecting high-frequency patterns, it lacks insights from farther regions in the scene. On the other hand, enlarging the receptive field allows including clues from farther regions but produces smoother uncertainty estimation, not particularly accurate when dealing with high-frequency patterns. For these reasons, we propose a multi-stage cascaded network to combine the best of the two worlds. Extensive experiments on three datasets using three popular stereo algorithms prove that the proposed framework outperforms state-of-the-art confidence estimation techniques.
-
Towards real-time unsupervised monocular depth estimation on cpu Poggi Matteo, Aleotti Filippo, Tosi Fabio, and Mattoccia Stefano In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) [Abs]
Unsupervised depth estimation from a single image is a very attractive technique with several implications in robotics, autonomous navigation, augmented reality and so on. This topic represents a very challenging task, and the advent of deep learning made it possible to tackle this problem with excellent results. However, these architectures are extremely deep and complex. Thus, real-time performance can be achieved only by leveraging power-hungry GPUs that do not allow inferring depth maps in application fields characterized by low-power constraints. To tackle this issue, in this paper we propose a novel architecture capable of quickly inferring an accurate depth map on a CPU, even of an embedded system, using a pyramid of features extracted from a single input image. Similarly to the state of the art, we train our network in an unsupervised manner, casting depth estimation as an image reconstruction problem. Extensive experimental results on the KITTI dataset show that, compared to the top-performing approach, our network has similar accuracy but a much lower complexity (about 6% of the parameters), enabling it to infer a depth map for a KITTI image in about 1.7 s on the Raspberry Pi 3 and at more than 8 Hz on a standard CPU. Moreover, by trading accuracy for efficiency, our network can infer maps at about 2 Hz and 40 Hz respectively, still being more accurate than most state-of-the-art slower methods. To the best of our knowledge, it is the first method enabling such performance on CPUs, paving the way for effective deployment of unsupervised monocular depth estimation even on embedded systems.
-
Geometry meets semantics for semi-supervised monocular depth estimation Zama Ramirez Pierluigi, Poggi Matteo, Tosi Fabio, Mattoccia Stefano, and Di Stefano Luigi In Asian Conference on Computer Vision (ACCV) [Abs]
Depth estimation from a single image represents a very exciting challenge in computer vision. While other image-based depth sensing techniques leverage the geometry between different viewpoints (e.g., stereo or structure from motion), the lack of these cues within a single image renders the monocular depth estimation task ill-posed. For inference, state-of-the-art encoder-decoder architectures for monocular depth estimation rely on effective feature representations learned at training time. For unsupervised training of these models, geometry has been effectively exploited by suitable image warping losses computed from views acquired by a stereo rig or a moving camera. In this paper, we make a further step forward, showing that learning semantic information from images enables improving monocular depth estimation as well. In particular, by leveraging semantically labeled images together with unsupervised signals gained by geometry through an image warping loss, we propose a deep learning approach aimed at joint semantic segmentation and depth estimation. Our overall learning framework is semi-supervised, as we deploy ground-truth data only in the semantic domain. At training time, our network learns a common feature representation for both tasks, and a novel cross-task loss function is proposed. The experimental findings show how jointly tackling depth prediction and semantic segmentation improves depth estimation accuracy. In particular, on the KITTI dataset our network outperforms state-of-the-art methods for monocular depth estimation.
-
Learning monocular depth estimation with unsupervised trinocular assumptions Poggi Matteo, Tosi Fabio, and Mattoccia Stefano In International Conference on 3D Vision (3DV) [Abs]
Obtaining accurate depth measurements out of a single image represents a fascinating solution to 3D sensing. CNNs led to considerable improvements in this field, and recent trends replaced the need for ground-truth labels with geometry-guided image reconstruction signals enabling unsupervised training. Currently, for this purpose, state-of-the-art techniques rely on images acquired with a binocular stereo rig to predict inverse depth (i.e., disparity) according to the aforementioned supervision principle. However, these methods suffer from well-known problems near occlusions, at the left image border, and so on, inherited from the stereo setup. Therefore, in this paper, we tackle these issues by moving to a trinocular domain for training. Assuming the central image as the reference, we train a CNN to infer disparity representations pairing such an image with frames on its left and right sides. This strategy allows obtaining depth maps not affected by typical stereo artifacts. Moreover, since trinocular datasets are seldom available, we introduce a novel interleaved training procedure enabling us to enforce the trinocular assumption starting from current binocular datasets. Exhaustive experimental results on the KITTI dataset confirm that our proposal outperforms state-of-the-art methods for unsupervised monocular depth estimation trained on binocular stereo pairs, as well as any known method relying on other cues.
-
Generative adversarial networks for unsupervised monocular depth prediction Aleotti Filippo, Tosi Fabio, Poggi Matteo, and Mattoccia Stefano In European Conference on Computer Vision Workshops (ECCVW) [Abs]
Estimating depth from a single image is a very challenging and exciting topic in computer vision, with implications in several application domains. Recently proposed deep learning approaches achieve outstanding results by tackling it as an image reconstruction task and exploiting geometry constraints (e.g., epipolar geometry) to obtain supervisory signals for training. Inspired by these works and by the compelling results achieved by Generative Adversarial Networks (GANs) on image reconstruction and generation tasks, in this paper we propose to cast unsupervised monocular depth estimation within a GAN paradigm. The generator network learns to infer depth from the reference image to generate a warped target image. At training time, the discriminator network learns to distinguish between fake images generated by the generator and target frames acquired with a stereo rig. To the best of our knowledge, our proposal is the first successful attempt to tackle monocular depth estimation with a GAN paradigm, and the extensive evaluation on the CityScapes and KITTI datasets confirms that it improves over traditional approaches. Additionally, we highlight a major issue with the data deployed by a standard evaluation protocol widely used in this field and fix this problem using a more reliable dataset recently made available by the KITTI evaluation benchmark.
2017
-
Learning to predict stereo reliability enforcing local consistency of confidence maps Poggi Matteo, and Mattoccia Stefano In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) [Abs]
Confidence measures estimate unreliable disparity assignments performed by a stereo matching algorithm and, as recently proved, can be used for several purposes. This paper aims at increasing, by means of a deep network, the effectiveness of state-of-the-art confidence measures exploiting the local consistency assumption. We exhaustively evaluated our proposal on 23 confidence measures, including 5 top-performing ones based on random-forests and CNNs, training our networks with two popular stereo algorithms and a small subset (25 out of 194 frames) of the KITTI 2012 dataset. Experimental results show that our approach dramatically increases the effectiveness of all the 23 confidence measures on the remaining frames. Moreover, without re-training, we report a further cross-evaluation on KITTI 2015 and Middlebury 2014 confirming that our proposal provides remarkable improvements for each confidence measure even when dealing with significantly different input data. To the best of our knowledge, this is the first method to move beyond conventional pixel-wise confidence estimation.
-
Quantitative evaluation of confidence measures in a machine learning world Poggi Matteo, Tosi Fabio, and Mattoccia Stefano In IEEE International Conference on Computer Vision (ICCV, Spotlight)
-
Unsupervised adaptation for deep stereo Tonioni Alessio, Poggi Matteo, Mattoccia Stefano, and Di Stefano Luigi In IEEE International Conference on Computer Vision (ICCV) [Abs]
Recent ground-breaking works have shown that deep neural networks can be trained end-to-end to regress dense disparity maps directly from image pairs. Computer-generated imagery is deployed to gather the large data corpus required to train such networks, with additional fine-tuning allowing the model to work well also on real and possibly diverse environments. Yet, besides a few public datasets such as KITTI, the ground-truth needed to adapt the network to a new scenario is hardly available in practice. In this paper, we propose a novel unsupervised adaptation approach that enables fine-tuning a deep learning stereo model without any ground-truth information. We rely on off-the-shelf stereo algorithms together with state-of-the-art confidence measures, the latter able to ascertain the correctness of the measurements yielded by the former. Thus, we train the network based on a novel loss function that penalizes predictions disagreeing with the highly confident disparities provided by the algorithm and enforces a smoothness constraint. Experiments on popular datasets (KITTI 2012, KITTI 2015 and Middlebury 2014) and other challenging test images demonstrate the effectiveness of our proposal.
-
Learning confidence measures in the wild Tosi Fabio, Poggi Matteo, Tonioni Alessio, Di Stefano Luigi, and Mattoccia Stefano In British Machine Vision Conference (BMVC) [Abs]
Confidence measures for stereo have earned increasing popularity in most recent works concerning stereo, being effectively deployed to improve its accuracy. While most measures are obtained by processing cues from the cost volume, top-performing ones usually leverage random forests or CNNs to predict match reliability. Therefore, a proper amount of labeled data is required to effectively train such confidence measures. Since such ground-truth labels are not always available in practical applications, in this paper we propose a methodology suited for training confidence measures in a self-supervised manner. Leveraging a pool of properly selected conventional measures, we automatically detect a subset of very reliable pixels as well as a subset of erroneous samples from the output of a stereo algorithm. This strategy provides labels for training confidence measures based on machine learning techniques without ground-truth labels. Compared to the state-of-the-art, our method is neither constrained to image sequences nor to image content.
-
Even more confident predictions with deep machine-learning Poggi Matteo, Tosi Fabio, and Mattoccia Stefano In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) [Abs]
Confidence measures aim at discriminating unreliable disparities inferred by a stereo vision system from reliable ones. A common and effective strategy adopted by most top-performing approaches consists in combining multiple confidence measures by means of an appropriately trained random-forest classifier. In this paper, we propose a novel approach by training an n-channel convolutional neural network on a set of feature maps, each one encoding the outcome of a single confidence measure. This strategy enables moving the confidence prediction problem from the conventional 1D feature maps domain, adopted by approaches based on random forests, to a more distinctive 3D domain, going beyond single-pixel analysis. This fact, coupled with a deep network appropriately trained on a small subset of images, enables outperforming top-performing approaches based on random forests.
-
Efficient confidence measures for embedded stereo Poggi Matteo, Tosi Fabio, and Mattoccia Stefano In International Conference on Image Analysis and Processing (ICIAP) [Abs]
The advent of embedded stereo cameras based on low-power and compact devices such as FPGAs (Field Programmable Gate Arrays) has made it possible to effectively address several computer vision problems. However, since the depth data generated by stereo algorithms is affected by errors, reliable strategies to detect wrong disparity assignments by means of confidence measures are desirable. Recent works proved that confidence measures are also a powerful cue to improve the overall accuracy of stereo. Most approaches aimed at predicting match reliability rely on cost volume analysis, information seldom available as output of most embedded depth sensors. Therefore, in this paper we analyze and evaluate strategies compatible with the constraints of embedded stereo cameras. In particular, we focus our attention on methods to infer match reliability inside depth sensors based on highly constrained computing architectures such as FPGAs. We quantitatively assess, on the Middlebury 2014 and KITTI 2015 datasets, the impact of different design strategies for 16 confidence measures from the literature suited for implementation on such embedded systems. Our evaluation shows that, compared to the confidence measures typically deployed in this context and based on storing intermediate results, other approaches yield much more accurate predictions with negligible computing requirements and memory footprint. This enables their implementation even on highly constrained architectures.
2016
-
Learning from scratch a confidence measure Poggi Matteo, and Mattoccia Stefano In British Machine Vision Conference (BMVC) [Abs]
Stereo vision is a popular technique to infer depth from two or more images. In this field, confidence measures, typically obtained from the analysis of the cost volume, aim at detecting uncertain disparity assignments. As recently proved, multiple confidence measures combined with hand-crafted features extracted from the cost volume can also be used for other purposes, in particular to improve the overall disparity accuracy by leveraging machine learning techniques. In this paper, starting from the observation that recurrent local patterns occurring in disparity maps can tell a correct assignment from a wrong one, we follow a completely different methodology and infer a novel confidence measure from scratch. Specifically, leveraging Convolutional Neural Networks, we pose the confidence formulation as a regression problem by analyzing the disparity map provided by a stereo vision system. Once trained on a subset of the KITTI 2012 dataset with the disparity maps provided by a simple block-matching algorithm, our confidence measure outperforms the state-of-the-art on two datasets (KITTI 2015 and Middlebury 2014) as well as with two stereo algorithms. The reported experimental evaluation clearly highlights that our approach generalizes better than the state-of-the-art across different circumstances. Finally, not being based on cost volume analysis, our proposal is also suited for out-of-the-box depth generation devices, which usually do not expose the cues required by top-performing approaches.
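A minimal sketch of the confidence-from-disparity regression idea, assuming a fully convolutional patch-based design; filter counts and depth are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DisparityConfidenceNet(nn.Module):
    """A small CNN looks at a single-channel disparity patch and
    regresses the reliability of its central pixel. Applied fully
    convolutionally, it scores every pixel of the map in one pass."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3), nn.ReLU(),
            nn.Conv2d(64, 64, 3), nn.ReLU(),
            nn.Conv2d(64, 64, 3), nn.ReLU(),
            nn.Conv2d(64, 64, 3), nn.ReLU(),  # 9x9 receptive field overall
        )
        self.head = nn.Sequential(
            nn.Conv2d(64, 100, 1), nn.ReLU(),  # 1x1 convs act as per-pixel FC layers
            nn.Conv2d(100, 1, 1), nn.Sigmoid(),
        )

    def forward(self, disparity):  # (B, 1, H, W)
        return self.head(self.features(disparity))
```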
-
Learning a general-purpose confidence measure based on O(1) features and a smarter aggregation strategy for semi global matching Poggi Matteo, and Mattoccia Stefano In International Conference on 3D Vision (3DV, Oral) [Abs]
Inferring dense depth from stereo is crucial for several computer vision applications, and Semi Global Matching (SGM) is often the preferred choice due to its good tradeoff between accuracy and computation requirements. Nevertheless, it suffers from two major issues: streaking artifacts, caused by the Scanline Optimization (SO) approach at the core of the algorithm, which may lead to inaccurate results, and a high memory footprint, which may become prohibitive with high-resolution images or devices with constrained resources. In this paper, we propose a smart scanline aggregation approach for SGM aimed at dealing with both issues. In particular, the contribution of this paper is threefold: i) leveraging machine learning, it proposes a novel general-purpose confidence measure, based on O(1) features, suited for any stereo algorithm, that outperforms the state-of-the-art; ii) taking advantage of this confidence measure, it proposes a smart aggregation strategy for SGM enabling significant improvements with a very small overhead; iii) the overall strategy drastically reduces the memory footprint of SGM and, at the same time, improves its effectiveness and execution time. We provide extensive experimental results, including a cross-validation across multiple datasets (KITTI 2012, KITTI 2015, and Middlebury 2014).
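The aggregation idea can be sketched as follows, under the simplifying assumption that each scanline yields its own disparity hypothesis and confidence map; the actual strategy in the paper operates on the scanline costs, so this is a deliberate simplification:

```python
import numpy as np

def confidence_weighted_selection(disparities, confidences):
    """Illustrative confidence-driven scanline aggregation: given
    per-scanline disparity maps (S, H, W) and their confidences
    (S, H, W), keep at each pixel the hypothesis from the most
    reliable scanline."""
    best = np.argmax(confidences, axis=0)  # (H, W) index of winning scanline
    rows, cols = np.indices(best.shape)
    return disparities[best, rows, cols]
```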
-
Deep stereo fusion: combining multiple disparity hypotheses with deep-learning Poggi Matteo, and Mattoccia Stefano In International Conference on 3D Vision (3DV) [Abs]
Stereo matching is a popular technique to infer depth from two or more images, and a wealth of methods have been proposed to deal with this problem. Despite these efforts, finding accurate stereo correspondences is still an open problem. The strengths and weaknesses of existing methods are often complementary, and in this paper, motivated by recent trends in this field, we exploit this fact by proposing Deep Stereo Fusion, a Convolutional Neural Network capable of combining the output of multiple stereo algorithms in order to obtain a more accurate result than each input disparity map. Deep Stereo Fusion processes a 3D feature vector, encoding both spatial and cross-algorithm information, in order to select the best disparity hypothesis among those proposed by the individual stereo matchers. To the best of our knowledge, our proposal is the first i) to leverage deep learning for this task and ii) to predict the optimal disparity assignments by taking only the disparity maps as input cue. This second feature makes our method suitable for deployment even when other cues (e.g., confidence) are not available, such as when dealing with disparity maps provided by off-the-shelf 3D sensors. We thoroughly evaluate our proposal on the KITTI stereo benchmark with respect to the state-of-the-art in this field.
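A hedged sketch of the fusion principle; the architecture below is invented for illustration, and only the interface, stacked disparity maps in and a per-pixel hypothesis selection out, follows the abstract:

```python
import torch
import torch.nn as nn

class DeepStereoFusionSketch(nn.Module):
    """N stacked disparity maps go through a small CNN that scores
    each hypothesis per pixel; the highest-scoring one is kept."""
    def __init__(self, n_algorithms):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Conv2d(n_algorithms, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, n_algorithms, 3, padding=1),  # one score per hypothesis
        )

    def forward(self, disparity_stack):  # (B, N, H, W)
        scores = self.scorer(disparity_stack)
        choice = scores.argmax(dim=1, keepdim=True)      # winning algorithm per pixel
        return torch.gather(disparity_stack, 1, choice)  # fused disparity (B, 1, H, W)
```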
-
A wearable mobility aid for the visually impaired based on embedded 3D vision and deep learning Poggi Matteo, and Mattoccia Stefano In 2016 IEEE Symposium on Computers and Communication (ISCC) [Abs]
In this paper we propose an effective, wearable mobility aid for people suffering from visual impairments, purely based on 3D computer vision and machine learning techniques. By wearing our device, users can perceive, guided by audio messages and tactile feedback, crucial information about the surrounding environment and hence avoid obstacles along the path. Our proposal can work in synergy with the white cane and allows for very effective, real-time obstacle detection on an embedded computer by processing the point cloud provided by a custom RGBD sensor based on passive stereo vision. Moreover, leveraging deep-learning techniques, our system can semantically categorize the detected obstacles in order to increase awareness of the explored environment. It can optionally work in synergy with a smartphone, wirelessly connected to the proposed mobility aid, exploiting its audio capabilities and standard GPS-based navigation tools such as Google Maps. The overall system can operate in real-time for hours using a small battery, making it suitable for everyday life. Experimental results confirmed that our proposal has excellent obstacle detection performance and a promising semantic categorization capability.
-
Evaluation of variants of the SGM algorithm aimed at implementation on embedded or reconfigurable devices Poggi Matteo, and Mattoccia Stefano In 2016 International Conference on 3D Imaging (IC3D) [Abs]
Inferring dense depth from stereo is crucial for several computer vision applications, and stereo cameras based on embedded systems and/or reconfigurable devices such as FPGAs have become quite popular in recent years. In this field, Semi Global Matching (SGM) is, in most cases, the preferred algorithm due to its good trade-off between accuracy and computation requirements. Nevertheless, a careful design of the processing pipeline enables significant improvements in terms of disparity map accuracy, hardware resources, and frame rate. In particular, factors like the amount of matching costs and parameters, such as the number and selection of scanlines, have a great impact on the overall resource requirements. In this paper we evaluate different variants of the SGM algorithm suited for implementation on embedded or reconfigurable devices, looking for the best compromise in terms of resource requirements, accuracy of the disparity estimation, and running time. To quantitatively assess the effectiveness of the considered variants, we adopt the KITTI 2015 training dataset, a challenging and standard benchmark with ground truth containing several realistic scenes.
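For reference, every variant evaluated builds on the standard SGM scanline recurrence (Hirschmüller's formulation), which aggregates a matching cost C(p, d) along each scanline direction r; the matching cost function and the number/selection of directions r are exactly the design knobs explored:

```latex
L_r(\mathbf{p}, d) = C(\mathbf{p}, d)
  + \min\Big( L_r(\mathbf{p}-\mathbf{r}, d),\;
              L_r(\mathbf{p}-\mathbf{r}, d-1) + P_1,\;
              L_r(\mathbf{p}-\mathbf{r}, d+1) + P_1,\;
              \min_k L_r(\mathbf{p}-\mathbf{r}, k) + P_2 \Big)
  - \min_k L_r(\mathbf{p}-\mathbf{r}, k),
\qquad
S(\mathbf{p}, d) = \sum_r L_r(\mathbf{p}, d)
```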
-
Improving the reliability of 3D people tracking system by means of deep-learning Boschini Matteo, Poggi Matteo, and Mattoccia Stefano In 2016 International Conference on 3D Imaging (IC3D) [Abs]
People tracking is a crucial task in most computer vision applications aimed at analyzing specific behaviors in the sensed area. Practical applications include vision analytics, people counting, etc. In order to properly follow the actions of a single subject, a people tracking framework needs to robustly distinguish them from the rest of the surrounding environment, thus allowing proper handling of changing positions, occlusions, and so on. The recent widespread diffusion of deep learning techniques across almost any kind of computer vision application provides a powerful methodology to address recognition. On the other hand, a large amount of data is required to train state-of-the-art Convolutional Neural Networks (CNNs), a problem solved, when possible, by means of transfer learning. In this paper, we propose a novel dataset made of nearly 26 thousand samples acquired with a custom stereo camera providing depth according to a fast and accurate stereo algorithm. The dataset includes sequences acquired in different environments, with more than 20 different people moving across the sensed area. Having labeled the 26K images and depth maps, we train a head detection module based on a state-of-the-art deep network on a portion of the dataset and validate it on a different sequence. Finally, we include the head detection module within an existing 3D tracking framework, showing that the proposed approach notably improves people detection and tracking accuracy.
2015
-
Crosswalk recognition through point-cloud processing and deep-learning suited to a wearable mobility aid for the visually impaired Poggi Matteo, Nanni Luca, and Mattoccia Stefano In International Conference on Image Analysis and Processing (ICIAP) [Abs]
In smart cities, computer vision has the potential to dramatically improve the quality of life of people suffering from visual impairments. In this field, we have been working on a wearable mobility aid aimed at detecting, in real-time, obstacles in front of a visually impaired user. Our approach relies on a custom RGBD camera, with FPGA on-board processing, worn as traditional eyeglasses, and effective point-cloud processing implemented on a compact and lightweight embedded computer. This latter device also provides feedback to the user by means of a haptic interface as well as audio messages. In this paper we address crosswalk recognition, which, as pointed out by several visually impaired users involved in the evaluation of our system, is a crucial requirement in the design of an effective mobility aid. Specifically, we propose a reliable methodology to detect and categorize crosswalks by leveraging point-cloud processing and deep-learning techniques. The experimental results, reported on 10000+ frames, confirm that the proposed approach is invariant to head/camera pose and extremely effective even when dealing with the large occlusions typically found in urban environments.
-
A passive RGBD sensor for accurate and real-time depth sensing self-contained into an FPGA Mattoccia Stefano, and Poggi Matteo In International Conference on Distributed Smart Cameras (ICDSC) [Abs]
In this paper we describe the strategy adopted to design, from scratch, an embedded RGBD sensor for accurate and dense depth perception on a low-cost FPGA. This device infers, at more than 30 Hz, dense depth maps according to a state-of-the-art stereo vision processing pipeline entirely mapped into the FPGA without buffering partial results on external memories. The strategy outlined in this paper enables accurate depth computation with low latency and a simple hardware design. On the other hand, it poses major constraints on the computing structure of the algorithms that fit this simplified architecture; thus, in this paper, we discuss the solutions devised to overcome these issues. We report experimental results on practical application scenarios in which the proposed RGBD sensor provides accurate, real-time depth sensing suited for the embedded vision domain.