Software/Datasets


End-to-End Learned Event- and Image-based Visual Odometry


Visual Odometry (VO) is crucial for autonomous robotic navigation, especially in GPS-denied environments like planetary terrains. To improve robustness, recent model-based VO systems have begun combining standard and event-based cameras. While event cameras excel in low-light and high-speed motion, standard cameras provide dense and easier-to-track features. However, the field of image- and event-based VO still predominantly relies on model-based methods and is yet to fully integrate recent image-only advancements leveraging end-to-end learning-based architectures. Seamlessly integrating the two modalities remains challenging due to their different nature, one asynchronous, the other not, limiting the potential for a more effective image- and event-based VO. We introduce RAMP-VO, the first end-to-end learned image- and event-based VO system. It leverages novel Recurrent, Asynchronous, and Massively Parallel (RAMP) encoders capable of fusing asynchronous events with image data, providing 8x faster inference and 33% more accurate predictions than existing solutions. Despite being trained only in simulation, RAMP-VO outperforms previous methods on the newly introduced Apollo and Malapert datasets, and on existing benchmarks, where it improves image- and event-based methods by 58.8% and 30.6%, paving the way for robust and asynchronous VO in space.


References

IROS24_Pellerito

Roberto Pellerito, Marco Cannici, Daniel Gehrig, Joris Belhadj, Olivier Dubois-Matra, Massimo Casasco, Davide Scaramuzza

Deep Visual Odometry with Events and Frames

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024.

PDF Code and Data Video


Reinforcement Learning Meets Visual Odometry


Visual Odometry (VO) is essential to downstream mobile robotics and augmented/virtual reality tasks. Despite recent advances, existing VO methods still rely on heuristic design choices that require several weeks of hyperparameter tuning by human experts, hindering generalizability and robustness. We address these challenges by reframing VO as a sequential decision-making task and applying Reinforcement Learning (RL) to adapt the VO process dynamically. Our approach introduces a neural network, operating as an agent within the VO pipeline, to make decisions such as keyframe and grid-size selection based on real-time conditions. Our method minimizes reliance on heuristic choices using a reward function based on pose error, runtime, and other metrics to guide the system. Our RL framework treats the VO system and the image sequence as an environment, with the agent receiving observations from keypoints, map statistics, and prior poses. Experimental results using classical VO methods and public benchmarks demonstrate improvements in accuracy and robustness, validating the generalizability of our RL-enhanced VO approach to different scenarios. We believe this paradigm shift advances VO technology by eliminating the need for time-intensive parameter tuning of heuristics.
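
To make the sequential-decision framing concrete, here is a minimal, purely illustrative policy-gradient sketch in Python/NumPy: a stochastic policy maps hand-picked VO statistics to a keyframe decision, and a reward trades off pose error against runtime. The observation features, reward weights, and simulated environment response are invented stand-ins, not the paper's actual agent or VO pipeline.

# Hypothetical sketch: an RL agent deciding keyframe insertion inside a VO loop.
import numpy as np

rng = np.random.default_rng(0)

def policy(obs, theta):
    # Linear policy with a sigmoid head: probability of inserting a keyframe.
    logit = obs @ theta
    return 1.0 / (1.0 + np.exp(-logit))

def reward(pose_error, runtime_ms, keyframe_inserted):
    # Penalize pose error and runtime; keyframes cost extra computation.
    return -pose_error - 0.01 * runtime_ms - (0.5 if keyframe_inserted else 0.0)

theta = np.zeros(3)                      # weights over [num_keypoints, mean_parallax, last_error]
for step in range(100):                  # REINFORCE-style update on a simulated VO environment
    obs = rng.normal(size=3)             # stand-in for keypoint/map/pose statistics
    p = policy(obs, theta)
    action = rng.random() < p            # stochastic keyframe decision
    # Stand-in environment response: keyframes reduce error but add runtime.
    pose_error = max(0.0, rng.normal(0.5) - (0.3 if action else 0.0))
    runtime_ms = 20.0 + (10.0 if action else 0.0)
    r = reward(pose_error, runtime_ms, action)
    grad_logp = (1.0 - p) * obs if action else (-p) * obs   # d log pi / d theta for a Bernoulli policy
    theta += 0.01 * r * grad_logp        # vanilla policy-gradient step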


References

ECCV24_Messikommer

Nico Messikommer*, Giovanni Cioffi*, Mathias Gehrig, Davide Scaramuzza

Reinforcement Learning Meets Visual Odometry

European Conference on Computer Vision (ECCV), 2024.

PDF Video Code


Code and Dataset for "Low Latency Automotive Vision with Event Cameras"

The computer vision algorithms used in today's advanced driver assistance systems rely on image-based RGB cameras, leading to a critical bandwidth-latency trade-off for delivering safe driving experiences. To address this, event cameras have emerged as alternative vision sensors. Event cameras measure changes in intensity asynchronously, offering high temporal resolution and sparsity, drastically reducing bandwidth and latency requirements. Despite these advantages, event camera-based algorithms are either highly efficient but lag behind image-based ones in terms of accuracy or sacrifice the sparsity and efficiency of events to achieve comparable results. To overcome this, we propose a novel hybrid event- and frame-based object detector that preserves the advantages of each modality and thus does not suffer from this tradeoff. Our method exploits the high temporal resolution and sparsity of events and the rich but low temporal resolution information in standard images to generate efficient, high-rate object detections, reducing perceptual and computational latency. We show that the use of a 20 Hz RGB camera plus an event camera can achieve the same latency as a 5,000 Hz camera with the bandwidth of a 45 Hz camera without compromising accuracy. Our approach paves the way for efficient and robust perception in edge-case scenarios by uncovering the potential of event cameras.
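
The headline numbers come from a bandwidth/latency trade-off that can be checked with back-of-the-envelope arithmetic. The sketch below uses assumed sensor parameters (resolution, event rate, bytes per event), so the resulting figures only approximate the 45 Hz-equivalent bandwidth quoted above.

# Rough bandwidth/latency comparison between frame cameras and a hybrid frame+event setup.
# All sensor parameters below are illustrative assumptions, not the values used in the paper.
frame_rate_hz = 20            # RGB camera frame rate
width, height, bytes_px = 1280, 720, 3
frame_bandwidth = frame_rate_hz * width * height * bytes_px          # bytes/s for the 20 Hz camera

event_rate = 20e6             # assumed average events/s in a driving scene
bytes_per_event = 4           # packed (x, y, polarity, timestamp) encoding assumption
event_bandwidth = event_rate * bytes_per_event

hybrid_bandwidth = frame_bandwidth + event_bandwidth
equivalent_frame_rate = hybrid_bandwidth / (width * height * bytes_px)
print(f"hybrid bandwidth ~ {hybrid_bandwidth / 1e6:.1f} MB/s "
      f"(like a {equivalent_frame_rate:.0f} Hz frame camera)")

# Perceptual latency: a frame camera waits up to one frame period for new data,
# while events arrive continuously between frames.
print(f"worst-case frame latency: {1000 / frame_rate_hz:.1f} ms vs ~sub-ms for events")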


References

Nature24_Gehrig

Daniel Gehrig, Davide Scaramuzza

Low Latency Automotive Vision with Event Cameras

Nature, 2024.

PDF Open Access Code Dataset Dataset Helper Tools YouTube


Data-driven Feature Tracking for Event Cameras with and without Frames


Because of their high temporal resolution, increased resilience to motion blur, and very sparse output, event cameras have been shown to be ideal for low-latency and low-bandwidth feature tracking, even in challenging scenarios. Existing feature tracking methods for event cameras are either handcrafted or derived from first principles but require extensive parameter tuning, are sensitive to noise, and do not generalize to different scenarios due to unmodeled effects. To tackle these deficiencies, we introduce the first data-driven feature tracker for event cameras, which leverages low-latency events to track features detected in an intensity frame. We achieve robust performance via a novel frame attention module, which shares information across feature tracks. Our tracker is designed to operate in two distinct configurations: solely with events or in a hybrid mode incorporating both events and frames. The hybrid model offers two setups: an aligned configuration where the event and frame cameras share the same viewpoint, and a hybrid stereo configuration where the event camera and the standard camera are positioned side by side. This side-by-side arrangement is particularly valuable as it provides depth information for each feature track, enhancing its utility in applications such as visual odometry and simultaneous localization and mapping.
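
For the hybrid stereo configuration, depth per feature track follows from ordinary rectified-stereo triangulation once the same feature is tracked in both cameras. The snippet below is a generic triangulation example with made-up calibration values and pixel coordinates, not the tracker itself.

# Minimal rectified-stereo triangulation: depth of a feature tracked in the frame camera
# (left) and the event camera (right). All values are illustrative assumptions.
import numpy as np

fx = 600.0          # focal length in pixels (assumed, shared after rectification)
baseline_m = 0.06   # distance between the frame and event camera (assumed)

u_left = np.array([410.2, 388.7, 365.1])    # feature column in the frame camera over time
u_right = np.array([395.0, 372.9, 348.6])   # corresponding column in the event camera

disparity = u_left - u_right                # pixels
depth_m = fx * baseline_m / disparity       # classic rectified-stereo depth
print(depth_m)                              # one depth estimate per tracked timestamp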


References

Arxiv24_Messikommer

Nico Messikommer, Carter Fang, Mathias Gehrig, Giovanni Cioffi, Davide Scaramuzza

Data-driven Feature Tracking for Event Cameras with and without Frames

arXiv, 2024.

PDF Code


A Hybrid ANN-SNN Architecture for Low-Power and Low-Latency Visual Perception


Spiking Neural Networks (SNNs) are a class of bio-inspired neural networks that promise to bring low-power and low-latency inference to edge devices through the use of asynchronous and sparse processing. However, being temporal models, SNNs depend heavily on expressive states to generate predictions on par with classical artificial neural networks (ANNs). These states converge only after long transient periods and quickly decay in the absence of input data, leading to higher latency, higher power consumption, and lower accuracy. In this work, we address this issue by initializing the state with an auxiliary ANN running at a low rate. The SNN then uses the state to generate predictions with high temporal resolution until the next initialization phase. Our hybrid ANN-SNN model thus combines the best of both worlds: it does not suffer from long state transients and state decay thanks to the ANN, and it can generate predictions with high temporal resolution, low latency, and low power thanks to the SNN. We show, for the task of event-based 2D and 3D human pose estimation, that our method consumes 88% less power with only a 4% decrease in performance compared to its fully-ANN counterparts when run at the same inference rate. Moreover, when compared to SNNs, our method achieves a 74% lower error. This research thus provides a new understanding of how ANNs and SNNs can be used to maximize their respective benefits.
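
The interplay between the two networks can be illustrated with a toy loop: a stand-in "ANN" re-initializes the membrane state of a leaky integrate-and-fire layer at a low rate, and the spiking layer then integrates high-rate input in between. All rates, sizes, and constants below are arbitrary; this is not the paper's architecture.

# Toy hybrid ANN-SNN loop: a (stand-in) ANN re-initializes the membrane state at a low rate;
# a leaky integrate-and-fire (LIF) layer then runs at a much higher rate in between.
import numpy as np

rng = np.random.default_rng(0)
n_neurons, tau, v_th = 32, 0.9, 1.0

def ann_state_init(frame_feature):
    # Stand-in for the auxiliary ANN: maps a slow feature vector to an initial membrane state.
    return np.tanh(frame_feature)

v = np.zeros(n_neurons)                      # membrane potentials
for t in range(1000):                        # 1 kHz "event-rate" loop (illustrative)
    if t % 100 == 0:                         # 10 Hz ANN initialization phase
        v = ann_state_init(rng.normal(size=n_neurons))
    inp = rng.normal(scale=0.3, size=n_neurons)   # stand-in for sparse event input
    v = tau * v + inp                        # leaky integration
    spikes = v >= v_th
    v[spikes] = 0.0                          # reset after spiking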


References


Asude Aydin, Mathias Gehrig, Daniel Gehrig, Davide Scaramuzza

A Hybrid ANN-SNN Architecture for Low-Power and Low-Latency Visual Perception

IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024.

PDF Code


State Space Models for Event Cameras


Today, state-of-the-art deep neural networks that process event-camera data first convert a temporal window of events into dense, grid-like input representations. As such, they exhibit poor generalizability when deployed at higher inference frequencies (i.e., smaller temporal windows) than the ones they were trained on. We address this challenge by introducing state-space models (SSMs) with learnable timescale parameters to event-based vision. This design adapts to varying frequencies without the need to retrain the network at different frequencies. Additionally, we investigate two strategies to counteract aliasing effects when deploying the model at higher frequencies. We comprehensively evaluate our approach against existing methods based on RNN and Transformer architectures across various benchmarks, including Gen1 and 1 Mpx event camera datasets. Our results demonstrate that SSM-based models train 33% faster and also exhibit minimal performance degradation when tested at higher frequencies than the training input. Traditional RNN and Transformer models exhibit performance drops of more than 20 mAP, with SSMs having a drop of 3.76 mAP, highlighting the effectiveness of SSMs in event-based vision tasks.
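
The frequency-adaptation property comes from the fact that a continuous-time state-space model can be re-discretized with a different timestep at test time. A minimal NumPy/SciPy sketch of zero-order-hold discretization, with assumed A and B matrices, is shown below; the learnable timescales and the full event-vision backbone are not reproduced here.

# Discretizing a linear state-space model x' = A x + B u with an adjustable timescale:
# the same continuous parameters can be run at different inference frequencies by
# changing dt, which is the property exploited for event cameras. Illustrative sketch.
import numpy as np
from scipy.linalg import expm

A = np.array([[-1.0, 0.5], [0.0, -2.0]])    # continuous-time state matrix (assumed values)
B = np.array([[1.0], [0.5]])

def discretize_zoh(A, B, dt):
    # Zero-order-hold discretization: Ad = exp(A dt), Bd = A^-1 (Ad - I) B
    Ad = expm(A * dt)
    Bd = np.linalg.solve(A, (Ad - np.eye(A.shape[0])) @ B)
    return Ad, Bd

for dt in (0.02, 0.005):                    # e.g. 50 Hz at training time, 200 Hz at test time
    Ad, Bd = discretize_zoh(A, B, dt)
    x = np.zeros((2, 1))
    for _ in range(int(0.1 / dt)):          # simulate 100 ms with constant unit input
        x = Ad @ x + Bd * 1.0
    print(dt, x.ravel())                    # same final state regardless of the rate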


References


Nikola Zubić, Mathias Gehrig, Davide Scaramuzza

State Space Models for Event Cameras

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, 2024.

Spotlight Presentation.

PDF Code Video


An N-Point Linear Solver for Line and Motion Estimation with Event Cameras


Event cameras respond primarily to edges (formed by strong gradients) and are thus particularly well suited for line-based motion estimation. Recent work has shown that events generated by a single line each satisfy a polynomial constraint which describes a manifold in the space-time volume. Multiple such constraints can be solved simultaneously to recover the partial linear velocity and line parameters. In this work, we show that, with a suitable line parametrization, this system of constraints is actually linear in the unknowns, which allows us to design a novel linear solver. Unlike existing solvers, our linear solver (i) is fast and numerically stable since it does not rely on expensive root finding, (ii) can solve both minimal and overdetermined systems with more than 5 events (i.e., N >= 5), and (iii) admits the characterization of all degenerate cases and multiple solutions. The found line parameters are singularity-free and have a fixed scale, which eliminates the need for auxiliary constraints typically encountered in previous work. To recover the full linear camera velocity we fuse observations from multiple lines with a novel velocity averaging scheme that relies on a geometrically motivated residual, and thus solves the problem more efficiently than previous schemes, which minimize an algebraic residual. Extensive experiments in synthetic and real-world settings demonstrate that our method surpasses the previous work in numerical stability and operates over 600 times faster.


References


Ling Gao, Daniel Gehrig, Hang Su, Davide Scaramuzza, Laurent Kneip

An N-Point Linear Solver for Line and Motion Estimation with Event Cameras

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, 2024.

Oral Presentation.

PDF Project Page


Mitigating Motion Blur in Neural Radiance Fields with Events and Frames


Neural Radiance Fields (NeRFs) have shown great potential in novel view synthesis. However, they struggle to render sharp images when the data used for training is affected by motion blur. On the other hand, event cameras excel in dynamic scenes as they measure brightness changes with microsecond resolution and are thus only marginally affected by blur. Recent methods attempt to enhance NeRF reconstructions under camera motion by fusing frames and events. However, they face challenges in recovering accurate color content or constrain the NeRF to a set of predefined camera poses, harming reconstruction quality in challenging conditions. This paper proposes a novel formulation addressing these issues by leveraging both model- and learning-based modules. We explicitly model the blur formation process, exploiting the event double integral as an additional model-based prior. Additionally, we model the event-pixel response using an end-to-end learnable response function, allowing our method to adapt to non-idealities in the real event-camera sensor. We show, on synthetic and real data, that the proposed approach outperforms existing deblur NeRFs that use only frames as well as those that combine frames and events by +6.13dB and +2.48dB, respectively.
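
The model-based prior mentioned above relates a blurry frame to a latent sharp frame through a double integral of the event stream. The toy NumPy example below synthesizes that relation on random data with an assumed contrast threshold; it only illustrates the blur model, not the full NeRF pipeline.

# Toy version of the event double integral blur model used as a model-based prior:
# a blurry frame equals the latent sharp frame multiplied by the time-average of
# exp(c * E(t)), where E(t) accumulates event polarities over the exposure. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
H, W, T = 8, 8, 50                     # tiny image and 50 time bins over the exposure
c = 0.2                                # assumed event contrast threshold

latent_sharp = rng.uniform(0.2, 1.0, size=(H, W))        # latent image at the reference time
event_polarity = rng.integers(-1, 2, size=(T, H, W))     # per-pixel polarity sums per bin

E = np.cumsum(event_polarity, axis=0)                    # E(t): accumulated polarity
blur_kernel = np.exp(c * E).mean(axis=0)                 # (1/T) sum_t exp(c * E(t))
blurred = latent_sharp * blur_kernel                     # synthesized blurry frame

# Inverting the relation recovers the sharp frame from the blurry one plus events:
recovered = blurred / blur_kernel
assert np.allclose(recovered, latent_sharp)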


References

CVPR24_Cannici

Marco Cannici, Davide Scaramuzza

Mitigating Motion Blur in Neural Radiance Fields with Events and Frames

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, 2024.

PDF Code and Dataset Video


Contrastive Initial State Buffer for Reinforcement Learning


In Reinforcement Learning, the trade-off between exploration and exploitation poses a complex challenge for achieving efficient learning from limited samples. While recent works have been effective in leveraging past experiences for policy updates, they often overlook the potential of reusing past experiences for data collection. Independent of the underlying RL algorithm, we introduce the concept of a Contrastive Initial State Buffer, which strategically selects states from past experiences and uses them to initialize the agent in the environment in order to guide it toward more informative states. We validate our approach on two complex robotic tasks without relying on any prior information about the environment: (i) locomotion of a quadruped robot traversing challenging terrains and (ii) a quadcopter drone racing through a track. The experimental results show that our initial state buffer achieves higher task performance than the nominal baseline while also speeding up training convergence.
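
A stripped-down sketch of the buffer interface is given below: states from past rollouts are stored, and reset states are chosen to differ from recently used ones. The actual contrastive selection rule in the paper is more sophisticated; the distance-based heuristic here is only a placeholder.

# Simplified sketch of an initial state buffer: store states visited in past episodes and
# pick reset states that are far (in feature space) from recently used ones.
import numpy as np

rng = np.random.default_rng(0)

class InitialStateBuffer:
    def __init__(self, capacity=1000, dim=4):
        self.states = np.empty((0, dim))
        self.capacity = capacity

    def add(self, state):
        # Keep only the most recent `capacity` states.
        self.states = np.vstack([self.states, state])[-self.capacity:]

    def sample(self, recent, k=1):
        if len(self.states) == 0:
            return None
        # Prefer stored states that differ most from the recently used reset states.
        dists = np.linalg.norm(self.states[:, None] - recent[None], axis=-1).min(axis=1)
        return self.states[np.argsort(dists)[-k:]]

buf = InitialStateBuffer()
for _ in range(50):                            # pretend rollouts filling the buffer
    buf.add(rng.normal(size=4))
reset_state = buf.sample(recent=np.zeros((1, 4)))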


References

ICRA24_Messikommer

Nico Messikommer, Yunlong Song, Davide Scaramuzza

Contrastive Initial State Buffer for Reinforcement Learning

IEEE International Conference on Robotics and Automation (ICRA), Yokohama, 2024.

PDF YouTube Code


Dense Continuous-Time Optical Flow from Events and Frames


We present a method for estimating dense continuous-time optical flow. Traditional dense optical flow methods compute the pixel displacement between two images. Due to missing information, these approaches cannot recover the pixel trajectories in the blind time between two images. In this work, we show that it is possible to compute per-pixel, continuous-time optical flow by additionally using events from an event camera. Events provide temporally fine-grained information about movement in image space due to their asynchronous nature and microsecond response time. We leverage these benefits to predict pixel trajectories densely in continuous time via parameterized Bezier curves. To achieve this, we introduce multiple innovations to build a neural network with strong inductive biases for this task: First, we build multiple sequential correlation volumes in time using event data. Second, we use Bezier curves to index these correlation volumes at multiple timestamps along the trajectory. Third, we use the retrieved correlation to update the Bezier curve representations iteratively. Our method can optionally include image pairs to boost performance further. The proposed approach outperforms existing image-based and event-based methods, achieving 11.5% lower EPE on DSEC-Flow. Finally, we introduce a novel synthetic dataset, MultiFlow, for pixel trajectory regression, on which our method is currently the only successful approach.
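
Continuous-time trajectories are obtained by evaluating the regressed Bezier control points at arbitrary timestamps. The short NumPy function below evaluates a Bezier curve via the Bernstein basis; the control points are made up, since in the method they come from the network.

# Evaluating a per-pixel Bezier trajectory at arbitrary timestamps, which is what allows
# dense continuous-time flow rather than a single two-frame displacement.
import numpy as np
from math import comb

def bezier(control_points, t):
    # control_points: (K+1, 2) array of 2D points; t: (T,) timestamps in [0, 1].
    K = len(control_points) - 1
    t = np.asarray(t)
    basis = np.stack([comb(K, i) * t**i * (1 - t)**(K - i) for i in range(K + 1)], axis=1)
    return basis @ control_points          # (T, 2) positions along the trajectory

ctrl = np.array([[0.0, 0.0], [1.5, 0.5], [2.0, 2.0], [3.0, 2.5]])   # cubic curve for one pixel
timestamps = np.linspace(0.0, 1.0, 5)
print(bezier(ctrl, timestamps))            # pixel displacement at 5 intermediate times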


References


Mathias Gehrig, Manasi Muglikar, Davide Scaramuzza

Dense Continuous-Time Optical Flow from Events and Frames

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024.

PDF Code and Dataset


Revisiting Token Pruning for Object Detection and Instance Segmentation


Vision Transformers (ViTs) have shown impressive performance in computer vision, but their high computational cost, quadratic in the number of tokens, limits their adoption in computation-constrained applications. However, this large number of tokens may not be necessary, as not all tokens are equally important. In this paper, we investigate token pruning to accelerate inference for object detection and instance segmentation, extending prior works from image classification. Through extensive experiments, we offer four insights for dense tasks: (i) tokens should not be completely pruned and discarded, but rather preserved in the feature maps for later use; (ii) reactivating previously pruned tokens can further enhance model performance; (iii) a dynamic pruning rate based on images is better than a fixed pruning rate; and (iv) a lightweight, 2-layer MLP can effectively prune tokens, achieving accuracy comparable to complex gating networks with a simpler design. We evaluate the impact of these design choices on the COCO dataset and present a method integrating these insights that outperforms prior-art token pruning models, significantly reducing the performance drop from ~1.5 mAP to ~0.3 mAP for both boxes and masks. Compared to the dense counterpart that uses all tokens, our method achieves up to 34% faster inference speed for the whole network and 46% for the backbone.
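
Insight (iv) and the preservation of pruned tokens can be sketched in a few lines of PyTorch (assumed available): a 2-layer MLP scores tokens, only the top-k pass through an expensive attention block, and the pruned tokens are written back unchanged into the output feature map. Shapes, the keep ratio, and the block itself are illustrative, not the paper's detector.

# Sketch of the pruning idea: a light 2-layer MLP scores tokens, only the top-k tokens go
# through the expensive block, and pruned tokens are kept in the feature map (not discarded)
# so later layers or the detection head can still use them.
import torch
import torch.nn as nn

B, N, C = 2, 196, 256                      # batch, tokens, channels (assumed)
tokens = torch.randn(B, N, C)

scorer = nn.Sequential(nn.Linear(C, 64), nn.GELU(), nn.Linear(64, 1))     # lightweight MLP
block = nn.TransformerEncoderLayer(d_model=C, nhead=8, batch_first=True)  # expensive part

scores = scorer(tokens).squeeze(-1)        # (B, N) keep-scores
k = int(0.7 * N)                           # keep ratio (dynamic in the paper, fixed here)
keep_idx = scores.topk(k, dim=1).indices   # indices of tokens that stay active

kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
processed = block(kept)                    # run attention only on the kept tokens

output = tokens.clone()                    # pruned tokens are preserved as-is ...
output.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, C), processed)   # ... kept ones updated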


References


Y. Liu, M. Gehrig, N. Messikommer, M. Cannici, D. Scaramuzza

Revisiting Token Pruning for Object Detection and Instance Segmentation

IEEE Winter Conference on Applications of Computer Vision (WACV), 2024.

PDF Code Video


From Chaos Comes Order: Ordering Event Representations for Object Recognition and Detection


Selecting dense event representations for deep neural networks is exceedingly slow since it involves training a neural network for each representation and selecting the best one based on the validation score. In this work, we eliminate this bottleneck by selecting the representation based on the Gromov-Wasserstein Discrepancy (GWD) on the validation set. This metric is 200 times faster to compute and preserves the task performance ranking of event representations across multiple representations, network backbones, datasets, and tasks. We use it to, for the first time, perform a hyperparameter search on a large family of event representations, revealing new and powerful event representations that exceed the state of the art. Our optimized representations outperform existing representations by 1.7 mAP on the 1 Mpx dataset and 0.3 mAP on the Gen1 dataset, two established object detection benchmarks, and reach a 3.8% higher classification score on the mini N-ImageNet benchmark. Moreover, we outperform the state of the art by 2.1 mAP on Gen1 and state-of-the-art feed-forward methods by 6.0 mAP on the 1 Mpx dataset. This work opens a new unexplored field of explicit representation optimization for event-based learning.


References


Nikola Zubić, Daniel Gehrig, Mathias Gehrig, Davide Scaramuzza

From Chaos Comes Order: Ordering Event Representations for Object Recognition and Detection

IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

PDF Code


Autonomous Power Line Inspection with Drones via Perception-Aware MPC

Drones have the potential to revolutionize power line inspection by increasing productivity, reducing inspection time, improving data quality, and eliminating the risks for human operators. Current state-of-the-art systems for power line inspection have two shortcomings: (i) control is decoupled from perception and needs accurate information about the location of the power lines and masts; (ii) collision avoidance is decoupled from the power line tracking, which results in poor tracking in the vicinity of the power masts, and, consequently, in decreased data quality for visual inspection. In this work, we propose a model predictive controller (MPC) that overcomes these limitations by tightly coupling perception and action. Our controller generates commands that maximize the visibility of the power lines while, at the same time, safely avoiding the power masts. For power line detection, we propose a lightweight learning-based detector that is trained only on synthetic data and is able to transfer zero-shot to real-world power line images. We validate our system in simulation and real-world experiments on a mock-up power line infrastructure. We release the code and dataset open-source.


References


J. Xing*, G. Cioffi*, J. Hidalgo-Carrió, D. Scaramuzza

Autonomous Power Line Inspection with Drones via Perception-Aware MPC

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023.

Best Paper Award!

PDF YouTube Code


Champion-level Drone Racing using Deep Reinforcement Learning

First-person view (FPV) drone racing is a televised sport in which professional competitors pilot high-speed aircraft through a three-dimensional circuit. Each pilot sees the environment from their drone's perspective via video streamed from an onboard camera. Reaching the level of professional pilots with an autonomous drone is challenging since the robot needs to fly at its physical limits while estimating its speed and location in the circuit exclusively from onboard sensors. Here we introduce Swift, an autonomous system that can race physical vehicles at the level of the human world champions. The system combines deep reinforcement learning in simulation with data collected in the physical world. Swift competed against three human champions, including the world champions of two international leagues, in real-world head-to-head races. Swift won multiple races against each of the human champions and demonstrated the fastest recorded race time. This work represents a milestone for mobile robotics and machine intelligence, which may inspire the deployment of hybrid learning-based solutions in other physical systems.


References


Elia Kaufmann, Leonard Bauersfeld, Antonio Loquercio, Matthias Müller, Vladlen Koltun, Davide Scaramuzza

Champion-level Drone Racing using Deep Reinforcement Learning

Nature, 2023.

PDF YouTube (Ours) YouTube (Nature) Dataset


Active Exposure Control for Robust Visual Odometry in HDR Environments

We propose an active exposure control method to improve the robustness of visual odometry in HDR (high dynamic range) environments. Our method evaluates the proper exposure time by maximizing a robust, gradient-based image quality metric. The optimization is achieved by exploiting the photometric response function of the camera. Our exposure control method is evaluated in different real-world environments and outperforms the built-in auto-exposure function of the camera. To validate the benefit of our approach, we adapt a state-of-the-art visual odometry pipeline (SVO) to work with varying exposure time and demonstrate improved performance using our exposure control method in challenging HDR environments. We release the code open-source.
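
The core idea, choosing the exposure that maximizes a gradient-based quality metric through the camera's photometric response, can be mimicked with a toy simulation. The response function, metric, and candidate exposures below are simple stand-ins for the calibrated quantities used in the paper.

# Toy version of gradient-based exposure selection: simulate several exposure times through
# a photometric response function and keep the one that maximizes an image-gradient metric.
import numpy as np

rng = np.random.default_rng(0)
scene_radiance = rng.uniform(0.0, 4.0, size=(64, 64))        # HDR scene (arbitrary units)

def capture(radiance, exposure):
    # Simple saturating photometric response (stand-in for the calibrated response function).
    return np.clip(radiance * exposure, 0.0, 1.0)

def gradient_metric(img):
    gy, gx = np.gradient(img)
    return np.sum(np.hypot(gx, gy))                          # total gradient magnitude

candidates = [0.05, 0.1, 0.25, 0.5, 1.0]                     # candidate exposure times
best = max(candidates, key=lambda e: gradient_metric(capture(scene_radiance, e)))
print("selected exposure:", best)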


References

ICRA17_Zhang

Z. Zhang, C. Forster, D. Scaramuzza

Active Exposure Control for Robust Visual Odometry in HDR Environments

IEEE International Conference on Robotics and Automation (ICRA), 2017.

PDF PPT YouTube Code


Real-time Neural MPC: Deep Learning Model Predictive Control for Quadrotors and Agile Robotic Platforms


Model Predictive Control (MPC) has become a popular framework in embedded control for high-performance autonomous systems. However, to achieve good control performance using MPC, an accurate dynamics model is key. To maintain real-time operation, the dynamics models used on embedded systems have been limited to simple first-principle models, which substantially limits their representative power. In contrast to such simple models, machine learning approaches, specifically neural networks, have been shown to accurately model even complex dynamic effects, but their large computational complexity has hindered their combination with fast real-time iteration loops. With this work, we present Real-time Neural MPC, a framework to efficiently integrate large, complex neural network architectures as dynamics models within a model-predictive control pipeline. Our experiments, performed in simulation and the real world onboard a highly agile quadrotor platform, demonstrate the capability of the described system to run learned models with previously infeasible modeling capacity using gradient-based online-optimization MPC. Compared to prior implementations of neural networks in online-optimization MPC, we can leverage models of over 4000 times larger parametric capacity in a 50 Hz real-time window on an embedded platform. Further, we show the feasibility of our framework on real-world problems by reducing the positional tracking error by up to 82% when compared to state-of-the-art MPC approaches without neural network dynamics.


References


Tim Salzmann, Elia Kaufmann, Jon Arrizabalaga, Marco Pavone, Davide Scaramuzza, Markus Ryll

Real-time Neural MPC: Deep Learning Model Predictive Control for Quadrotors and Agile Robotic Platforms

IEEE Robotics and Automation Letters (RA-L), 2023.

PDF Code


Cracking Double-Blind Review: Authorship Attribution with Deep Learning



Double-blind peer review is considered a pillar of academic research because it is perceived to ensure a fair, unbiased, and fact-centered scientific discussion. Yet, experienced researchers can often correctly guess from which research group an anonymous submission originates, biasing the peer-review process. In this work, we present a transformer-based, neural-network architecture that only uses the text content and the author names in the bibliography to attribute an anonymous manuscript to an author. To train and evaluate our method, we created the largest authorship-identification dataset to date. It leverages all research papers publicly available on arXiv amounting to over 2 million manuscripts. In arXiv-subsets with up to 2,000 different authors, our method achieves an unprecedented authorship attribution accuracy, where up to 73% of papers are attributed correctly. We present a scaling analysis to highlight the applicability of the proposed method to even larger datasets when sufficient compute capabilities are more widely available to the academic community. Furthermore, we analyze the attribution accuracy in settings where the goal is to identify all authors of an anonymous manuscript. Thanks to our method, we are not only able to predict the author of an anonymous work but we also provide empirical evidence of the key aspects that make a paper attributable. We have open-sourced the necessary tools to reproduce our experiments.


References


Leonard Bauersfeld*, Angel Romero*, Manasi Muglikar, Davide Scaramuzza

Cracking Double-Blind Review: Authorship Attribution with Deep Learning

PLOS ONE, 2023.

PDF Code


Microgravity induces overconfidence in perceptual decision-making

Does gravity affect decision-making? This question comes into sharp focus as plans for interplanetary human space missions solidify. In the framework of Bayesian brain theories, gravity encapsulates a strong prior, anchoring agents to a reference frame via the vestibular system, informing their decisions and possibly their integration of uncertainty. What happens when such a strong prior is altered? We address this question using a self-motion estimation task in a space analog environment under conditions of altered gravity. Two participants were cast as remote drone operators orbiting Mars in a virtual reality environment on board a parabolic flight, where both hyper- and microgravity conditions were induced. From a first-person perspective, participants viewed a drone exiting a cave and had to first predict a collision and then provide a confidence estimate of their response. We evoked uncertainty in the task by manipulating the motion's trajectory angle. Post-decision subjective confidence reports were negatively predicted by stimulus uncertainty, as expected. Uncertainty alone did not impact overt behavioral responses (performance, choice) differentially across gravity conditions. However, microgravity predicted higher subjective confidence, especially in interaction with stimulus uncertainty. These results suggest that variables relating to uncertainty affect decision-making distinctly in microgravity, highlighting the possible need for automatized, compensatory mechanisms when considering human factors in space research.


References


Leyla Loued-Khenissi*, Christian Pfeiffer*, Rupal Saxena, Shivam Adarsh, Davide Scaramuzza

Microgravity induces overconfidence in perceptual decision-making

Nature Scientific Reports, 2023.

PDF YouTube Dataset


Recurrent Vision Transformers for Object Detection with Event Cameras


We present Recurrent Vision Transformers (RVTs), a novel backbone for object detection with event cameras. Event cameras provide visual information with sub-millisecond latency at a high dynamic range and with strong robustness against motion blur. These unique properties offer great potential for low-latency object detection and tracking in time-critical scenarios. Prior work in event-based vision has achieved outstanding detection performance but at the cost of substantial inference time, typically beyond 40 milliseconds. By revisiting the high-level design of recurrent vision backbones, we reduce inference time by a factor of 5 while retaining similar performance. To achieve this, we explore a multi-stage design that utilizes three key concepts in each stage: First, a convolutional prior that can be regarded as a conditional positional embedding. Second, local and dilated global self-attention for spatial feature interaction. Third, recurrent temporal feature aggregation to minimize latency while retaining temporal information. RVTs can be trained from scratch to reach state-of-the-art performance on event-based object detection, achieving an mAP of 47.2% on the Gen1 automotive dataset. At the same time, RVTs offer fast inference (12 ms on a T4 GPU) and favorable parameter efficiency (5 times fewer parameters than prior art). Our study brings new insights into effective design choices that could be fruitful for research beyond event-based vision.


References


Mathias Gehrig and Davide Scaramuzza

Recurrent Vision Transformers for Object Detection with Event Cameras

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

PDF YouTube Code


Hilti-Oxford Dataset: A Millimetre-Accurate Benchmark for Simultaneous Localization and Mapping


Simultaneous Localization and Mapping (SLAM) is being deployed in real-world applications; however, many state-of-the-art solutions still struggle in common scenarios. A key necessity in progressing SLAM research is the availability of high-quality datasets and fair and transparent benchmarking. To this end, we have created the Hilti-Oxford Dataset to push state-of-the-art SLAM systems to their limits. The dataset has a variety of challenges, ranging from sparse and regular construction sites to a 17th-century neoclassical building with fine details and curved surfaces. To encourage multi-modal SLAM approaches, we designed a data collection platform featuring a lidar, five cameras, and an IMU (Inertial Measurement Unit). With the goal of benchmarking SLAM algorithms for tasks where accuracy and robustness are paramount, we implemented a novel ground-truth collection method that enables our dataset to accurately measure SLAM pose errors with millimeter accuracy. To further ensure accuracy, the extrinsics of our platform were verified with a micrometer-accurate scanner, and temporal calibration was managed online using hardware time synchronization. The multi-modality and diversity of our dataset attracted a large field of academic and industrial researchers to enter the second edition of the Hilti SLAM challenge, which concluded in June 2022. The results of the challenge show that while the top three teams could achieve an accuracy of 2 cm or better for some sequences, the performance dropped off in more difficult sequences.


References


L. Zhang, M. Helmberger, L. Fu, D. Wisth, M. Camurri, D. Scaramuzza, M. Fallon

Hilti-Oxford Dataset: A Millimeter-Accurate Benchmark for Simultaneous Localization and Mapping

IEEE Robotics and Automation Letters (RA-L), 2023.

PDF YouTube Dataset


Training Efficient Controllers via Analytic Policy Gradient

Control design for robotic systems is complex and often requires solving an optimization to follow a trajectory accurately. Online optimization approaches like Model Predictive Control (MPC) have been shown to achieve great tracking performance, but require high computing power. Conversely, learning-based offline optimization approaches, such as Reinforcement Learning (RL), allow fast and efficient execution on the robot but hardly match the accuracy of MPC in trajectory tracking tasks. In systems with limited compute, such as aerial vehicles, an accurate controller that is efficient at execution time is imperative. We propose an Analytic Policy Gradient (APG) method to tackle this problem. APG exploits the availability of differentiable simulators by training a controller offline with gradient descent on the tracking error. We address training instabilities that frequently occur with APG through curriculum learning and experiment on a widely used controls benchmark, the CartPole, and two common aerial robots, a quadrotor and a fixed-wing drone. Our proposed method outperforms both model-based and model-free RL methods in terms of tracking error. Concurrently, it achieves similar performance to MPC while requiring more than an order of magnitude less computation time. Our work provides insights into the potential of APG as a promising control method for robotics. To facilitate the exploration of APG, we open-source our code and make it publicly available.
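
A minimal version of the APG recipe, assuming PyTorch, is shown below: a small policy network controls a differentiable double-integrator "simulator", and the tracking error is backpropagated through the rollout. The dynamics, horizon, and network are illustrative and far simpler than the quadrotor and fixed-wing models used in the paper.

# Minimal analytic-policy-gradient loop on a differentiable point-mass "simulator":
# the controller is trained by backpropagating the tracking error through the dynamics,
# rather than by RL or online MPC.
import torch
import torch.nn as nn

dt = 0.05
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))   # state -> control input
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def rollout(x0, reference, steps=40):
    x = x0                                        # state: [pos, vel, ref_pos, ref_vel]
    loss = 0.0
    for t in range(steps):
        u = policy(x)
        pos = x[0] + dt * x[1]
        vel = x[1] + dt * u.squeeze()             # differentiable double-integrator dynamics
        x = torch.stack([pos, vel, reference[t], torch.zeros(())])
        loss = loss + (pos - reference[t]) ** 2   # tracking error accumulates through time
    return loss / steps

reference = torch.sin(torch.linspace(0, 2 * torch.pi, 40))
for epoch in range(200):
    opt.zero_grad()
    loss = rollout(torch.zeros(4), reference)
    loss.backward()                               # gradients flow through the simulator
    opt.step()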


References


Nina Wiedemann, Valentin Wueest, Antonio Loquercio, Matthias Mueller, Dario Floreano, Davide Scaramuzza

Training Efficient Controllers via Analytic Policy Gradient

IEEE International Conference on Robotics and Automation (ICRA), 2023.

PDF YouTube Code


Tightly-coupled Fusion of Global Positional Measurements in Optimization-based Visual-Inertial Odometry


We are excited to fully open-source our code to tightly fuse global positional measurements in visual-inertial odometry (VIO)! Motivated by the goal of achieving robust, drift-free pose estimation in long-term autonomous navigation, in this work we propose a methodology to fuse global positional information with visual and inertial measurements in a tightly-coupled nonlinear-optimization-based estimator. Unlike previous works, which are loosely coupled, the tightly-coupled approach allows exploiting the correlations amongst all the measurements. A sliding window of the most recent system states is estimated by minimizing a cost function that includes visual re-projection errors, relative inertial errors, and global positional residuals. We use IMU preintegration to formulate the inertial residuals and leverage the outcome of this algorithm to efficiently compute the global position residuals. The experimental results show that the proposed method achieves accurate and globally consistent estimates, with a negligible increase in computational cost. Our method consistently outperforms the loosely-coupled fusion approach. The mean position error is reduced by up to 50% with respect to the loosely-coupled approach in outdoor Unmanned Aerial Vehicle (UAV) flights, where the global position information is given by noisy GPS measurements. To the best of our knowledge, this is the first work in which global positional measurements are tightly fused in an optimization-based visual-inertial odometry algorithm, leveraging the IMU preintegration method to define the global positional factors.
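
In the notation suggested by the abstract (the symbols below are chosen here for illustration, not taken from the paper), the sliding-window objective combines three residual types, roughly:

\min_{\mathcal{X}} \;
\sum_{k}\sum_{j} \left\| r^{k,j}_{\mathrm{vis}} \right\|^2_{\Sigma_{\mathrm{vis}}}
+ \sum_{k} \left\| r^{k}_{\mathrm{imu}} \right\|^2_{\Sigma_{\mathrm{imu}}}
+ \sum_{k} \left\| r^{k}_{\mathrm{gp}} \right\|^2_{\Sigma_{\mathrm{gp}}}

where r_vis are visual reprojection residuals, r_imu the preintegrated relative inertial residuals, r_gp the global-position residuals, and each term is weighted by its measurement covariance.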


References

IROS20_Cioffi

Giovanni Cioffi, Davide Scaramuzza

Tightly-coupled Fusion of Global Positional Measurements in Optimization-based Visual-Inertial Odometry

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, 2020.

PDF YouTube Code


Data-driven Feature Tracking for Event Cameras

Because of their high temporal resolution, increased resilience to motion blur, and very sparse output, event cameras have been shown to be ideal for low-latency and low-bandwidth feature tracking, even in challenging scenarios. Existing feature tracking methods for event cameras are either handcrafted or derived from first principles but require extensive parameter tuning, are sensitive to noise, and do not generalize to different scenarios due to unmodeled effects. To tackle these deficiencies, we introduce the first data-driven feature tracker for event cameras, which leverages low-latency events to track features detected in a grayscale frame. We achieve robust performance via a novel frame attention module, which shares information across feature tracks. By directly transferring zero-shot from synthetic to real data, our data-driven tracker outperforms existing approaches in relative feature age by up to 120% while also achieving the lowest latency. This performance gap is further increased to 130% by adapting our tracker to real data with a novel self-supervision strategy.


References


Nico Messikommer*, Carter Fang*, Mathias Gehrig, Davide Scaramuzza

Data-driven Feature Tracking for Event Cameras

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

Award Candidate.

PDF YouTube Code


Event-based Shape from Polarization

State-of-the-art solutions for Shape-from-Polarization (SfP) suffer from a speed-resolution tradeoff: they either sacrifice the number of polarization angles measured or necessitate lengthy acquisition times due to framerate constraints, thus compromising either accuracy or latency. We tackle this tradeoff using event cameras. Event cameras operate at microsecond resolution with negligible motion blur, and output a continuous stream of events that precisely measures how light changes over time asynchronously. We propose a setup that consists of a linear polarizer rotating at high speed in front of an event camera. Our method uses the continuous event stream caused by the rotation to reconstruct relative intensities at multiple polarizer angles. Experiments demonstrate that our method outperforms physics-based baselines using frames, reducing the MAE by 25% on synthetic and real-world datasets. In the real world, we observe, however, that challenging conditions (i.e., when few events are generated) harm the performance of physics-based solutions. To overcome this, we propose a learning-based approach that learns to estimate surface normals even at low event rates, improving the physics-based approach by 52% on the real-world dataset. The proposed system achieves an acquisition speed equivalent to 50 fps (more than twice the frame rate of the commercial polarization sensor) while retaining the spatial resolution of 1 MP. Our evaluation is based on the first large-scale dataset for event-based SfP.
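
The quantity reconstructed from the rotating polarizer is the sinusoidal intensity profile at each pixel, from which the degree and angle of linear polarization (and hence surface normals) follow. The synthetic NumPy example below fits that sinusoid by linear least squares; the angle sampling and ground-truth values are arbitrary, and the event-based reconstruction and learning components are not shown.

# Recovering polarization parameters per pixel from intensities at several polarizer angles,
# the quantity the rotating-polarizer + event-camera setup measures. Synthetic toy example.
import numpy as np

angles = np.deg2rad(np.arange(0, 180, 15))                # polarizer angles (assumed sampling)
I_un, dolp, aolp = 0.8, 0.4, np.deg2rad(30)               # ground-truth unpolarized intensity,
                                                          # degree and angle of linear polarization
intensity = I_un * (1 + dolp * np.cos(2 * angles - 2 * aolp))   # Malus-type intensity model

# Linear least-squares fit of I(phi) = a0 + a1*cos(2*phi) + a2*sin(2*phi)
A = np.stack([np.ones_like(angles), np.cos(2 * angles), np.sin(2 * angles)], axis=1)
a0, a1, a2 = np.linalg.lstsq(A, intensity, rcond=None)[0]

dolp_est = np.hypot(a1, a2) / a0
aolp_est = 0.5 * np.arctan2(a2, a1)
print(dolp_est, np.rad2deg(aolp_est))                     # ~0.4 and ~30 degrees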


References


Manasi Muglikar, Leonard Bauersfeld, Diederik P. Moeys, Davide Scaramuzza

Event-based Shape from Polarization

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

PDF Video Code Project Page


Event-based Agile Object Catching with a Quadrupedal Robot

Quadrupedal robots are conquering various applications in indoor and outdoor environments due to their capability to navigate challenging uneven terrains. Exteroceptive information greatly enhances this capability since perceiving their surroundings allows them to adapt their controller and thus achieve higher levels of robustness. However, sensors such as LiDARs and RGB cameras do not provide sufficient information to quickly and precisely react in a highly dynamic environment since they suffer from a bandwidth-latency tradeoff. They require significant bandwidth at high frame rates while featuring significant perceptual latency at lower frame rates, thereby limiting their versatility on resource-constrained platforms. In this work, we tackle this problem by equipping our quadruped with an event camera, which does not suffer from this tradeoff due to its asynchronous and sparse operation. By leveraging the low latency of the events, we push the limits of quadruped agility and demonstrate, for the first time, high-speed ball catching with a net. We show that our quadruped equipped with an event camera can catch objects at maximum speeds of 15 m/s from 4 meters, with a success rate of 83%. With a VGA event camera, our method runs at 100 Hz on an NVIDIA Jetson Orin.


References


Benedek Forrai*, Takahiro Miki*, Daniel Gehrig*, Marco Hutter, Davide Scaramuzza

Event-based Agile Object Catching with a Quadrupedal Robot

IEEE International Conference on Robotics and Automation (ICRA), London, 2023.

PDF YouTube Code


Learned Inertial Odometry for Autonomous Drone Racing

Inertial odometry is an attractive solution to the problem of state estimation for agile quadrotor flight. It is inexpensive, lightweight, and it is not affected by perceptual degradation. However, only relying on the integration of the inertial measurements for state estimation is infeasible. The errors and time-varying biases present in such measurements cause the accumulation of large drift in the pose estimates. Recently, inertial odometry has made significant progress in estimating the motion of pedestrians. State-of-the-art algorithms rely on learning a motion prior that is typical of humans but cannot be transferred to drones. In this work, we propose a learning-based odometry algorithm that uses an inertial measurement unit (IMU) as the only sensor modality for autonomous drone racing tasks. The core idea of our system is to couple a model-based filter, driven by the inertial measurements, with a learning-based module that has access to the thrust measurements. We show that our inertial odometry algorithm is superior to the state-of-the-art filter-based and optimization-based visual-inertial odometry as well as the state-of-the-art learned-inertial odometry in estimating the pose of an autonomous racing drone. Additionally, we show that our system is comparable to a visual-inertial odometry solution that uses a camera and exploits the known gate location and appearance. We believe that the application in autonomous drone racing paves the way for novel research in inertial odometry for agile quadrotor flight. We release the code open-source.


References


G. Cioffi, L. Bauersfeld, E. Kaufmann, D. Scaramuzza

Learned Inertial Odometry for Autonomous Drone Racing

IEEE Robotics and Automation Letters (RA-L), 2023.

PDF YouTube IROS Presentation Code


Agilicious: Open-Source and Open-Hardware Agile Quadrotor for Vision-Based Flight

We are excited to present Agilicious, a co-designed hardware and software framework tailored to autonomous, agile quadrotor flight. It is completely open-source and open-hardware and supports both model-based and neural-network-based controllers. It also provides high thrust-to-weight and torque-to-inertia ratios for agility, onboard vision sensors, GPU-accelerated compute hardware for real-time perception and neural-network inference, a real-time flight controller, and a versatile software stack. In contrast to existing frameworks, Agilicious offers a unique combination of a flexible software stack and high-performance hardware. We compare Agilicious with prior works and demonstrate it on different agile tasks, using both model-based and neural-network-based controllers. Our demonstrators include trajectory tracking at up to 5 g and 70 km/h in a motion-capture system, and vision-based acrobatic flight and obstacle avoidance in both structured and unstructured environments using solely onboard perception. Finally, we demonstrate its use for hardware-in-the-loop simulation in virtual-reality environments. Thanks to its versatility, we believe that Agilicious supports the next generation of scientific and industrial quadrotor research. For more details, check our paper, video, and webpage.


References


Philipp Foehn, Elia Kaufmann, Angel Romero, Robert Penicka, Sihao Sun, Leonard Bauersfeld, Thomas Laengle, Giovanni Cioffi, Yunlong Song, Antonio Loquercio, and Davide Scaramuzza

Agilicious: Open-Source and Open-Hardware Agile Quadrotor for Vision-Based Flight

Science Robotics, 2022.

PDF YouTube Webpage


Event-based Vision meets Deep Learning on Steering Prediction for Self-driving Cars


Event cameras are bio-inspired vision sensors that naturally capture the dynamics of a scene, filtering out redundant information. This paper presents a deep neural network approach that unlocks the potential of event cameras on a challenging motion-estimation task: prediction of a vehicle's steering angle. To make the best out of this sensor-algorithm combination, we adapt state-of-the-art convolutional architectures to the output of event sensors and extensively evaluate the performance of our approach on a publicly available large-scale event-camera dataset (~1000 km). We present qualitative and quantitative explanations of why event cameras allow robust steering prediction even in cases where traditional cameras fail, e.g., challenging illumination conditions and fast motion. Finally, we demonstrate the advantages of leveraging transfer learning from traditional to event-based vision, and show that our approach outperforms state-of-the-art algorithms based on standard cameras.


References


A.I. Maqueda, A. Loquercio, G. Gallego, N. Garcia, D. Scaramuzza

Event-based Vision meets Deep Learning on Steering Prediction for Self-driving Cars

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, 2018.

PDF Poster YouTube Code


Data-Efficient Collaborative Decentralized Thermal-Inertial Odometry

We propose a system solution to achieve data-efficient, decentralized state estimation for a team of flying robots using thermal images and inertial measurements. Each robot can fly independently and exchange data when possible to refine its state estimate. Our system front-end applies an online photometric calibration to refine the thermal images so as to enhance feature tracking and place recognition. Our system back-end uses a covariance intersection fusion strategy to neglect the cross-correlation between agents so as to lower memory usage and computational cost. The communication pipeline uses a Vector of Locally Aggregated Descriptors (VLAD) to construct a request-response policy that requires low bandwidth usage. We test our collaborative method on both synthetic and real-world data. Our results show that the proposed method improves trajectory estimation by up to 46% with respect to an individual-agent approach, while reducing the communication exchange by up to 89%. Datasets and code are released to the public, extending the already-public JPL xVIO library.
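
Covariance intersection, the fusion rule mentioned for the back-end, combines two estimates without knowing their cross-correlation by convexly mixing their information matrices. A minimal numerical example with made-up estimates is below; the full decentralized VIO back-end is, of course, much richer.

# Covariance intersection fuses two estimates with unknown cross-correlation by a convex
# combination of their information matrices. Minimal 2D numerical example.
import numpy as np

x1, P1 = np.array([1.0, 2.0]), np.diag([0.5, 0.2])    # estimate held by agent 1
x2, P2 = np.array([1.2, 1.8]), np.diag([0.3, 0.4])    # estimate received from agent 2

def covariance_intersection(x1, P1, x2, P2, w):
    P = np.linalg.inv(w * np.linalg.inv(P1) + (1 - w) * np.linalg.inv(P2))
    x = P @ (w * np.linalg.inv(P1) @ x1 + (1 - w) * np.linalg.inv(P2) @ x2)
    return x, P

# Pick the weight that minimizes the fused covariance (trace criterion).
ws = np.linspace(0.01, 0.99, 99)
w_best = min(ws, key=lambda w: np.trace(covariance_intersection(x1, P1, x2, P2, w)[1]))
x_fused, P_fused = covariance_intersection(x1, P1, x2, P2, w_best)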


References


V. Polizzi, R. Hewitt, J. Hidalgo-Carrió, J. Delaune and D. Scaramuzza

Data-Efficient Collaborative Decentralized Thermal-Inertial Odometry

IEEE Robotics and Automation Letters (RA-L), 2022.

PDF Poster Video Code & Datasets



The Hilti SLAM Challenge Dataset



We release the Hilti SLAM Challenge Dataset! The sensor platform used to collect this dataset contains a number of visual, lidar and inertial sensors which have all been rigorously calibrated. All data is temporally aligned to support precise multi-sensor fusion. Each dataset includes accurate ground truth to allow direct testing of SLAM results. Raw data as well as intrinsic and extrinsic sensor calibration data from twelve datasets in various environments is provided. Each environment represents common scenarios found in building construction sites in various stages of completion. For more details, check out our paper.

Dataset webpage


References

Arxiv21_HILTI

M. Helmberger, K. Morin, N. Kumar, D. Wang, Y. Yue, G. Cioffi, D. Scaramuzza

The Hilti SLAM Challenge Dataset

IEEE Robotics and Automation Letters (RA-L), 2022.

PDF Dataset Video Talk


ESS: Learning Event-based Semantic Segmentation from Still Images


Retrieving accurate semantic information in challenging high dynamic range (HDR) and high-speed conditions remains an open challenge for image-based algorithms due to severe image degradations. Event cameras promise to address these challenges since they feature a much higher dynamic range and are resilient to motion blur. Nonetheless, semantic segmentation with event cameras is still in its infancy, chiefly due to the lack of high-quality, labeled datasets. In this work, we introduce ESS (Event-based Semantic Segmentation), which tackles this problem by directly transferring the semantic segmentation task from existing labeled image datasets to unlabeled events via unsupervised domain adaptation (UDA). Compared to existing UDA methods, our approach aligns recurrent, motion-invariant event embeddings with image embeddings. For this reason, our method neither requires video data nor per-pixel alignment between images and events and, crucially, does not need to hallucinate motion from still images. Additionally, we introduce DSEC-Semantic, the first large-scale event-based dataset with fine-grained labels. We show that using image labels alone, ESS outperforms existing UDA approaches, and when combined with event labels, it even outperforms state-of-the-art supervised approaches on both DDD17 and DSEC-Semantic. Finally, ESS is general-purpose, which unlocks the vast amount of existing labeled image datasets and paves the way for new and exciting research directions in fields previously inaccessible to event cameras.


References


Z. Sun*, N. Messikommer*, D. Gehrig, D. Scaramuzza

ESS: Learning Event-based Semantic Segmentation from Still Images

European Conference on Computer Vision (ECCV), Tel Aviv, 2022.

PDF YouTube Code Dataset


Exploring Event Camera-based Odometry for Planetary Robots


Due to their resilience to motion blur and high robustness in low-light and high dynamic range conditions, event cameras are poised to become enabling sensors for vision-based exploration on future Mars helicopter missions. However, existing event-based visual-inertial odometry (VIO) algorithms either suffer from high tracking errors or are brittle, since they cannot cope with significant depth uncertainties caused by an unforeseen loss of tracking or other effects. In this work, we introduce EKLT-VIO, which addresses both limitations by combining a state-of-the-art event-based frontend with a filter-based backend. This makes it both accurate and robust to uncertainties, outperforming event- and frame-based VIO algorithms on challenging benchmarks by 32%. In addition, we demonstrate accurate performance in hover-like conditions (outperforming existing event-based methods) as well as high robustness in newly collected Mars-like and high-dynamic-range sequences, where existing frame-based methods fail. In doing so, we show that event-based VIO is the way forward for vision-based exploration on Mars.


References


F. Mahlknecht, D. Gehrig, J. Nash, F. M. Rockenbauer, B. Morrell, J. Delaune and D. Scaramuzza

Exploring Event Camera-based Odometry for Planetary Robots

IEEE Robotics and Automation Letters (RA-L), 2022.

PDF Code & Datasets Video


Ultimate SLAM? Combining Events, Images, and IMU for Robust Visual SLAM in HDR and High Speed Scenarios


In this paper, we present the first state estimation pipeline that leverages the complementary advantages of a standard camera and an event camera by fusing events, standard frames, and inertial measurements in a tightly-coupled manner. We show on the Event Camera Dataset that our hybrid pipeline leads to an accuracy improvement of 130% over event-only pipelines, and 85% over visual-inertial systems using standard frames only, while still being computationally tractable.

Furthermore, we use our pipeline to demonstrate, to the best of our knowledge, the first autonomous quadrotor flight using an event camera for state estimation, unlocking flight scenarios that were not reachable with traditional visual-inertial odometry, such as low-light environments and high dynamic range scenes.


References

RAL18_VidalRebecq

A. Rosinol Vidal, H. Rebecq, T. Horstschaefer, D. Scaramuzza

Ultimate SLAM? Combining Events, Images, and IMU for Robust Visual SLAM in HDR and High Speed Scenarios

IEEE Robotics and Automation Letters (RA-L), 2018.

PDF YouTube ICRA18 Video Pitch Poster Results (raw trajectories) Project Webpage Source Code


Event-aided Direct Sparse Odometry


We introduce EDS, a direct monocular visual odometry method that uses events and frames. Our algorithm leverages the event generation model to track the camera motion in the blind time between frames. The method formulates a direct probabilistic approach over observed brightness increments. Per-pixel brightness increments are predicted using a sparse set of selected 3D points and are compared to the events via the brightness-increment error to estimate camera motion. The method recovers a semi-dense 3D map using photometric bundle adjustment. EDS is the first method to perform 6-DOF VO using events and frames with a direct approach. By design, it overcomes the problem of changing appearance in indirect methods. We also show that, for a target error performance, EDS can work at lower frame rates than state-of-the-art frame-based VO solutions. This opens the door to low-power motion-tracking applications where frames are sparingly triggered "on demand" and our method tracks the motion in between. We release code and datasets to the public.


References

EDS

J. Hidalgo-Carrió, G. Gallego, D. Scaramuzza

Event-aided Direct Sparse Odometry

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

Oral Presentation.

PDF YouTube Code Poster Dataset CVPR Video


Time Lens++: Event-based Frame Interpolation with Parametric Non-linear Flow and Multi-scale Fusion

Time Lens++: Event-based Frame Interpolation with Parametric Non-linear Flow and Multi-scale Fusion

Recently, video frame interpolation using a combination of frame- and event-based cameras has surpassed traditional image-based methods both in terms of performance and memory efficiency. However, current methods still suffer from (i) brittle image-level fusion of complementary interpolation results, which fails in the presence of artifacts in the fused image, (ii) potentially temporally inconsistent and inefficient motion estimation procedures that run for every inserted frame, and (iii) low-contrast regions that do not trigger events and thus cause events-only motion estimation to generate artifacts. Moreover, previous methods were only tested on datasets consisting of planar and faraway scenes, which do not capture the full complexity of the real world. In this work, we address the above problems by introducing multi-scale feature-level fusion and by computing one-shot non-linear inter-frame motion from events and images, which can be efficiently sampled for image warping. We also collect the first large-scale events-and-frames dataset, consisting of more than 100 challenging scenes with depth variations, captured with a new experimental setup based on a beamsplitter. We show that our method improves the reconstruction quality by up to 0.2 dB in terms of PSNR and by up to 15% in LPIPS score.


References

Time Lens++: Event-based Frame Interpolation with Parametric Non-linear Flow and Multi-scale Fusion

S. Tulyakov, A. Bochicchio, D. Gehrig, S. Georgoulis, Y. Li, D. Scaramuzza

Time Lens++: Event-based Frame Interpolation with Parametric Non-linear Flow and Multi-scale Fusion

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022, New Orleans, USA.

PDF YouTube Dataset Project Webpage


AEGNN: Asynchronous Event-based Graph Neural Networks

The best performing learning algorithms devised for event cameras work by first converting events into dense representations that are then processed using standard CNNs. However, these steps discard both the sparsity and high temporal resolution of events, leading to high computational burden and latency. For this reason, recent works have adopted Graph Neural Networks (GNNs), which process events as "static" spatio-temporal graphs, which are inherently "sparse". We take this trend one step further by introducing Asynchronous, Event-based Graph Neural Networks (AEGNNs), a novel event-processing paradigm that generalizes standard GNNs to process events as "evolving" spatio-temporal graphs. AEGNNs follow efficient update rules that restrict recomputation of network activations only to the nodes affected by each new event, thereby significantly reducing both computation and latency for event-by-event processing. AEGNNs are easily trained on synchronous inputs and can be converted to efficient, "asynchronous" networks at test time. We thoroughly validate our method on object classification and detection tasks, where we show up to a 200-fold reduction in computational complexity (FLOPs), with similar or even better performance than state-of-the-art asynchronous methods. This reduction in computation directly translates to an 8-fold reduction in computational latency when compared to standard GNNs, which opens the door to low-latency event-based processing.
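
To make the asynchronous update rule above concrete, here is a minimal, self-contained sketch in Python (NumPy only, not the released code): a new event node is inserted into a spatio-temporal graph, and only the nodes whose neighbourhood contains the new event have their aggregated features recomputed. The mean aggregation and the fixed radius are simplifying assumptions standing in for a trained GNN layer.

import numpy as np

RADIUS = 3.0                                     # spatio-temporal neighbourhood radius (assumed)
positions, features, aggregated = [], [], []     # per-node graph state

def insert_event(xyt, feature):
    """Insert a new event node and recompute activations only where affected."""
    positions.append(np.asarray(xyt, dtype=float))
    features.append(np.asarray(feature, dtype=float))
    aggregated.append(np.zeros_like(features[-1]))
    new = positions[-1]
    # nodes whose neighbourhood now contains the new event
    affected = [i for i, p in enumerate(positions) if np.linalg.norm(p - new) <= RADIUS]
    for i in affected:                           # recompute only these nodes
        neighbours = [features[j] for j, p in enumerate(positions)
                      if np.linalg.norm(p - positions[i]) <= RADIUS]
        aggregated[i] = np.mean(neighbours, axis=0)   # mean stands in for a GNN layer

insert_event([10.0, 12.0, 0.001], [1.0, 0.0])
insert_event([11.0, 12.0, 0.002], [0.0, 1.0])    # only nearby nodes are updated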


References

AEGNN: Asynchronous Event-based Graph Neural Networks

S. Schaefer*, D. Gehrig*, D. Scaramuzza

AEGNN: Asynchronous Event-based Graph Neural Networks

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022, New Orleans, USA.

PDF Video CVPR22 Long Video Code Project Webpage


Visual Attention Prediction Improves Performance of Autonomous Drone Racing Agents

PlosOne22_Pfeiffer

Humans race drones faster than neural networks trained for end-to-end autonomous flight. This may be related to the ability of human pilots to select task-relevant visual information effectively. This work investigates whether neural networks capable of imitating human eye gaze behavior and attention can improve neural network performance for the challenging task of vision-based autonomous drone racing. We hypothesize that gaze-based attention prediction can be an efficient mechanism for visual information selection and decision making in a simulator-based drone racing task. We test this hypothesis using eye gaze and flight trajectory data from 18 human drone pilots to train a visual attention prediction model. We then use this visual attention prediction model to train an end-to-end controller for vision-based autonomous drone racing using imitation learning. We compare the drone racing performance of the attention-prediction controller to that of controllers using raw image inputs and image-based abstractions (i.e., feature tracks). Comparing success rates for completing a challenging race track by autonomous flight, our results show that the attention-prediction-based controller (88% success rate) outperforms the RGB-image (61% success rate) and feature-tracks (55% success rate) controller baselines. Furthermore, visual attention-prediction and feature-track based models showed better generalization performance than image-based models when evaluated on hold-out reference trajectories. Our results demonstrate that human visual attention prediction improves the performance of autonomous vision-based drone racing agents and provides an essential step towards vision-based, fast, and agile autonomous flight that eventually can reach and even exceed human performance.

Open-source Dataset on OSF


References

Visual Attention Prediction Improves Performance of Autonomous Drone Racing Agents

C. Pfeiffer, S. Wengeler, A. Loquercio, D. Scaramuzza

Visual Attention Prediction Improves Performance of Autonomous Drone Racing Agents

PLOS ONE, 2022.

PDF Dataset Code



Minimum-Time Quadrotor Waypoint Flight in Cluttered Environments


RAL22_Penicka

Planning minimum-time trajectories in cluttered environments with obstacles is a challenging problem. The quadrotor has to fly on the edge of its capabilities and, at the same time, avoid obstacles. However, planning such trajectories is vital for applications like search and rescue, where after disasters, it is essential to search for survivors as quickly as possible. Nevertheless, planning minimum-time trajectories in cluttered environments has not been addressed before in its entirety, using the full quadrotor model that can leverage the full actuation of the platform. We address this problem by using a hierarchical, sampling-based method with an incrementally more complex quadrotor model. The proposed method outperforms all related baselines in cluttered environments and is further validated in real-world flights at over 60 km/h.

Open-source Code on GitHub


References

RAL22_Penicka

R. Penicka, D. Scaramuzza

Minimum-Time Quadrotor Waypoint Flight in Cluttered Environments

Robotics and Automation Letters (RAL), 2022

PDF Code YouTube


Continuous-Time vs. Discrete-Time Vision-based SLAM: A Comparative Study

Robotic practitioners generally approach the vision-based SLAM problem through discrete-time formulations. This has the advantage of a consolidated theory and a very good understanding of success and failure cases. However, discrete-time SLAM needs tailored algorithms and simplifying assumptions when high-rate and/or asynchronous measurements, coming from different sensors, are present in the estimation process. Conversely, continuous-time SLAM, often overlooked by practitioners, does not suffer from these limitations. Indeed, it allows integrating new sensor data asynchronously without adding a new optimization variable for each new measurement. In this way, the integration of asynchronous or continuous high-rate streams of sensor data does not require tailored and highly-engineered algorithms, enabling the fusion of multiple sensor modalities in an intuitive fashion. On the downside, continuous time introduces a prior that could worsen the trajectory estimates in some unfavorable situations. In this work, we aim at systematically comparing the advantages and limitations of the two formulations in vision-based SLAM. To do so, we perform an extensive experimental analysis, varying robot type, speed of motion, and sensor modalities. Our experimental analysis suggests that, independently of the trajectory type, continuous-time SLAM is superior to its discrete counterpart whenever the sensors are not time-synchronized. In the context of this work, we developed, and open-source, a modular and efficient software architecture containing state-of-the-art algorithms to solve the SLAM problem in discrete and continuous time.
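
As a small illustration of the continuous-time idea (not the released software), the sketch below represents a trajectory by a handful of control poses and interpolates them continuously, with a cubic spline on translation and SLERP on rotation, so the pose at the exact timestamp of any asynchronous measurement can be queried without adding new optimization variables. All values are synthetic placeholders.

import numpy as np
from scipy.interpolate import CubicSpline
from scipy.spatial.transform import Rotation, Slerp

t_ctrl = np.linspace(0.0, 1.0, 5)                # control-pose timestamps
p_ctrl = np.random.rand(5, 3)                    # control positions (synthetic)
R_ctrl = Rotation.random(5, random_state=0)      # control orientations (synthetic)

pos_spline = CubicSpline(t_ctrl, p_ctrl)         # continuous translation
rot_slerp = Slerp(t_ctrl, R_ctrl)                # continuous rotation

def pose_at(t):
    """Query the continuous-time trajectory at an arbitrary timestamp."""
    return pos_spline(t), rot_slerp([t])[0]

p, R = pose_at(0.4217)   # e.g. the exact timestamp of an event or IMU sample
print(p, R.as_quat())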

Open-source Code on GitHub


References

RAL2021_Cioffi

G. Cioffi, T. Cieslewski, D. Scaramuzza

Continuous-Time vs. Discrete-Time Vision-based SLAM: A Comparative Study

Robotics and Automation Letters (RAL), 2022

PDF Code YouTube



Bridging the Gap between Events and Frames through Unsupervised Domain Adaptation


RAL22_Messikommer

Event cameras are novel sensors with outstanding properties such as high temporal resolution and high dynamic range. Despite these characteristics, event-based vision has been held back by the shortage of labeled datasets due to the novelty of event cameras. To overcome this drawback, we propose a task transfer method that allows models to be trained directly with labeled images and unlabeled event data. Compared to previous approaches, (i) our method transfers from single images to events instead of high frame rate videos, and (ii) it does not rely on paired sensor data. To achieve this, we leverage the generative event model to split event features into content and motion features. This feature split makes it possible to efficiently match the latent space of events and images, which is crucial for a successful task transfer. Thus, our approach unlocks the vast amount of existing image datasets for the training of event-based neural networks. Our task transfer method consistently outperforms methods applicable in the Unsupervised Domain Adaptation setting for object detection by 0.26 mAP (an increase of 93%) and for classification by 2.7% accuracy.

Open-source Code on GitHub


References

RAL22_Messikommer

N. Messikommer, D. Gehrig, M. Gehrig, D. Scaramuzza

Bridging the Gap between Events and Frames through Unsupervised Domain Adaptation

Robotics and Automation Letters (RAL), 2022.

PDF YouTube Code



Perception-Aware Perching on Powerlines with Multirotors


RAL22_Paneque

Multirotor aerial robots are becoming widely used for the inspection of powerlines. To enable continuous, robust inspection without human intervention, the robots must be able to perch on the powerlines to recharge their batteries. Highly versatile perching capabilities are necessary to adapt to the variety of configurations and constraints that are present in real powerline systems. This paper presents a novel perching trajectory generation framework that computes perception-aware, collision-free, and dynamically-feasible maneuvers to guide the robot to the desired final state. Trajectory generation is achieved via solving a Nonlinear Programming problem using the Primal-Dual Interior Point method. The problem considers the full dynamic model of the robot down to its single rotor thrusts and minimizes the final pose and velocity errors while avoiding collisions and maximizing the visibility of the powerline during the maneuver. The generated maneuvers consider both the perching and the posterior recovery trajectories. The framework adopts costs and constraints defined by efficient mathematical representations of powerlines, enabling online onboard execution in resource-constrained hardware. The method is validated on-board an agile quadrotor conducting powerline inspection and various perching maneuvers with final pitch values of up to 180 degrees.

Open-source Code on GitHub


References

RAL22_Paneque

J.L. Paneque, J.R. Martinez-de Dios, A. Ollero, D. Hanover, S. Sun, A. Romero, and D. Scaramuzza

Perception-Aware Perching on Powerlines with Multirotors

Robotics and Automation Letters (RAL), 2022

PDF YouTube Code



Policy Search for Model Predictive Control with Application to Agile Drone Flight


high_mpc_Yunlong

Policy Search and Model Predictive Control (MPC) are two different paradigms for robot control: policy search has the strength of automatically learning complex policies using experienced data, while MPC can offer optimal control performance using models and trajectory optimization. An open research question is how to leverage and combine the advantages of both approaches. In this work, we provide an answer by using policy search for automatically choosing high-level decision variables for MPC, which leads to a novel policy-search-for-model-predictive-control framework. Specifically, we formulate the MPC as a parameterized controller, where the hard-to-optimize decision variables are represented as high-level policies. Such a formulation allows optimizing policies in a self-supervised fashion. We validate this framework by focusing on a challenging problem in agile drone flight: flying a quadrotor through fast-moving gates. Experiments show that our controller achieves robust and real-time control performance in both simulation and the real world. The proposed framework offers a new perspective for merging learning and control.

Open-source Code on GitHub


References

high_mpc_Yunlong

Y. Song, D. Scaramuzza

Policy Search for Model Predictive Control with Application to Agile Drone Flight

IEEE Transactions on Robotics (T-RO), 2022.

PDF YouTube Project Webpage Code


SVO Pro

SVO Pro

We are excited to release the fully open-source SVO Pro! SVO Pro is the latest version of SVO, developed over the past few years in our lab. SVO Pro features support for different camera models, active exposure control, a sliding-window-based backend, and global bundle adjustment with loop closure. Check out the project page and the code on GitHub!


References

TRO17_Forster-SVO

C. Forster, Z. Zhang, M. Gassner, M. Werlberger, D. Scaramuzza

SVO: Semi-Direct Visual Odometry for Monocular and Multi-Camera Systems

IEEE Transactions on Robotics, Vol. 33, Issue 2, pages 249-265, Apr. 2017.

PDF YouTube Code


ESL: Event-based Structured Light

ESL: Event-based Structured Light

Event cameras are bio-inspired sensors providing significant advantages over standard cameras such as low latency, high temporal resolution, and high dynamic range. We propose a novel structured-light system using an event camera to tackle the problem of accurate and high-speed depth sensing. Our setup consists of an event camera and a laser-point projector that uniformly illuminates the scene in a raster scanning pattern during 16 ms. Previous methods match events independently of each other, and so they deliver noisy depth estimates at high scanning speeds in the presence of signal latency and jitter. In contrast, we optimize an energy function designed to exploit event correlations, called spatio-temporal consistency. The resulting method is robust to event jitter and therefore performs better at higher scanning speeds. Experiments demonstrate that our method can deal with high-speed motion and outperform state-of-the-art 3D reconstruction methods based on event cameras, reducing the RMSE by 83% on average, for the same acquisition time.


References

3DV21_Muglikar

M. Muglikar, G. Gallego, D. Scaramuzza

ESL: Event-based Structured Light

International Conference on 3D Vision (3DV), 2021.

PDF Video Code Project Page Poster


E-RAFT: Dense Optical Flow from Event Cameras

E-RAFT: Dense Optical Flow from Event Cameras

We propose to incorporate feature correlation and sequential processing into dense optical flow estimation from event cameras. Modern frame-based optical flow methods heavily rely on matching costs computed from feature correlation. In contrast, there exists no optical flow method for event cameras that explicitly computes matching costs. Instead, learning-based approaches using events usually resort to the U-Net architecture to estimate optical flow sparsely. Our key finding is that introducing correlation features significantly improves results compared to previous methods that solely rely on convolution layers. Compared to the state of the art, our proposed approach computes dense optical flow and reduces the end-point error by 23% on MVSEC. Furthermore, we show that all existing optical flow methods developed so far for event cameras have been evaluated on datasets with very small displacement fields, with a maximum flow magnitude of 10 pixels. Based on this observation, we introduce a new real-world dataset that exhibits displacement fields with magnitudes up to 210 pixels and 3 times higher camera resolution. Our proposed approach reduces the end-point error on this dataset by 66%.
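
The sketch below illustrates the matching-cost idea in isolation: an all-pairs correlation volume between the feature maps of two event representations, as used by RAFT-style architectures. The feature extractors are replaced by random tensors, so this is only a shape-level illustration, not the released network.

import torch

fmap1 = torch.randn(1, 64, 30, 40)   # features of the earlier event representation (placeholder)
fmap2 = torch.randn(1, 64, 30, 40)   # features of the later event representation (placeholder)

B, C, H, W = fmap1.shape
# all-pairs matching cost: dot product between every pixel of fmap1 and every pixel of fmap2
corr = torch.einsum('bchw,bcij->bhwij', fmap1, fmap2) / C ** 0.5
print(corr.shape)   # (1, H, W, H, W): the cost volume queried during iterative flow updates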


References

Dense Optical Flow from Event Cameras

M. Gehrig, M. Millhaeusler, D. Gehrig, D. Scaramuzza

E-RAFT: Dense Optical Flow from Event Cameras

International Conference on 3D Vision (3DV), 2021.

Oral Presentation. Oral Acceptance Rate: 13.2%.

Project Page PDF Code Dataset Benchmark YouTube



Learning High-Speed Flight in the Wild


Science21_Loquercio

This is the algorithm presented in our Science Robotics paper Learning High-Speed Flight in the Wild. Check out the code here. The code allows you to train end-to-end navigation policies to fly drones in previously unknown, challenging environments (snowy terrains, derailed trains, ruins, thick vegetation, and collapsed buildings), with only onboard sensing and computation. For more details, check out our paper.


References

Science21_Loquercio

A. Loquercio*, E. Kaufmann*, R. Ranftl, M. Müller, V. Koltun, D. Scaramuzza

Learning High-Speed Flight in the Wild

Science Robotics, 2021.

Project Webpage and Datasets PDF YouTube Code



Time-Optimal Planning for Quadrotor Waypoint Flight


Time-Optimal Quadrotor Trajectories

This is the planning algorithm presented in our Science Robotics paper Time-Optimal Planning for Quadrotor Waypoint Flight. Check out the code here; it comes with a simple example! This is the first method to allow planning time-optimal trajectories at the boundary of the performance envelope, correctly accounting for the single-rotor limits of a quadrotor vehicle. For more details, check out our paper.


References

Time-Optimal Quadrotor Trajectories

P. Foehn, A. Romero, D. Scaramuzza

Time-Optimal Planning for Quadrotor Waypoint Flight

Science Robotics, July 21, 2021.

PDF YouTube Code



Powerline Tracking with Event Cameras


IROS21_Dietsche

We release the event-based line tracker algorithm from our IROS paper Powerline Tracking with Event Cameras. Check out the code here! Our algorithm identifies lines in the stream of events by detecting planes in the spatio-temporal signal, and tracks them through time. The implementation runs onboard resource-constrained quadrotors and is capable of detecting multiple distinct lines in real time at rates of up to 320 thousand events per second. The tracker is able to persistently track the powerlines, with a mean line lifetime 10x longer than that of existing approaches. For more details, check out our paper.
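
The geometric core of the approach, namely that events generated by a moving line lie approximately on a plane in (x, y, t) space, can be illustrated in a few lines of NumPy. The sketch below fits such a plane with a total-least-squares SVD fit to synthetic events; the clustering, tracking, and filtering steps of the actual tracker are omitted.

import numpy as np

def fit_plane(events_xyt):
    """Total-least-squares plane fit: returns (unit normal, centroid) of Nx3 event coordinates."""
    centroid = events_xyt.mean(axis=0)
    _, _, vt = np.linalg.svd(events_xyt - centroid)
    return vt[-1], centroid                      # smallest singular vector is the plane normal

# synthetic events generated by a line sweeping across the sensor, plus timestamp noise
rng = np.random.default_rng(1)
xy = rng.uniform(0, 100, size=(300, 2))
t = 0.01 * xy[:, 0] + 0.02 * xy[:, 1] + 0.001 * rng.standard_normal(300)
normal, centroid = fit_plane(np.column_stack([xy, t]))
print(normal)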


References

IROS21_Dietsche

A. Dietsche, G. Cioffi, J. Hidalgo-Carrio, D. Scaramuzza

Powerline Tracking with Event Cameras

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, 2021.

PDF Code Dataset Video IROS 2021 Video Pitch Slides



EVO: Event-based, 6-DOF Parallel Tracking and Mapping in Real-Time


evo

We release EVO, an Event-based Visual Odometry algorithm from our RA-L paper EVO: Event-based, 6-DOF Parallel Tracking and Mapping in Real-Time. The code is implemented in C++ and runs in real-time on a laptop. Try it out for yourself on GitHub! Our algorithm successfully leverages the outstanding properties of event cameras to track fast camera motions while recovering a semi-dense 3D map of the environment. The implementation outputs up to several hundred pose estimates per second. Due to the nature of event cameras, our algorithm is unaffected by motion blur and operates very well in challenging, high dynamic range conditions with strong illumination changes.


References

EVO

H. Rebecq, T. Horstschaefer, G. Gallego, D. Scaramuzza

EVO: A Geometric Approach to Event-based 6-DOF Parallel Tracking and Mapping in Real-time

IEEE Robotics and Automation Letters (RA-L), 2016.

PDF PPT YouTube Code Poster



NeuroBEM: Hybrid Aerodynamic Quadrotor Model


NeuroBEM: Hybrid Aerodynamic Quadrotor Model

We release the full dataset associated with our upcoming RSS paper NeuroBEM: Hybrid Aerodynamic Quadrotor Model. The dataset features over 1h15min of highly aggressive maneuvers recorded at high accuracy in one of the world's largest optical tracking volumes. We provide time-aligned quadrotor states and motor commands recorded at 400 Hz in a curated dataset. For more details, check out our paper and dataset.


References

RSS21_Bauersfeld

L. Bauersfeld, E. Kaufmann, P. Foehn, S. Sun, D. Scaramuzza

NeuroBEM: Hybrid Aerodynamic Quadrotor Model

Robotics: Science and Systems, 2021.

PDF Video Dataset



GPU-Accelerated Frontend for High-Speed VIO now as ROS node


Arxiv20_Nagy

The recent introduction of powerful embedded GPUs has enabled algorithms to run well above standard video rates, yielding higher information processing capability and reduced latency. This code introduces an enhanced FAST feature detector that applies GPU-specific non-maxima suppression, enforces a spatial feature distribution, and extracts features simultaneously. It comes as a ROS node and runs on most modern NVIDIA CUDA-capable GPUs, but it is further specialized with the Jetson TX2 platform in mind, on which it achieves a throughput of more than 1000 fps.


References

Arxiv20_Nagy

Balazs Nagy, Philipp Foehn, D. Scaramuzza

Faster than FAST: GPU-Accelerated Frontend for High-Speed VIO

IEEE International Conference on Intelligent Robots and Systems, 2020.

PDF Code IROS20 Video Pitch


TimeLens: Event-based Video Frame Interpolation

TimeLens: Event-based Video Frame Interpolation

We release the code and datasets used in our recent work TimeLens: Event-based Video Frame Interpolation. The code is written in Python and uses PyTorch. Additionally, we release the High-Speed Event and RGB dataset used in this work. It was recorded with a 1 Mp Prophesee event camera and a 160 FPS, 1.5 Mp FLIR BlackFly-S RGB camera. It is the first dataset that pairs a high-resolution event camera with a high-speed RGB camera! The data is fully synchronized and aligned and features complex scenes such as bursting water balloons and fast-spinning objects.


References

TimeLens: Event-based Video Frame Interpolation

S. Tulyakov*, D. Gehrig*, S. Georgoulis, J. Erbach, M. Gehrig, Y. Li, D. Scaramuzza

TimeLens: Event-based Video Frame Interpolation

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, 2021.

PDF Video Dataset Project Page Slides


How to Calibrate Your Event Camera

How to Calibrate Your Event Camera

We propose a generic event camera calibration framework using image reconstruction. Instead of relying on blinking LED patterns or external screens, we show that neural-network-based image reconstruction is well suited for the task of intrinsic and extrinsic calibration of event cameras. The advantage of our proposed approach is that we can use standard calibration patterns that do not rely on active illumination. Furthermore, our approach makes it possible to perform extrinsic calibration between frame-based and event-based sensors without additional complexity. Both simulation and real-world experiments indicate that calibration through image reconstruction is accurate under common distortion models and a wide variety of distortion parameters.
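
A hedged sketch of the overall recipe follows: reconstruct intensity frames from the event stream (e.g., with a learned reconstruction network), then run a standard checkerboard calibration on those frames with OpenCV. The reconstruction step is abstracted away, and the pattern size and square size below are placeholders, not values from the paper.

import cv2
import numpy as np

def calibrate_from_reconstructions(frames, pattern=(6, 9), square_size=0.03):
    """frames: 8-bit grayscale images reconstructed from the event stream."""
    objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square_size
    obj_points, img_points = [], []
    for frame in frames:
        found, corners = cv2.findChessboardCorners(frame, pattern)
        if found:
            obj_points.append(objp)
            img_points.append(corners)
    # returns (reprojection RMS, camera matrix, distortion coefficients, rvecs, tvecs)
    return cv2.calibrateCamera(obj_points, img_points, frames[0].shape[::-1], None, None)

# rms, K, dist, rvecs, tvecs = calibrate_from_reconstructions(reconstructed_frames)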


References

How to Calibrate Your Event Camera

M. Muglikar*, M. Gehrig*, D. Gehrig, D. Scaramuzza

How to Calibrate Your Event Camera

IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), Nashville, 2021

PDF Video Code


AutoTune: Controller Tuning for High-Speed Flight

IROS21_Loquercio

The following code allows you to automatically tune your controller for the task of high-speed flight, where our approach outperforms the state of the art. In contrast to previous work, our algorithm does not assume any prior knowledge of the drone or of the optimization function and can deal with the multi-modal characteristics of the parameter optimization space.

We propose AutoTune, a sampling method based on Metropolis-Hastings sampling that can tune controllers to fly faster than ever before. Among other results, AutoTune reduces the tracking error when flying a physical platform compared to parameters tuned by a human expert.
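
For intuition, here is a generic Metropolis-Hastings loop over controller parameters in the spirit described above. The cost function is a toy placeholder standing in for "fly the trajectory with these gains and measure the tracking error"; the random-walk proposal, step size, and temperature are illustrative choices, not the released implementation.

import numpy as np

def tracking_cost(params):
    """Placeholder for flying the trajectory with these gains and returning the error."""
    return float(np.sum((params - np.array([4.0, 2.5, 8.0])) ** 2))   # toy quadratic cost

def metropolis_hastings(theta0, iters=200, step=0.3, temperature=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta, cost = np.asarray(theta0, dtype=float), tracking_cost(theta0)
    best_theta, best_cost = theta, cost
    for _ in range(iters):
        proposal = theta + step * rng.standard_normal(theta.shape)    # random-walk proposal
        c = tracking_cost(proposal)
        # always accept improvements; accept worse samples with Boltzmann probability
        if c < cost or rng.random() < np.exp((cost - c) / temperature):
            theta, cost = proposal, c
            if c < best_cost:
                best_theta, best_cost = proposal, c
    return best_theta, best_cost

print(metropolis_hastings([1.0, 1.0, 1.0]))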

Open-source Code on GitHub


References

IROS21_Loquercio

A. Loquercio, A. Saviolo, D. Scaramuzza

AutoTune: Controller Tuning for High-Speed Flight

Arxiv Preprint, 2021.

PDF Code YouTube



DSEC: A Stereo Event Camera Dataset for Driving Scenarios


DSEC: A Stereo Event Camera Dataset for Driving Scenarios

Once an academic venture, autonomous driving has received unparalleled corporate funding in the last decade. Still, the operating conditions of current autonomous cars are mostly restricted to ideal scenarios. This means that driving in challenging illumination conditions such as night, sunrise, and sunset remains an open problem. In these cases, standard cameras are being pushed to their limits in terms of low-light and high-dynamic-range performance. To address these challenges, we propose DSEC, a new dataset that contains such demanding illumination conditions and provides a rich set of sensory data. DSEC offers data from a wide-baseline stereo setup of two color frame cameras and two high-resolution monochrome event cameras. In addition, we collect lidar data and RTK GPS measurements, both hardware-synchronized with all camera data. One of the distinctive features of this dataset is the inclusion of high-resolution event cameras. Event cameras have received increasing attention for their high temporal resolution and high dynamic range performance. However, due to their novelty, event camera datasets in driving scenarios are rare. This work presents the first high-resolution, large-scale stereo dataset with event cameras. The dataset contains 53 sequences collected by driving in a variety of illumination conditions and provides ground truth disparity for the development and evaluation of event-based stereo algorithms.


References

DSEC: A Stereo Event Camera Dataset for Driving Scenarios

M. Gehrig, W. Aarents, D. Gehrig, D. Scaramuzza

DSEC: A Stereo Event Camera Dataset for Driving Scenarios

IEEE Robotics and Automation Letters (RA-L), 2021.

PDF Project Page and Dataset Code Teaser ICRA 2021 Video Pitch Slides


Human-Piloted Drone Racing: Visual Processing and Control

RAL21_Pfeiffer

Humans race drones faster than algorithms, despite being limited to a fixed camera angle, body-rate control, and response latencies in the order of hundreds of milliseconds. A better understanding of how human pilots select appropriate motor commands from highly dynamic visual information may provide key insights for solving current challenges in vision-based autonomous navigation. This paper investigates the relationship between human eye movements, control behavior, and flight performance in a drone racing task. We collected a multimodal dataset from 21 experienced drone pilots using a highly realistic drone racing simulator, also used to recruit professional pilots. Our results show task-specific improvements in drone racing performance over time. In particular, we found that eye gaze tracks future waypoints (i.e., gates), with first fixations occurring on average 1.5 seconds and 16 meters before reaching the gate. Moreover, human pilots consistently looked at the inside of the future flight path for lateral (i.e., left and right turns) and vertical maneuvers (i.e., ascending and descending). Finally, we found a strong correlation between pilots' eye movements and the commanded direction of quadrotor flight, with an average visual-motor response latency of 220 ms. These results highlight the importance of coordinated eye movements in human-piloted drone racing. We make our dataset publicly available.

Open-source Dataset on OSF


References

RAL21_Pfeiffer

C. Pfeiffer, D. Scaramuzza

Human-Piloted Drone Racing: Visual Processing and Control

IEEE Robotics and Automation Letters (RA-L), 2021.

PDF YouTube Slides Dataset



ESIM: an Open Event Camera Simulator now with GPU support!


Event cameras are revolutionary sensors that work radically differently from standard cameras. Instead of capturing intensity images at a fixed rate, event cameras measure changes of intensity asynchronously, in the form of a stream of events, which encode per-pixel brightness changes. In the last few years, their outstanding properties (asynchronous sensing, no motion blur, high dynamic range) have led to exciting vision applications, with very low-latency and high robustness.

We present ESIM: an efficient event camera simulator implemented in C++ and available open source, now with additional Python bindings and GPU support! ESIM can simulate arbitrary camera motion in 3D scenes, while providing events, standard images, and inertial measurements, together with full ground truth information including camera pose, velocity, as well as depth and optical flow maps.

Project page


References

CORL18_Rebecq

H. Rebecq, D. Gehrig, D. Scaramuzza

ESIM: an Open Event Camera Simulator

Conference on Robot Learning (CoRL), Zurich, 2018.

PDF YouTube Project page



Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction


combining_events

Event cameras are novel vision sensors that report per-pixel brightness changes as a stream of asynchronous "events". They offer significant advantages compared to standard cameras due to their high temporal resolution, high dynamic range and lack of motion blur. However, events only measure the varying component of the visual signal, which limits their ability to encode scene context. By contrast, standard cameras measure absolute intensity frames, which capture a much richer representation of the scene. Both sensors are thus complementary. However, due to the asynchronous nature of events, combining them with synchronous images remains challenging, especially for learning-based methods. This is because traditional recurrent neural networks (RNNs) are not designed for asynchronous and irregular data from additional sensors. To address this challenge, we introduce Recurrent Asynchronous Multimodal (RAM) networks, which generalize traditional RNNs to handle asynchronous and irregular data from multiple sensors. Inspired by traditional RNNs, RAM networks maintain a hidden state that is updated asynchronously and can be queried at any time to generate a prediction. We apply this novel architecture to monocular depth estimation with events and frames, where we show an improvement over state-of-the-art methods by up to 30% in terms of mean absolute depth error. To enable further research on multimodal learning with events, we release EventScape, a new dataset with events, intensity frames, semantic labels, and depth maps recorded in the CARLA simulator.
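
The sketch below (PyTorch, with placeholder encoders and dimensions, not the released code) captures the core mechanism: modality-specific encoders update a shared recurrent hidden state whenever a measurement arrives, and the state can be decoded into a prediction at any query time.

import torch
import torch.nn as nn

class AsyncRecurrentFusion(nn.Module):
    def __init__(self, feat_dim=64, hidden_dim=128):
        super().__init__()
        self.event_enc = nn.Linear(10, feat_dim)     # placeholder event-tensor encoder
        self.image_enc = nn.Linear(100, feat_dim)    # placeholder image encoder
        self.gru = nn.GRUCell(feat_dim, hidden_dim)  # shared recurrent state update
        self.decoder = nn.Linear(hidden_dim, 1)      # e.g. a depth value per query
        self.h = torch.zeros(1, hidden_dim)

    def update(self, measurement, modality):
        """Called whenever any sensor produces data, at its own rate."""
        encoder = self.event_enc if modality == "events" else self.image_enc
        self.h = self.gru(encoder(measurement), self.h)

    def query(self):
        """A prediction can be generated from the hidden state at any time."""
        return self.decoder(self.h)

model = AsyncRecurrentFusion()
model.update(torch.randn(1, 10), "events")   # an event packet arrives
model.update(torch.randn(1, 100), "images")  # later, an image arrives
print(model.query())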


References

Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction

D. Gehrig*, M. Rüegg*, M. Gehrig, J. Hidalgo-Carrió, D. Scaramuzza

Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction

IEEE Robotics and Automation Letters (RA-L), 2021.

PDF Code Project Page ICRA 2021 Video Pitch Slides



Data-Driven MPC for Quadrotors


data_driven_mpc

Aerodynamic forces render accurate high-speed trajectory tracking with quadrotors extremely challenging. These complex aerodynamic effects become a significant disturbance at high speeds, introducing large positional tracking errors, and are extremely difficult to model. To fly at high speeds, feedback control must be able to account for these aerodynamic effects in real-time. This necessitates a modelling procedure that is both accurate and efficient to evaluate. Therefore, we present an approach to model aerodynamic effects using Gaussian Processes, which we incorporate into a Model Predictive Controller to achieve efficient and precise real-time feedback control, leading to up to 70% reduction in trajectory tracking error at high speeds. We verify our method by extensive comparison to a state-of-the-art linear drag model in synthetic and real-world experiments at speeds of up to 14m/s and accelerations beyond 4g.
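
As an illustrative sketch of the modelling step (with synthetic data and scikit-learn instead of the authors' implementation), one can fit a Gaussian Process to the residual between measured and nominal acceleration as a function of body-frame velocity, and add its mean prediction as a correction term inside the MPC's dynamics model.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# synthetic training data: body-frame velocities and acceleration residuals (placeholders)
rng = np.random.default_rng(0)
v_body = rng.uniform(-15.0, 15.0, size=(400, 3))                         # m/s
a_residual = -0.05 * v_body * np.abs(v_body) + 0.02 * rng.standard_normal((400, 3))

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(v_body, a_residual)

def corrected_acceleration(a_nominal, v):
    """Nominal model acceleration plus the learned aerodynamic correction."""
    return a_nominal + gp.predict(v.reshape(1, -1))[0]

print(corrected_acceleration(np.array([0.0, 0.0, -9.81]), np.array([12.0, 0.0, 0.0])))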

Open-source Code on GitHub


References

RAL21_Torrente

G. Torrente*, E. Kaufmann*, P. Foehn, D. Scaramuzza

Data-Driven MPC for Quadrotors

IEEE Robotics and Automation Letters (RA-L), 2021.

PDF YouTube Code



Autonomous Quadrotor Flight despite Rotor Failure with Onboard Vision Sensors


ref_pose_gen

Check out the source code of our fault-tolerant flight controller, which controls a quadrotor after a motor failure using the nonlinear dynamic inversion approach. You can test our control algorithm in simulation or in real flights. The source code includes a vision-based state estimator that provides pose estimates even while the quadrotor spins at over 20 rad/s. We also release data logged from an onboard camera together with IMU measurements.

Open-source Code on GitHub


References

RAL21_Sun

S. Sun, G. Cioffi, C. de Visser, D. Scaramuzza

Autonomous Quadrotor Flight despite Rotor Failure with Onboard Vision Sensors: Frames vs. Events

IEEE Robotics and Automation Letters (RA-L), 2021.

PDF YouTube Code and Datasets



Reference Pose Verification and Generation for Visual Localization


ref_pose_gen

High-quality datasets with accurate 6 Degree-of-Freedom (DoF) reference poses are the foundation for benchmarking and improving existing visual localization methods. While it is not a trivial task (e.g., images may be taken under drastically different conditions), there is little work focusing on the generation of reference poses. By making use of learned local features and view synthesis, we propose a framework to verify and refine the reference poses of existing datasets and to generate new reference poses. Using our framework, we greatly improve the reference pose accuracy of the popular Aachen Day-Night dataset and extend the dataset with new nighttime images. The new dataset, namely the Aachen Day-Night v1.1 dataset, has been integrated into the online visual localization benchmarking service.

Aachen Day-Night dataset v1.1 on the Visual Localization Benchmark


References

IJCV20_Zhang

Z. Zhang, T. Sattler, D. Scaramuzza

Reference Pose Generation for Long-term Visual Localization via Learned Features and View Synthesis

International Journal of Computer Vision (IJCV), 2020.

PDF Online Visual Localization Benchmark



Reducing the Sim-to-Real Gap for Event Cameras


fif_opt

Event cameras are paradigm-shifting novel sensors that report asynchronous, per-pixel brightness changes called events with unparalleled low latency. This makes them ideal for high-speed, high-dynamic-range scenes where conventional cameras would fail. Recent work has demonstrated impressive results using Convolutional Neural Networks (CNNs) for video reconstruction and optic flow with events. We present strategies for improving the training data for event-based CNNs that result in a 20-40% boost in the performance of existing state-of-the-art (SOTA) video reconstruction networks retrained with our method, and up to 15% for optic flow networks. A challenge in evaluating event-based video reconstruction is the lack of high-quality ground truth images in existing datasets. To address this, we present a new High Quality Frames (HQF) dataset, containing events and ground truth frames from a DAVIS240C that are well-exposed and minimally motion-blurred. We evaluate our method on HQF as well as several existing major event camera datasets.

Open-source Code on GitHub


References

Reducing the Sim-to-Real Gap for Event Cameras

T. Stoffregen, C. Scheerlinck, D. Scaramuzza, T. Drummond, N. Barnes, L. Kleeman, R. Mahony

Reducing the Sim-to-Real Gap for Event Cameras

European Conference on Computer Vision (ECCV), Glasgow, 2020.

PDF Code and Datasets



Fast Image Reconstruction with an Event Camera


fif_opt

Event cameras are powerful new sensors able to capture high dynamic range with microsecond temporal resolution and no motion blur. Their strength is detecting brightness changes (called events) rather than capturing direct brightness images; however, algorithms can be used to convert events into usable image representations for applications such as classification. Previous works rely on hand-crafted spatial and temporal smoothing techniques to reconstruct images from events. State-of-the-art video reconstruction has recently been achieved using neural networks that are large (10M parameters) and computationally expensive, requiring 30ms for a forward-pass at 640 x 480 resolution on a modern GPU. We propose a novel neural network architecture for video reconstruction from events that is smaller (38k vs. 10M parameters) and faster (10ms vs. 30ms) than the state of the art, with minimal impact on performance.

Open-source Code on GitHub


References

Fast Image Reconstruction with an Event Camera

C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. Mahony, D. Scaramuzza

Fast Image Reconstruction with an Event Camera

IEEE Winter Conference on Applications of Computer Vision (WACV), 2020.

PDF YouTube Code and Datasets



Learning Monocular Dense Depth from Events


If you are interested in deep learning and event cameras, you should try our code out! We propose a recurrent architecture to solve the depth prediction task and show significant improvement over standard feed-forward methods. In particular, our method generates dense depth predictions using a monocular setup, which has not been shown previously. We pretrain our model using a new dataset containing events and depth maps recorded in the CARLA simulator. We test our method on the Multi Vehicle Stereo Event Camera Dataset (MVSEC). The code allows you to benchmark our model and generate new training data.

Open-source Code on GitHub


References

3DV20_Hidalgo

J. Hidalgo-Carrio, D. Gehrig, D. Scaramuzza

Learning Monocular Dense Depth from Events

IEEE International Conference on 3D Vision (3DV), Fukuoka, 2020

PDF Code Dataset



Primal-Dual Mesh Convolutional Neural Networks


fif_opt

The following code allows you to train and test our Primal-Dual Mesh Convolutional Neural Network on the tasks of shape classification and shape segmentation, where our approach outperforms the state of the art. Existing mesh processing algorithms either consider the input mesh as a graph, and do not exploit specific geometric properties of meshes for feature aggregation and downsampling, or are specialized for meshes but rely on a rigid definition of convolution that does not properly capture the local topology of the mesh.

We propose a method that combines the advantages of both types of approaches, while addressing their limitations: we extend a primal-dual framework drawn from the graph-neural-network literature to triangle meshes, and define convolutions on two types of graphs constructed from an input mesh. If you are interested in 3D data processing and geometric deep learning, you should try our code out!

Open-source Code on GitHub


References

PD-MeshNet

F. Milano, A. Loquercio, A. Rosinol, D. Scaramuzza, L. Carlone

Primal-Dual Mesh Convolutional Neural Networks

Conference on Neural Information Processing Systems (NeurIPS), 2020

PDF Code



Flightmare: A Flexible Quadrotor Simulator


fif_opt

We release a new modular quadrotor simulator: Flightmare. Flightmare is composed of two main components: a configurable rendering engine built on Unity and a flexible physics engine for dynamics simulation. These two components are fully decoupled and can run independently of each other. Flightmare comes with several desirable features: (i) a large multi-modal sensor suite, including an interface to extract the 3D point cloud of the scene; (ii) an API for reinforcement learning which can simulate hundreds of quadrotors in parallel; and (iii) an integration with a virtual-reality headset for interaction with the simulated environment. Flightmare can be used for various applications, including path planning, reinforcement learning, visual-inertial odometry, deep learning, human-robot interaction, etc.

Open-source Code on GitHub


References

Flightmare_Yunlong

Y. Song, S. Naji, E. Kaufmann, A. Loquercio, D. Scaramuzza

Flightmare: A Flexible Quadrotor Simulator

Conference on Robot Learning (CoRL), 2020

PDF YouTube CoRL 2020 Pitch Video Website



Fisher Information Field: an Efficient and Differentiable Map for Perception-aware Planning


fif_opt

We provide an implementation of the Fisher Information Field (FIF), a map representation designed for perception-aware planning. The core function of the map is to evaluate the visual localization quality at a given 6 DoF pose in a known environment. It can be used with different motion planning algorithms (e.g., RRT*, trajectory optimization) to take localization quality into consideration, in addition to common planning objectives. FIF is efficient: it is >10x faster than using the landmarks directly. It is also differentiable, making it suitable to be used in gradient-based optimization.
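
Roughly speaking, the quantity stored in the map is the Fisher information of a pose given the visible landmarks. The sketch below approximates it by summing J^T J over landmarks, where J is the Jacobian of a pinhole projection with respect to the camera position only; rotation terms, realistic visibility checks, and the efficient field representation of the actual method are omitted, and all numbers are synthetic placeholders.

import numpy as np

def jacobian_wrt_position(p_cam, fx=320.0, fy=320.0):
    """Jacobian of the pinhole projection (u, v) w.r.t. the camera position."""
    x, y, z = p_cam
    J_proj = np.array([[fx / z, 0.0, -fx * x / z ** 2],
                       [0.0, fy / z, -fy * y / z ** 2]])
    return -J_proj          # moving the camera is the negative of moving the point

def position_information(t_wc, landmarks_w):
    info = np.zeros((3, 3))
    for l in landmarks_w:
        p_cam = l - t_wc                        # identity rotation assumed for brevity
        if p_cam[2] > 0.1:                      # crude visibility check
            J = jacobian_wrt_position(p_cam)
            info += J.T @ J
    return info

landmarks = np.random.default_rng(0).uniform([-5, -5, 2], [5, 5, 10], size=(200, 3))
info = position_information(np.zeros(3), landmarks)
print(np.linalg.det(info))  # a larger determinant indicates a better-constrained localization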

Open-source Code on GitHub


References

arXiv20_Zhang_FIF

Z. Zhang, D. Scaramuzza

Fisher Information Field: an Efficient and Differentiable Map for Perception-aware Planning

arXiv preprint, 2020.

PDF Video Code



VIMO: Simultaneous Visual Inertial Model-based Odometry and Force Estimation


For many robotic applications, it is often essential to sense the external force acting on the system due to, for example, interactions, contacts, and disturbances. VIMO extends the capability of a typical optimization-based Visual-Inertial Odometry framework to jointly estimate external forces in addition to the robot state and IMU bias, at no extra computational cost. The results also show up to a 30% increase in the accuracy of the estimator.



References

RSS19_Nisar

B. Nisar, P. Foehn, D. Falanga, D. Scaramuzza

VIMO: Simultaneous Visual Inertial Model-based Odometry and Force Estimation

Robotics: Science and Systems (RSS), Freiburg, 2019

PDF Code YouTube



Deep Drone Acrobatics


The following code allows you to train end-to-end control policies to fly acrobatic maneuvers with drones. Training is done exclusively in simulation with imitation learning from a privileged expert. Thanks to a sensor abstraction procedure, the policies trained in simulation can be applied to a real platform without any fine-tuning on real data!

Code is available at this page. Take care: acrobatic maneuvers might push the platform to its physical limits! This approach was developed in the context of our RSS paper Deep Drone Acrobatics.



References

Deep Drone Acrobatics

Elia Kaufmann*, Antonio Loquercio*, René Ranftl, Matthias Müller, Vladlen Koltun, Davide Scaramuzza

Deep Drone Acrobatics

Robotics: Science and Systems (RSS), 2020.

PDF YouTube Code



Are We Ready for Autonomous Drone Racing? The UZH-FPV Drone Racing Dataset


Despite impressive results in visual-inertial state estimation in recent years, high-speed trajectories with six-degree-of-freedom motion remain challenging for existing estimation algorithms. Aggressive trajectories feature large accelerations and rapid rotational motions, and when they pass close to objects in the environment, this induces large apparent motions in the vision sensors, all of which increase the difficulty of estimation. Existing benchmark datasets do not address these types of trajectories, instead focusing on slow-speed or constrained trajectories, targeting other tasks such as inspection or driving.

We introduce the UZH-FPV Drone Racing dataset, consisting of over 27 sequences, with more than 10 km of flight distance, captured on a first-person-view (FPV) racing quadrotor flown by an expert pilot. The dataset features camera images, inertial measurements, event-camera data, and precise ground truth poses. These sequences are faster and more challenging, in terms of apparent scene motion, than any existing dataset. Our goal is to enable advancement of the state of the art in aggressive motion estimation by providing a dataset that is beyond the capabilities of existing state estimation algorithms.


References

Information field illustration

J. Delmerico, T. Cieslewski, H. Rebecq, M. Faessler, D. Scaramuzza

Are We Ready for Autonomous Drone Racing? The UZH-FPV Drone Racing Dataset

IEEE International Conference on Robotics and Automation (ICRA), 2019.

PDF YouTube Project Webpage and Datasets Code



Video to Events: Recycling Video Dataset for Event Cameras


CVPR20_Gehrig

Event cameras are novel sensors that output brightness changes in the form of a stream of asynchronous "events" instead of intensity frames. They offer significant advantages with respect to conventional cameras: high dynamic range (HDR), high temporal resolution, and no motion blur. Recently, novel learning approaches operating on event data have achieved impressive results. Yet, these methods require a large amount of event data for training, which is hardly available due to the novelty of event sensors in computer vision research. In this paper, we present a method that addresses these needs by converting any existing video dataset recorded with conventional cameras to synthetic event data. This unlocks the use of a virtually unlimited number of existing video datasets for training networks designed for real event data. We evaluate our method on two relevant vision tasks, i.e., object recognition and semantic segmentation, and show that models trained on synthetic events have several benefits: (i) they generalize well to real event data, even in scenarios where standard-camera images are blurry or overexposed, by inheriting the outstanding properties of event cameras; (ii) they can be used for fine-tuning on real data to improve over the state of the art for both classification and semantic segmentation.
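
The conversion rests on the standard event generation model: an event is emitted at a pixel whenever its log intensity has changed by more than a contrast threshold C since the last event there. A minimal NumPy sketch of that model is given below; the real pipeline additionally upsamples the video in time before generating events, which is omitted here, and the threshold value is an arbitrary placeholder.

import numpy as np

def frames_to_events(frames, timestamps, C=0.2, eps=1e-3):
    """frames: list of HxW grayscale arrays in [0, 1]; timestamps in seconds."""
    log_ref = np.log(frames[0] + eps)                 # per-pixel reference log intensity
    events = []                                       # tuples (t, x, y, polarity)
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_img = np.log(frame + eps)
        diff = log_img - log_ref
        while True:                                   # a pixel may emit several events per frame
            pos, neg = diff >= C, diff <= -C
            if not (pos.any() or neg.any()):
                break
            ys, xs = np.nonzero(pos)
            events += [(t, x, y, +1) for x, y in zip(xs, ys)]
            ys, xs = np.nonzero(neg)
            events += [(t, x, y, -1) for x, y in zip(xs, ys)]
            log_ref = np.where(pos, log_ref + C, log_ref)   # update the reference level
            log_ref = np.where(neg, log_ref - C, log_ref)
            diff = log_img - log_ref
    return events

frames = [np.full((4, 4), 0.2), np.full((4, 4), 0.5)]
print(len(frames_to_events(frames, timestamps=[0.0, 0.01])))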


References

CVPR20_Gehrig

Daniel Gehrig, Mathias Gehrig, Javier Hidalgo-Carrio, Davide Scaramuzza

Video to Events: Bringing Modern Computer Vision Closer to Event Cameras

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, 2020.

PDF YouTube CVPR20 Video Pitch Code



GPU-Accelerated Frontend for High-Speed VIO


Arxiv20_Nagy

The recent introduction of powerful embedded GPUs has enabled algorithms to run well above standard video rates, yielding higher information processing capability and reduced latency. This code introduces an enhanced FAST feature detector that applies GPU-specific non-maxima suppression, enforces a spatial feature distribution, and extracts features simultaneously. It runs on most modern NVIDIA CUDA-capable GPUs, but it is further specialized with the Jetson TX2 platform in mind, on which it achieves a throughput of more than 1000 fps.


References

Arxiv20_Nagy

Balazs Nagy, Philipp Foehn, D. Scaramuzza

Faster than FAST: GPU-Accelerated Frontend for High-Speed VIO

IEEE International Conference on Intelligent Robots and Systems (IROS), 2020.

PDF Code IROS20 Video Pitch



Event-Based Angular Velocity Regression with Spiking Networks


SNN

Spiking Neural Networks (SNNs) are bio-inspired networks that process information conveyed as temporal spikes rather than numeric values. These highly-parallelizable event-based networks are a prime candidate to learn patterns of spatio-temporal data as received from event cameras. The following code implements a spiking network that was trained to perform continuous-time regression of angular velocities directly from event-based data.


References

SNN

M. Gehrig, S. Shrestha, D. Mouritzen, D. Scaramuzza

Event-Based Angular Velocity Regression with Spiking Networks

IEEE International Conference on Robotics and Automation (ICRA), 2020

PDF Code



A General Framework for Uncertainty Estimation in Deep Learning


The following code presents a general framework for uncertainty estimation of deep neural network predictions. Our framework can compute uncertainties for every network architecture, does not require changes to the optimization process, and can be applied to already trained networks. Our framework's code is available at this page! This framework was developed in the context of our RA-L and ICRA 2020 paper A General Framework for Uncertainty Estimation in Deep Learning.



References

A General Framework for Uncertainty Estimation in Deep Learning

A. Loquercio, M. Segu, D. Scaramuzza

A General Framework for Uncertainty Estimation in Deep Learning

IEEE Robotics and Automation Letters (RA-L), 2020.

PDF YouTube ICRA2020 Pitch Video Code



Event Camera Driving Sequences


The following driving dataset was recorded in the context of the paper High Speed and High Dynamic Range Video with an Event Camera. The dataset consists of a number of sequences that were recorded with a VGA (640x480) event camera (Samsung DVS Gen3) and a conventional RGB camera (Huawei P20 Pro) placed on the windshield of a car driving through Zurich.

We provide all event sequences as binary rosbag files for use with the Robot Operating System (ROS). The format is the one used by the RPG DVS ROS driver.
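
For reference, the events in such a bag can be read in Python with the ROS rosbag API roughly as follows; the topic name '/dvs/events' is the one commonly used by this driver, and the filename is a placeholder, so both may need to be adapted to the actual sequences.

import rosbag

with rosbag.Bag('zurich_driving_sequence.bag') as bag:      # placeholder filename
    for topic, msg, t in bag.read_messages(topics=['/dvs/events']):
        for e in msg.events:                                # dvs_msgs/EventArray
            x, y, ts, polarity = e.x, e.y, e.ts.to_sec(), e.polarity
            # feed (x, y, ts, polarity) into your processing pipeline here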

The driving datasets are available at this page.



References

High Speed and High Dynamic Range Video with an Event Camera

H. Rebecq, R. Ranftl, V. Koltun, D. Scaramuzza

High Speed and High Dynamic Range Video with an Event Camera

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

PDF YouTube Code Project Page



EKLT: Event-based KLT


Event cameras are revolutionary sensors that work radically differently from standard cameras. Instead of capturing intensity images at a fixed rate, event cameras measure changes of intensity asynchronously, in the form of a stream of events, which encode per-pixel brightness changes. In the last few years, their outstanding properties (asynchronous sensing, no motion blur, high dynamic range) have led to exciting vision applications, with very low latency and high robustness.

We release EKLT, an event-based feature tracker published in our recent IJCV paper. It leverages the complementarity of event cameras and standard cameras to track visual features with low latency and in the blind time between two frames. The code is implemented in C++ and available under this link. We also provide a tool to evaluate event-based feature trackers, with functionality to assess their tracks in real and simulated environments. The evaluation code is implemented in Python and can be used to easily generate paper-ready plots and videos.


References

EKLT: Asynchronous, Photometric Feature Tracking using Events and Frames

D. Gehrig, H. Rebecq, G. Gallego, D. Scaramuzza

EKLT: Asynchronous, Photometric Feature Tracking using Events and Frames

International Journal of Computer Vision (IJCV), 2019.

PDF YouTube Evaluation Code Tracking Code



Deep Drone Racing: From Simulation to Reality with Domain Randomization


DDR

Dynamically changing environments, unreliable state estimation, and operation under severe resource constraints are fundamental challenges for robotics, which still limit the deployment of small autonomous drones. We address these challenges in the context of autonomous, vision-based drone racing in dynamic environments. A racing drone must traverse a track with possibly moving gates at high speed. We enable this functionality by combining the performance of a state-of-the-art path-planning and control system with the perceptual awareness of a convolutional neural network (CNN). The CNN directly maps raw images to a desired waypoint and speed. Given the CNN output, the planner generates a short minimum-jerk trajectory segment that is tracked by a model-based controller to actuate the drone towards the waypoint. The resulting modular system has several desirable features: (i) it can run fully on-board, (ii) it does not require globally consistent state estimation, and (iii) it is both platform and domain independent. We extensively test the precision and robustness of our system, both in simulation and on a physical platform. In both domains, our method significantly outperforms the prior state of the art. In order to understand the limits of our approach, we additionally compare against professional human drone pilots with different skill levels.


References

TRO19_Loquercio

A. Loquercio, E. Kaufmann, R. Ranftl, A. Dosovitskiy, V. Koltun, D. Scaramuzza

Deep Drone Racing: From Simulation to Reality with Domain Randomization

IEEE Transactions on Robotics, 2019

PDF YouTube Code and Data



SIPs: Succinct Interest Points from Unsupervised Inlierness Probability Learning


SIPS

A wide range of computer vision algorithms rely on identifying sparse interest points in images and establishing correspondences between them. However, only a subset of the initially identified interest points results in true correspondences (inliers). In this paper, we seek a detector that finds the minimum number of points that are likely to result in an application-dependent "sufficient" number of inliers k. To quantify this goal, we introduce the "k-succinctness" metric. Extracting a minimum number of interest points is attractive for many applications, because it can reduce computational load, memory, and data transmission. Alongside succinctness, we introduce an unsupervised training methodology for interest point detectors that is based on predicting the probability of a given pixel being an inlier. In comparison to previous learned detectors, our method requires the least amount of data pre-processing. Our detector and other state-of-the-art detectors are extensively evaluated with respect to succinctness on popular public datasets covering both indoor and outdoor scenes, and both wide and narrow baselines. In certain cases, our detector is able to obtain an equivalent number of inliers with as little as 60% of the points required by other detectors.
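
A small sketch of the k-succinctness idea as described above: given interest points ranked by predicted score and a boolean inlier label for each point (e.g., from geometric verification), succinctness is the number of top-ranked points that must be kept to collect k inliers. The inputs below are synthetic placeholders, not the paper's evaluation protocol.

import numpy as np

def k_succinctness(scores, is_inlier, k=10):
    """Number of top-scoring points needed to collect k inliers (None if unreachable)."""
    order = np.argsort(-scores)                 # best-scoring points first
    inlier_counts = np.cumsum(is_inlier[order])
    idx = np.searchsorted(inlier_counts, k)
    return int(idx) + 1 if idx < len(scores) else None

rng = np.random.default_rng(0)
scores = rng.random(500)                        # predicted inlierness scores (synthetic)
is_inlier = rng.random(500) < 0.3               # hypothetical geometric-verification labels
print(k_succinctness(scores, is_inlier, k=10))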


References

SIPs: Succinct Interest Points from Unsupervised Inlierness Probability Learning

T. Cieslewski, K. G. Derpanis, D. Scaramuzza

SIPs: Succinct Interest Points from Unsupervised Inlierness Probability Learning

IEEE International Conference on 3D Vision (3DV), 2019.

PDF Code and Data



Matching Features without Descriptors: Implicitly Matched Interest Points


IMIPS

The extraction and matching of interest points is a prerequisite for many geometric computer vision problems. Traditionally, matching has been achieved by assigning descriptors to interest points and matching points that have similar descriptors. In this paper, we propose a method by which interest points are instead already implicitly matched at detection time. With this, descriptors do not need to be calculated, stored, communicated, or matched any more. This is achieved by a convolutional neural network with multiple output channels and can be thought of as a collection of a variety of detectors, each specialised to specific visual features. This paper describes how to design and train such a network in a way that results in successful relative pose estimation performance despite the limitation on interest point count. While the overall matching score is slightly lower than with traditional methods, the approach is descriptor-free and thus enables localization systems with a significantly smaller memory footprint and multi-agent localization systems with lower bandwidth requirements. The network also outputs the confidence that a specific interest point will result in a valid match. We evaluate performance relative to state-of-the-art alternatives.
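
The sketch below illustrates the implicit-matching mechanism with a stand-in network: a CNN with K output channels yields one interest point per channel (the channel's response argmax), and points coming from the same channel in two images are treated as a match, so no descriptors are computed. The tiny untrained network and image sizes are placeholders, not the paper's architecture.

import torch
import torch.nn as nn

K = 128                                              # number of output channels = interest points
net = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(64, K, 3, padding=1))  # stand-in for the trained detector

def interest_points(image):
    """One (x, y) point per output channel: the channel-wise response argmax."""
    scores = net(image)[0]                           # K x H x W response maps
    k, h, w = scores.shape
    flat = scores.view(k, -1).argmax(dim=1)
    xs = flat % w
    ys = torch.div(flat, w, rounding_mode='floor')
    return torch.stack([xs, ys], dim=1)              # K x 2 pixel coordinates

img_a, img_b = torch.rand(1, 1, 240, 320), torch.rand(1, 1, 240, 320)
pts_a, pts_b = interest_points(img_a), interest_points(img_b)
matches = list(zip(pts_a.tolist(), pts_b.tolist()))  # channel k in A is matched to channel k in B
print(matches[0])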


References

IMIPs

T. Cieslewski, M. Bloesch, D. Scaramuzza

Matching Features without Descriptors: Implicitly Matched Interest Points

British Machine Vision Conference (BMVC), Cardiff, 2019.

PDF Code and Data



High Speed and High Dynamic Range with an Event Camera


Event cameras are novel sensors that report brightness changes in the form of a stream of asynchronous events instead of intensity frames. They offer significant advantages with respect to conventional cameras: high temporal resolution, high dynamic range, and no motion blur. While the stream of events encodes in principle the complete visual signal, the reconstruction of an intensity image from a stream of events is an ill-posed problem in practice. Existing reconstruction approaches are based on hand-crafted priors and strong assumptions about the imaging process as well as the statistics of natural images.

In this work we propose to learn to reconstruct intensity images from event streams directly from data instead of relying on any hand-crafted priors. We propose a novel recurrent network to reconstruct videos from a stream of events, and train it on a large amount of simulated event data. During training we propose to use a perceptual loss to encourage reconstructions to follow natural image statistics. We further extend our approach to synthesize color images from color event streams.

Our quantitative experiments show that our network surpasses state-of-the-art reconstruction methods by a large margin in terms of image quality (> 20%), while comfortably running in real-time. We show that the network is able to synthesize high framerate videos (> 5,000 frames per second) of high-speed phenomena (e.g. a bullet hitting an object) and is able to provide high dynamic range reconstructions in challenging lighting conditions. As an additional contribution, we demonstrate the effectiveness of our reconstructions as an intermediate representation for event data. We show that off-the-shelf computer vision algorithms can be applied to our reconstructions for tasks such as object classification and visual-inertial odometry and that this strategy consistently outperforms algorithms that were specifically designed for event data. We release the reconstruction code and a pre-trained model to enable further research.
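As an illustration of how an event stream can be tensorized before being fed to such a recurrent network, below is a minimal sketch of a voxel-grid representation with bilinear weights in time; the exact input format expected by the released model may differ.

```python
# Sketch of a common event tensorization: a voxel grid over (time, y, x)
# where each event votes into its two neighbouring temporal bins.
import numpy as np

def events_to_voxel_grid(xs, ys, ts, ps, num_bins, H, W):
    """xs, ys: integer pixel coords; ts: timestamps; ps: polarities in {-1, +1}."""
    grid = np.zeros((num_bins, H, W), dtype=np.float32)
    t_norm = (ts - ts[0]) / max(ts[-1] - ts[0], 1e-9) * (num_bins - 1)
    left = np.floor(t_norm).astype(int)
    right = np.clip(left + 1, 0, num_bins - 1)
    w_right = t_norm - left
    np.add.at(grid, (left,  ys, xs), ps * (1.0 - w_right))
    np.add.at(grid, (right, ys, xs), ps * w_right)
    return grid
```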


References

High Speed and High Dynamic Range Video with an Event Camera

H. Rebecq, R. Ranftl, V. Koltun, D. Scaramuzza

High Speed and High Dynamic Range Video with an Event Camera

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

PDF YouTube Code Project Page



CED: Color Event Camera Dataset


Event cameras are novel, bio-inspired visual sensors, whose pixels output asynchronous and independent timestamped spikes at local intensity changes, called "events". Event cameras offer advantages over conventional frame-based cameras in terms of latency, high dynamic range (HDR), and temporal resolution. Until recently, event cameras have been limited to outputting events in the intensity channel; however, recent advances have resulted in the development of color event cameras, such as the Color DAVIS346.

In this work, we present and release the first Color Event Camera Dataset (CED), containing 50 minutes of footage with both color frames and events. CED features a wide variety of indoor and outdoor scenes, which we hope will help drive forward event-based vision research. We also present an extension of the event camera simulator ESIM that enables simulation of color events. Finally, we present an evaluation of three state-of-the-art image reconstruction methods that can be used to convert the Color DAVIS346 into a continuous-time, HDR, color video camera to visualise the event stream, and for use in downstream vision applications.


References

CED_image

C. Scheerlinck*, H. Rebecq*, T. Stoffregen, N. Barnes, R. Mahony, D. Scaramuzza

CED: Color Event Camera Dataset

IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.

PDF YouTube Dataset



Event-based, Direct Camera Tracking from a Photometric 3D Map using Nonlinear Optimization


Event cameras are revolutionary sensors that work radically differently from standard cameras. Instead of capturing intensity images at a fixed rate, event cameras measure changes of intensity asynchronously, in the form of a stream of events, which encode per-pixel brightness changes. In the last few years, their outstanding properties (asynchronous sensing, no motion blur, high dynamic range) have led to exciting vision applications, with very low-latency and high robustness.

We release the datasets and photometric 3D maps used to evaluate our direct event-camera tracking algorithms. Every dataset consists of one or more trajectories of an event camera (stored as a rosbag) and corresponding photometric 3D map in the form of a point cloud for real data and a textured mesh for simulated scenes. All datasets contain ground truth provided by a motion capture system (for indoor recordings), SVO (for outdoor ones) or the simulator itself. The respective calibration data is provided as well (both the raw data used for calibration as well as the resulting intrinsic and extrinsic parameters).
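For convenience, here is a minimal sketch of how one of the released rosbags could be inspected with the ROS Python API; the event topic name ('/dvs/events') and the dvs_msgs/EventArray layout are assumptions to be adapted to the actual bag contents.

```python
# Sketch: print the events of the first event message in a bag.
# Topic name and message layout are assumed (rpg_dvs_ros conventions).
import rosbag

with rosbag.Bag('trajectory.bag') as bag:
    for topic, msg, t in bag.read_messages(topics=['/dvs/events']):
        for e in msg.events:                 # each event carries x, y, ts, polarity
            print(e.x, e.y, e.ts.to_sec(), e.polarity)
        break                                # only the first message, for brevity
```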


References

Pose tracking with an Event-based camera using non-linear optimization

S. Bryner, G. Gallego, H. Rebecq, D. Scaramuzza

Event-based, Direct Camera Tracking from a Photometric 3D Map using Nonlinear Optimization

IEEE International Conference on Robotics and Automation (ICRA), 2019.

PDF YouTube Project Webpage and Datasets



EMVS: Event-based Multi-View Stereo


EMVS: Event-based Multi-View Stereo

Event cameras are revolutionary sensors that work radically differently from standard cameras. Instead of capturing intensity images at a fixed rate, event cameras measure changes of intensity asynchronously, in the form of a stream of events, which encode per-pixel brightness changes. In the last few years, their outstanding properties (asynchronous sensing, no motion blur, high dynamic range) have led to exciting vision applications, with very low-latency and high robustness.

We release the code of our Event-based Multi-View Stereo (EMVS) method, that is, code for 3D reconstruction with a moving event camera. Our method elegantly exploits two inherent properties of event cameras: (1) their ability to respond to scene edges (which naturally provide semi-dense geometric information) and (2) the fact that they provide continuous measurements as they move. The code is implemented in C++ and produces accurate, semi-dense depth maps without requiring any explicit data association or intensity estimation. The code is computationally efficient and runs in real time on a CPU.
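A highly simplified sketch of the ray-voting idea follows (not the released C++ implementation, which votes into a disparity space image of a virtual reference view): back-project each event along its viewing ray and accumulate votes in a discretized volume whose local maxima indicate scene structure. A single pose is reused for all events here purely for brevity.

```python
# Sketch: vote event viewing rays into a world-frame voxel grid.
import numpy as np

def vote_events_into_voxel_grid(events_xy, K_inv, T_wc, grid, origin, voxel_size,
                                depth_min=0.5, depth_max=5.0, steps=100):
    """events_xy: iterable of (u, v) pixels; T_wc: 4x4 camera-to-world pose;
    grid: integer vote volume; origin/voxel_size define its placement."""
    R, t = T_wc[:3, :3], T_wc[:3, 3]
    for u, v in events_xy:
        ray_c = K_inv @ np.array([u, v, 1.0])            # ray in camera frame
        ray_w = R @ (ray_c / np.linalg.norm(ray_c))      # unit ray in world frame
        for d in np.linspace(depth_min, depth_max, steps):
            p = t + d * ray_w                            # 3D point on the ray
            idx = np.floor((p - origin) / voxel_size).astype(int)
            if np.all(idx >= 0) and np.all(idx < np.array(grid.shape)):
                grid[tuple(idx)] += 1                    # accumulate ray density
```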


References

3D reconstruction with an Event-based camera in real-time

H. Rebecq, G. Gallego, E. Mueggler, D. Scaramuzza

EMVS: Event-Based Multi-View Stereo - 3D Reconstruction with an Event Camera in Real-Time

International Journal of Computer Vision, 2017.

PDF YouTube Source Code



ESIM: an Open Event Camera Simulator


Event cameras are revolutionary sensors that work radically differently from standard cameras. Instead of capturing intensity images at a fixed rate, event cameras measure changes of intensity asynchronously, in the form of a stream of events, which encode per-pixel brightness changes. In the last few years, their outstanding properties (asynchronous sensing, no motion blur, high dynamic range) have led to exciting vision applications, with very low-latency and high robustness.

We present ESIM: an efficient event camera simulator implemented in C++ and available open source. ESIM can simulate arbitrary camera motion in 3D scenes while providing events, standard images, and inertial measurements, together with full ground-truth information including camera pose and velocity, as well as depth and optical flow maps.
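As an illustration of the underlying principle, here is a minimal sketch of the contrast-threshold event generation model used by event camera simulators; a single noiseless global threshold and per-pixel linear interpolation of the crossing time are simplifying assumptions, not the full ESIM model.

```python
# Sketch: emit an event whenever the per-pixel log intensity moves by more
# than the contrast threshold C since the last emitted event at that pixel.
import numpy as np

def generate_events(logI_prev, logI_curr, t_prev, t_curr, ref, C=0.2):
    """ref: per-pixel log intensity at the last emitted event (updated in place).
    Returns a time-sorted list of (t, x, y, polarity) tuples."""
    events = []
    H, W = logI_curr.shape
    for y in range(H):
        for x in range(W):
            delta = logI_curr[y, x] - ref[y, x]
            while abs(delta) >= C:
                pol = 1 if delta > 0 else -1
                ref[y, x] += pol * C
                # linearly interpolate the threshold-crossing time in [t_prev, t_curr]
                frac = (ref[y, x] - logI_prev[y, x]) / (logI_curr[y, x] - logI_prev[y, x] + 1e-12)
                t = t_prev + float(np.clip(frac, 0.0, 1.0)) * (t_curr - t_prev)
                events.append((t, x, y, pol))
                delta = logI_curr[y, x] - ref[y, x]
    return sorted(events)
```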

Project page


References

CORL18_Rebecq

H. Rebecq, D. Gehrig, D. Scaramuzza

ESIM: an Open Event Camera Simulator

Conference on Robot Learning (CoRL), Zurich, 2018.

PDF YouTube Project page



Semi-Dense 3D Reconstruction with a Stereo Event Camera


Event cameras are bio-inspired sensors that offer several advantages, such as low latency, high-speed and high dynamic range, to tackle challenging scenarios in computer vision. This paper presents a solution to the problem of 3D reconstruction from data captured by a stereo event-camera rig moving in a static scene, such as in the context of stereo Simultaneous Localization and Mapping.

We release the datasets used to evaluate our stereo event-based 3D reconstruction method. Every dataset consists of events from two DAVIS cameras (stored as a rosbag) and the ground-truth camera trajectory provided by a motion capture system or by a simulator. The respective calibration data is also provided (both the raw data used for calibration and the resulting intrinsic and extrinsic parameters).


References

Semi-Dense 3D Reconstruction with a Stereo Event Camera

Y. Zhou, G. Gallego, H. Rebecq, L. Kneip, H. Li, D. Scaramuzza

Semi-Dense 3D Reconstruction with a Stereo Event Camera

European Conference on Computer Vision (ECCV), Munich, 2018.

PDF Poster YouTube Project page and Data



A Tutorial on Quantitative Trajectory Evaluation for Visual(-Inertial) Odometry


In this tutorial, we provide principled methods to quantitatively evaluate the quality of an estimated trajectory from visual(-inertial) odometry (VO/VIO), which is the foundation of benchmarking the accuracy of different algorithms. First, we show how to determine the transformation type to use in trajectory alignment based on the specific sensing modality (i.e., monocular, stereo and visual-inertial). Second, we describe commonly used error metrics (i.e., the absolute trajectory error and the relative error) and their strengths and weaknesses. To make the methodology presented for VO/VIO applicable to other setups, we also generalize our formulation to any given sensing modality. To facilitate the reproducibility of related research, we publicly release our implementation of the methods described in this tutorial.
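As a reference, here is a minimal sketch of the absolute trajectory error (ATE) after a least-squares Sim(3)/SE(3) alignment (Umeyama); the released toolbox implements this, the relative error, and the modality-dependent alignment choices in full.

```python
# Sketch: Umeyama alignment of estimated to ground-truth positions, then ATE RMSE.
import numpy as np

def align_umeyama(est, gt, with_scale=False):
    """est, gt: (N, 3) position arrays. Returns s, R, t minimizing ||gt - (s R est + t)||."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E / len(est))     # cross-covariance SVD
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:     # avoid reflections
        S[2, 2] = -1
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / E.var(0).sum() if with_scale else 1.0
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt, with_scale=False):
    s, R, t = align_umeyama(est, gt, with_scale)
    err = gt - (s * (R @ est.T).T + t)               # residuals after alignment
    return np.sqrt((err ** 2).sum(axis=1).mean())
```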

Open-Source Code


References

Trajectory Evaluation

Z. Zhang, D. Scaramuzza

A Tutorial on Quantitative Trajectory Evaluation for Visual(-Inertial) Odometry

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, 2018.

PDF VO/VIO Evaluation Toolbox



On the Comparison of Gauge Freedom Handling in Optimization-based Visual-Inertial State Estimation


It is well known that visual-inertial state estimation is possible up to a four degrees-of-freedom (DoF) transformation (rotation around gravity and translation), and the extra DoFs ("gauge freedom") have to be handled properly. While different approaches for handling the gauge freedom have been used in practice, no previous study has been carried out to systematically analyze their differences. In this paper, we present the first comparative analysis of different methods for handling the gauge freedom in optimization-based visual-inertial state estimation. We experimentally compare three commonly used approaches: fixing the unobservable states to some given values, setting a prior on such states, or letting the states evolve freely during optimization. Specifically, we show that (i) the accuracy and computational time of the three methods are similar, with the free gauge approach being slightly faster; (ii) the covariance estimation from the free gauge approach appears dramatically different, but is actually tightly related to the other approaches. Our findings are validated both in simulation and on real-world datasets and can be useful for designing optimization-based visual-inertial state estimation algorithms.
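For intuition, here is a small LaTeX sketch of the unobservable 4-DoF transformation (a yaw rotation about gravity plus a global translation) that leaves the visual-inertial cost unchanged; the exact state parametrization used in the paper may differ.

```latex
% Applying a yaw rotation R_z(theta) about gravity and a translation t to all
% states leaves every visual and inertial residual unchanged:
\begin{aligned}
  \mathbf{p}_i &\mapsto \mathbf{R}_z(\theta)\,\mathbf{p}_i + \mathbf{t}, \qquad
  \mathbf{R}_i \mapsto \mathbf{R}_z(\theta)\,\mathbf{R}_i, \qquad
  \mathbf{v}_i \mapsto \mathbf{R}_z(\theta)\,\mathbf{v}_i,
\end{aligned}
% for all keyframe positions p_i, orientations R_i, and velocities v_i, with
% arbitrary theta and t. Gauge fixation removes exactly these 4 DoFs, e.g.,
% by fixing p_0 and the yaw of R_0, by a prior, or by leaving them free.
```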

Open-Source Code for Covariance Transformation


References

Gauge Comparison

Z. Zhang, G. Gallego, D. Scaramuzza

On the Comparison of Gauge Freedom Handling in Optimization-based Visual-Inertial State Estimation

IEEE Robotics and Automation Letters (RA-L), 2018.

PDF PPT Code



DroNet: Learning to Fly by Driving


Civilian drones are soon expected to be used in a wide variety of tasks, such as aerial surveillance, delivery, or monitoring of existing architectures. Nevertheless, their deployment in urban environments has so far been limited. Indeed, in unstructured and highly dynamic scenarios drones face numerous challenges to navigate autonomously in a feasible and safe way. In contrast to traditional map-localize-plan methods, this paper explores a data-driven approach to cope with the above challenges. To do this, we propose DroNet, a convolutional neural network that can safely drive a drone through the streets of a city. Designed as a fast 8-layer residual network, DroNet produces, for each input image, two outputs: a steering angle, to keep the drone navigating while avoiding obstacles, and a collision probability, to let the UAV recognize dangerous situations and promptly react to them. But how can one collect enough data in an unstructured outdoor environment, such as a city? Clearly, having an expert pilot provide training trajectories is not an option given the large amount of data required and, above all, the risk it poses to other vehicles and pedestrians moving in the streets. Therefore, we propose to train a UAV from data collected by cars and bicycles, which, being already integrated into urban environments, expose other vehicles and pedestrians to no additional danger. Although trained on city streets from the viewpoint of urban vehicles, the navigation policy learned by DroNet is highly generalizable. Indeed, it allows a UAV to successfully fly at relatively high altitudes and even in indoor environments, such as parking lots and corridors.
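As an illustration only (not the released architecture or weights), here is a minimal PyTorch sketch of an 8-layer residual network with the two DroNet outputs; layer sizes and the input resolution are assumptions.

```python
# Sketch: small residual CNN with a steering-regression head and a
# collision-probability head, in the spirit of DroNet.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(c_in), nn.ReLU(),
            nn.Conv2d(c_in, c_out, 3, stride, 1),
            nn.BatchNorm2d(c_out), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, 1, 1))
        self.skip = nn.Conv2d(c_in, c_out, 1, stride)

    def forward(self, x):
        return self.conv(x) + self.skip(x)

class DroNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(1, 32, 5, 2, 2), nn.MaxPool2d(3, 2))
        self.res = nn.Sequential(ResBlock(32, 32), ResBlock(32, 64), ResBlock(64, 128))
        self.head_steer = nn.Linear(128, 1)        # regression: steering angle
        self.head_coll = nn.Linear(128, 1)         # classification: collision prob.

    def forward(self, x):                          # x: (B, 1, 200, 200) grayscale
        f = torch.relu(self.res(self.stem(x))).mean(dim=(2, 3))  # global avg pool
        return self.head_steer(f), torch.sigmoid(self.head_coll(f))
```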


References

DroNet: Learning to Fly by Driving

A. Loquercio, A.I. Maqueda, C.R. Del Blanco, D. Scaramuzza

DroNet: Learning to Fly by Driving

IEEE Robotics and Automation Letters (RA-L), 2018.

PDF YouTube Software and Datasets



RPG Quadrotor MPC and Perception-Aware MPC


We released an implementation of our Model Predictive Control (MPC) framework, which integrates with our open-source Quadrotor Control Framework. It is capable of predicting and optimizing over a receding horizon to control a quadrotor towards a reference pose or along a reference trajectory. Furthermore, it includes our PAMPC, based on our publication "PAMPC: Perception-Aware Model Predictive Control", which combines optimization for action and perception objectives. It uses the reprojection of a point of interest in the camera frame as a cost to keep it visible during complicated maneuvers. This allows not only delegating the yaw control to PAMPC, but also modifying and reshaping trajectories to keep points of interest visible while planning within the dynamics and actuation limits of the platform. The implementation uses ROS on Linux and is optimized to run in real time on an ARM computer, such as an Odroid XU4. It has a latency of <2 ms (from calling an iteration to availability of the control command) and needs an overall processing time of <5 ms. Typically it runs at 50-100 Hz, returning bodyrates and collective thrust to a faster low-level controller (e.g., a commercial flight controller), as used in our platform example. We use the ACADO Toolkit, developed by the Optimization in Engineering Center (OPTEC) under the supervision of Moritz Diehl. Our source code is released under GPLv3.
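As an illustration of the perception objective, here is a minimal sketch of a reprojection-based cost term that penalizes the offset of a point of interest from the image center; the intrinsics, extrinsics, and weighting are placeholders, and the actual ACADO formulation differs.

```python
# Sketch: perception cost = squared pixel offset of a 3D point of interest
# from the principal point, given the current body pose.
import numpy as np

def perception_cost(p_w, T_wb, T_bc, K, w=1.0):
    """p_w: 3D point of interest (world); T_wb: 4x4 body pose; T_bc: body-to-camera."""
    T_wc = T_wb @ T_bc
    p_c = np.linalg.inv(T_wc) @ np.append(p_w, 1.0)   # point in camera frame
    if p_c[2] <= 0:                                    # behind the camera: large penalty
        return w * 1e3
    uv = (K @ (p_c[:3] / p_c[2]))[:2]                  # pixel coordinates
    center = np.array([K[0, 2], K[1, 2]])              # principal point
    return w * float(np.sum((uv - center) ** 2))       # squared offset from center
```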

Open-Source Code


References

PAMPC

D. Falanga, P. Foehn, P. Lu, D. Scaramuzza

PAMPC: Perception-Aware Model Predictive Control for Quadrotors

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, 2018.

PDF (arXiv) YouTube Code



NetVLAD in Python/TensorFlow


NetVLAD (website, paper) is a place recognition neural network which takes an image as input and produces a vector as output. If two images are taken in the same place, the Euclidean distance between these vectors is small; otherwise it is not. Using nearest-neighbour search on these vectors, the authors have shown excellent place recognition performance, even under severe appearance changes.
Unfortunately, the full network has so far officially been implemented only in Matlab, rendering deployment on non-desktop PCs and robots tedious.
We are happy to announce a Python/TensorFlow port of the FULL network, approved by the original authors and available here. The repository contains code which allows plug-and-play Python deployment of the best off-the-shelf model made available by the authors. We have thoroughly tested that the ported model produces output similar to the original Matlab implementation, as well as excellent place recognition performance on KITTI 00. The repository does not contain code to train the network; however, it should be easy to adapt to other models trained in Matlab. In our own research, we have previously used NetVLAD here and here, and will continue to use it extensively.
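As an illustration of the typical downstream use of the descriptors produced by the port, here is a minimal retrieval sketch (L2 normalization followed by nearest-neighbour search); `query` and `db` are assumed to be descriptors obtained from the network's forward pass.

```python
# Sketch: nearest-neighbour place recognition over NetVLAD-style descriptors.
import numpy as np

def retrieve(query, db, top_k=5):
    """query: (D,) descriptor; db: (N, D) database descriptors."""
    q = query / np.linalg.norm(query)
    D = db / np.linalg.norm(db, axis=1, keepdims=True)
    dist = np.linalg.norm(D - q, axis=1)          # Euclidean distances
    return np.argsort(dist)[:top_k]               # indices of the best matches
```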

Open-Source Code


References

CVPR16_Arandjelovic

R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, J. Sivic

NetVLAD: CNN architecture for weakly supervised place recognition

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

PDF Website Matlab Python/TF



Data-Efficient Decentralized Visual SLAM


Decentralized visual simultaneous localization and mapping (SLAM) is a powerful tool for multi-robot applications in environments where absolute positioning is not available. Being visual, it relies on cheap, lightweight and versatile cameras, and, being decentralized, it does not rely on communication to a central entity. In this work, we integrate state-of-the-art decentralized SLAM components into a new, complete decentralized visual SLAM system. To allow for data association and optimization, existing decentralized visual SLAM systems exchange the full map data among all robots, incurring large data transfers at a complexity that scales quadratically with the robot count. In contrast, our method performs efficient data association in two stages: first, a compact full-image descriptor is deterministically sent to only one robot. Then, only if the first stage succeeded, the data required for relative pose estimation is sent, again to only one robot. Thus, data association scales linearly with the robot count and uses highly compact place representations. For optimization, a state-of-the-art decentralized pose-graph optimization method is used. It exchanges a minimum amount of data which is linear with trajectory overlap. We characterize the resulting system and identify bottlenecks in its components. The system is evaluated on publicly available datasets and we provide open access to the code.

Open-Source Code


References

ICRA18_Cieslewski

T. Cieslewski, S. Choudhary, D. Scaramuzza

Data-Efficient Decentralized Visual SLAM

IEEE International Conference on Robotics and Automation (ICRA), 2018.

PDF ICRA18 Video Pitch PPT Code and Data



Fast Event-based Corner Detection


Inspired by frame-based pre-processing techniques that reduce an image to a set of features, which are typically the input to higher-level algorithms, we propose a method to reduce an event stream to a corner event stream. Our goal is twofold: extract relevant tracking information (corners do not suffer from the aperture problem) and decrease the event rate for later processing stages. Our event-based corner detector is very efficient due to its design principle, which consists of working on the Surface of Active Events (a map with the timestamp of the latest event at each pixel) using only comparison operations. Our method processes the stream asynchronously, event by event, with very low latency. Our implementation is capable of processing millions of events per second on a single core (less than a microsecond per event) and reduces the event rate by a factor of 10 to 20.
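A much-simplified sketch of the idea follows (not the released detector, which checks contiguous arcs on two circles of radius 3 and 4): update the Surface of Active Events and run a toy arc test on a single 8-pixel ring using only comparisons.

```python
# Sketch: SAE update plus a toy comparison-based corner test on one ring.
import numpy as np

RING = [(-1, -1), (0, -1), (1, -1), (1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0)]

def process_event(sae, x, y, t, k=3, k_max=6):
    """Update the SAE with event (x, y, t) and return a toy corner flag."""
    sae[y, x] = t
    H, W = sae.shape
    if not (1 <= x < W - 1 and 1 <= y < H - 1):
        return False
    ring = np.array([sae[y + dy, x + dx] for dx, dy in RING])
    newest = ring >= np.sort(ring)[-k]          # the k most recent ring pixels
    if newest.sum() > k_max:                    # flat timestamp region: reject
        return False
    run, best = 0, 0
    for v in np.concatenate([newest, newest]):  # wrap-around run length
        run = run + 1 if v else 0
        best = max(best, run)
    return best >= k                            # corner if they form one arc
```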

Open-Source Code


References

BMVC17_Mueggler

E. Mueggler, C. Bartolozzi, D. Scaramuzza

Fast Event-based Corner Detection

British Machine Vision Conference (BMVC), London, 2017.

PDF Poster YouTube Open-Source Code



RPG Quadrotor Control Framework


We provide a complete framework for flying quadrotors based on control algorithms developed by the Robotics and Perception Group. We also provide an interface to the RotorS Gazebo plugins to use our algorithms in simulation. Together with the provided simple trajectory generation library, this can be used to test and run our software in simulation only. We also provide utilities to command a quadrotor with a gamepad through our framework, as well as calibration routines to compensate for varying battery voltage. Finally, we provide an interface to communicate with flight controllers used for first-person-view racing.

Project Page


SVO 2.0: Semi-Direct Visual Odometry


We provide the binaries of our semi-direct visual odometry algorithm, SVO 2.0. It can run at up to 400 frames per second on a modern laptop and at 60 frames per second on a smartphone processor. The provided binaries support different camera models (pinhole, fisheye, and catadioptric) and setups (monocular, stereo).





Binaries Download


Image Reconstruction from an Event Camera


We provide code for brightness image reconstruction from a rotating event camera. For simplicity, we assume that the orientation of the camera is given, e.g., it is provided by a pose-tracking algorithm or by ground truth camera poses. The algorithm uses a per-pixel Extended Kalman Filter (EKF) approach to estimate the brightness image or gradient map that caused the events.
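As an illustration of the per-pixel filtering idea, here is a minimal scalar Kalman update sketch; the measurement model, noise values, and the gradient-versus-brightness parametrization of the released code are not reproduced here.

```python
# Sketch: one scalar Kalman update per pixel, applied whenever an event at
# (x, y) yields a derived measurement z with variance r.
import numpy as np

def kalman_update(mu, var, z, r):
    """Fuse state (mu, var) with measurement z of variance r; return new (mu, var)."""
    k = var / (var + r)                      # Kalman gain
    return mu + k * (z - mu), (1.0 - k) * var

H, W = 180, 240                              # example sensor resolution
mu = np.zeros((H, W))                        # estimated gradient (or brightness)
var = np.full((H, W), 1.0)                   # per-pixel estimation uncertainty
# per event:  mu[y, x], var[y, x] = kalman_update(mu[y, x], var[y, x], z, r)
```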

Open-Source Code


Event Lifetime


The lifetime of an event is the time that it takes for the moving brightness gradient causing the event to travel a distance of 1 pixel. The provided algorithm augments each event with its lifetime, which is computed from the event's velocity on the image plane. The generated stream of augmented events gives a continuous representation of events in time, hence enabling the design of new algorithms that outperform those based on the accumulation of events over fixed, artificially-chosen time intervals. A direct application of this augmented stream is the construction of sharp gradient (edge-like) images at any time instant.
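Here is a minimal sketch of the lifetime definition itself; how the image-plane velocity is estimated (e.g., by local plane fitting on the surface of active events) is not shown.

```python
# Sketch: lifetime = time for the brightness gradient to travel one pixel,
# i.e. the inverse of the image-plane speed at the event location.
import numpy as np

def event_lifetime(vx, vy, eps=1e-9):
    """vx, vy: estimated optical flow at the event location, in pixels/second."""
    speed = np.hypot(vx, vy)
    return 1.0 / max(speed, eps)             # seconds to move one pixel

print(event_lifetime(200.0, 0.0))            # an edge at 200 px/s -> 0.005 s
```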

Open-Source Code


References

ICRA2015_Mueggler

E. Mueggler, C. Forster, N. Baumli, G. Gallego, D. Scaramuzza

Lifetime Estimation of Events from Dynamic Vision Sensors

IEEE International Conference on Robotics and Automation (ICRA), Seattle, 2015.

PDF Code



The Zurich Urban Micro Aerial Vehicle Dataset


This dataset is the world's first collection of data recorded with a camera-equipped drone in urban streets at low altitudes (5-15 m). The 2 km dataset consists of time-synchronized aerial high-resolution images, GPS and IMU data, ground-level Google Street View images, and ground truth, for a total of 28 GB of data. The dataset is ideal to evaluate and benchmark appearance-based localization, monocular visual odometry, simultaneous localization and mapping (SLAM), and online 3D reconstruction algorithms for MAVs in urban environments.

More information on the dataset website.


References

IJRR_Majdik

A.L. Majdik, C. Till, D. Scaramuzza

The Zurich Urban Micro Aerial Vehicle Dataset

International Journal of Robotics Research, April 2017

PDF YouTube Dataset



The Event-Camera Dataset and Simulator


This is the world's first collection of datasets recorded with an event-based camera for high-speed robotics. The data also include intensity images, inertial measurements, and ground truth from a motion-capture system. An event-based camera is a revolutionary vision sensor with three key advantages: a measurement rate that is almost 1 million times faster than standard cameras, a latency of 1 microsecond, and a high dynamic range of 130 decibels (standard cameras only have 60 dB). These properties enable the design of a new class of algorithms for high-speed robotics, where standard cameras suffer from motion blur and high latency. All the data are released both as text files and binary (i.e., rosbag) files.

More information on the dataset website.


References

IJRR_Mueggler

E. Mueggler, H. Rebecq, G. Gallego, T. Delbruck, D. Scaramuzza

The Event-Camera Dataset and Simulator: Event-based Data for Pose Estimation, Visual Odometry, and SLAM

International Journal of Robotics Research, Vol. 36, Issue 2, pages 142-149, Feb. 2017.

PDF (arXiv) YouTube Dataset



Information Gain Based Active Reconstruction Framework


The Information Gain Based Active Reconstruction Framework is a modular, robot-agnostic software package for performing next-best-view planning for volumetric object reconstruction using a range sensor. Our implementation can be easily adapted to any mobile robot equipped with any camera-based range sensor (e.g., stereo camera, structured-light sensor) to iteratively observe an object to generate a volumetric map and a point cloud model. The algorithm allows the user to define the information gain metric for choosing the next best view, and many formulations for these metrics are evaluated and compared in our ICRA paper. This framework is released open source as a ROS-compatible package for autonomous 3D reconstruction tasks.
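As an illustration of one possible information gain formulation (several are compared in the paper), here is a minimal entropy-based sketch; `visible_voxel_probs` is a hypothetical stand-in for the ray-casting/visibility computation.

```python
# Sketch: pick the candidate view whose visible voxels carry the most
# remaining uncertainty (sum of binary occupancy entropies).
import numpy as np

def voxel_entropy(p):
    p = np.clip(p, 1e-6, 1.0 - 1e-6)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def next_best_view(candidate_views, visible_voxel_probs):
    """visible_voxel_probs(view) -> occupancy probabilities of the voxels it sees."""
    gains = [voxel_entropy(visible_voxel_probs(v)).sum() for v in candidate_views]
    return candidate_views[int(np.argmax(gains))]
```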

Download the code from GitHub.

Check out a video of the system in action on YouTube.


References

ICRA2016_Isler

S. Isler, R. Sabzevari, J. Delmerico, D. Scaramuzza

An Information Gain Formulation for Active Volumetric 3D Reconstruction

IEEE International Conference on Robotics and Automation (ICRA), Stockholm, 2016.

PDF YouTube Software


Fisheye and Catadioptric Synthetic Datasets for Visual Odometry


We provide two synthetic scenes (a vehicle moving in a city, and a flying robot hovering in a confined room). For each scene, three different optics were used (perspective, fisheye, and catadioptric), but the same sensor was used (keeping the image resolution constant). These datasets were generated in Blender with a custom omnidirectional camera model, which we release as an open-source patch for Blender.

Download the datasets from here.


References

ICRA16_Zhang

Z. Zhang, H. Rebecq, C. Forster, D. Scaramuzza

Benefit of Large Field-of-View Cameras for Visual Odometry

IEEE International Conference on Robotics and Automation (ICRA), Stockholm, 2016.

PDF YouTube Research page (datasets and software) C++ omnidirectional camera model



Indoor Dataset of Quadrotor with Down-Looking Camera


This dataset contains recordings of the raw images, IMU measurements, and ground-truth poses of a quadrotor flying a circular trajectory in an office-sized environment.

Download dataset


REMODE: Real-time, Probabilistic, Monocular, Dense Reconstruction


REMODE is a novel method to estimate dense and accurate depth maps from a single moving camera. A probabilistic depth measurement is carried out in real time on a per-pixel basis, and the computed uncertainty is used to reject erroneous estimations and provide live feedback on the reconstruction progress. REMODE uses a novel approach to depth map computation that combines Bayesian estimation and recent developments in convex optimization for image processing. In the reference paper below, we demonstrate that our method outperforms state-of-the-art techniques in terms of accuracy, while exhibiting high efficiency in memory usage and computing power. Our CUDA-based implementation runs at 50 Hz on a laptop computer and is released as open-source software (code here).
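As an illustration of the per-pixel Bayesian update, here is a minimal Gaussian depth-fusion sketch; REMODE actually models depth with a Gaussian plus uniform-outlier mixture and additionally smooths the map with convex optimization.

```python
# Sketch: recursive per-pixel fusion of triangulated depth measurements,
# assuming purely Gaussian estimates (a simplification of the real model).
def fuse_depth(mu, sigma2, z, tau2):
    """Fuse current estimate N(mu, sigma2) with measurement z of variance tau2."""
    new_sigma2 = sigma2 * tau2 / (sigma2 + tau2)           # product of Gaussians
    new_mu = (tau2 * mu + sigma2 * z) / (sigma2 + tau2)
    return new_mu, new_sigma2

# example: a confident new measurement pulls the estimate and shrinks its variance
print(fuse_depth(mu=2.0, sigma2=0.5, z=1.8, tau2=0.1))
```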

Download the code from GitHub.


References

ICRA2014_Pizzoli

M. Pizzoli, C. Forster, D. Scaramuzza

REMODE: Probabilistic, Monocular Dense Reconstruction in Real Time

IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, 2014.

PDF YouTube Software


SVO: Semi-direct Visual Odometry


SVO is a semi-direct, monocular Visual Odometry algorithm that is precise, robust, and faster than current state-of-the-art methods. The semi-direct approach eliminates the need for costly feature extraction and robust matching techniques for motion estimation. SVO operates directly on pixel intensities, which results in subpixel precision at high frame rates. A probabilistic mapping method that explicitly models outlier and depth uncertainty is used to estimate 3D points, which results in fewer outliers and more reliable points. Precise and high-frame-rate motion estimation brings increased robustness in scenes with little, repetitive, and high-frequency texture. The algorithm is applied to micro-aerial-vehicle state estimation in GPS-denied environments and runs at 55 frames per second on the onboard embedded computer, at more than 400 frames per second on an i7 consumer laptop, and at more than 70 frames per second on a smartphone processor (e.g., Odroid or Samsung Galaxy phones).

Download the code from GitHub.


Reference

ICRA2014_Forster

C. Forster, M. Pizzoli, D. Scaramuzza

SVO: Fast Semi-Direct Monocular Visual Odometry

IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, 2014.

PDF YouTube Software


ROS Driver and Calibration Tool for the Dynamic Vision Sensor (DVS)


The RPG DVS ROS package allows using the Dynamic Vision Sensor (DVS) within the Robot Operating System (ROS). It also contains a calibration tool for intrinsic and stereo calibration using a blinking pattern.

The code with instructions on how to use it is hosted on GitHub.

Authors: Elias Mueggler, Basil Huber, Luca Longinotti, Tobi Delbruck

References

E. Mueggler, B. Huber, D. Scaramuzza Event-based, 6-DOF Pose Tracking for High-Speed Maneuvers, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Chicago, 2014. [ PDF ]

A. Censi, J. Strubel, C. Brandli, T. Delbruck, D. Scaramuzza Low-latency localization by Active LED Markers tracking using a Dynamic Vision Sensor, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Tokyo, 2013. [ PDF ]

P. Lichtsteiner, C. Posch, T. Delbruck A 128x128 120dB 15us Latency Asynchronous Temporal Contrast Vision Sensor, IEEE Journal of Solid State Circuits, Feb. 2008, 43(2), 566-576. [ PDF ]



A Monocular Pose Estimation System based on Infrared LEDs


Mutual localization is a fundamental component for multi-robot missions. Our monocular pose estimation system consists of multiple infrared LEDs and a camera with an infrared-pass filter. The LEDs are attached to the robot that we want to track, while the observing robot is equipped with the camera.

The code with instructions on how to use it is hosted on GitHub.


Reference

Matthias Faessler, Elias Mueggler, Karl Schwabe and Davide Scaramuzza, A Monocular Pose Estimation System based on Infrared LEDs, Proc. IEEE International Conference on Robotics and Automation (ICRA), 2014, Hong Kong. [ PDF ]



Torque Control of a KUKA youBot Arm


Existing control schemes for the KUKA youBot arm, such as directly controlling joint positions or velocities, are not suited for close tracking of end effector trajectories. A torque controller, based on the dynamical model of the youBot arm, was implemented to overcome this limitation. Complementary to the controller, a framework to automatically generate trajectories was developed.

The code with instructions on how to use it is hosted on GitHub. Details are provided in the Master Thesis of Benjamin Keiser.

Authors: Benjamin Keiser, Matthias Faessler, Elias Mueggler

Reference

B. Keiser, E. Mueggler, M. Faessler, D. Scaramuzza Torque Control of a KUKA youBot Arm, Master Thesis, University of Zurich, September, 2013. [ PDF ]



Dataset: Air-Ground Matching of Airborne images with Google Street View data


Matching airborne images to ground-level ones is a challenging problem: extreme changes in viewpoint and scale occur between the aerial Micro Aerial Vehicle (MAV) images and the ground-level images, in addition to the challenges already present in ground-level visual search for UGV applications, such as illumination changes, lens distortion, seasonal variation of the vegetation, and scene changes between the query and the database images.

Our dataset consists of image data captured with a small quadrotor flying in the streets of Zurich (up to 15 meters from the ground), along a path of 2 km, and includes: (1) aerial MAV images, (2) ground-level Google Street View images, (3) a ground-truth confusion matrix, and (4) GPS data (geotags) for every database image.

Download dataset.

Authors: Andras Majdik and Yves Albers-Schoenberg


Reference

A.L. Majdik, D. Verda, Y. Albers-Schoenberg, D. Scaramuzza Air-ground Matching: Appearance-based GPS-denied Urban Localization of Micro Aerial Vehicles Journal of Field Robotics, 2015. [ PDF ]

A. L. Majdik, D. Verda, Y. Albers-Schoenberg, D. Scaramuzza Micro Air Vehicle Localization and Position Tracking from Textured 3D Cadastral Models IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, 2014. [ PDF ]

A. Majdik, Y. Albers-Schoenberg, D. Scaramuzza. MAV Urban Localization from Google Street View Data, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Tokyo, 2013. [ PDF ] [ PPT ]


Perspective 3-Point (P3P) Algorithm


The Perspective-Three-Point (P3P) problem aims at determining the position and orientation of a camera in the world reference frame from three 2D-3D point correspondences. Most solutions attempt to first solve for the position of the points in the camera reference frame, and then compute the point-aligning transformation between the camera and the world frame. In contrast, this work proposes a novel closed-form solution to the P3P problem, which computes the aligning transformation directly in a single stage, without the intermediate derivation of the points in the camera frame. This is made possible by introducing intermediate camera and world reference frames and expressing their relative position and orientation using only two parameters. The projection of a world point into the parametrized camera pose then leads to two conditions and, finally, a quartic equation with up to four solutions for the parameter pair. A subsequent back-substitution directly yields the corresponding camera poses with respect to the world reference frame. The superior computational efficiency makes the method particularly suitable for the RANSAC outlier-rejection step, which is always recommended before applying PnP or a non-linear optimization of the final solution.
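As an illustration of the recommended use inside RANSAC, here is a minimal sketch; `solve_p3p` is a hypothetical stand-in for the released solver and is assumed to return up to four (R, t) candidates from three 2D-3D correspondences.

```python
# Sketch: RANSAC pose estimation with a P3P minimal solver and a
# reprojection-error inlier test.
import numpy as np

def reproject(pts3d, R, t, K):
    """pts3d: (N, 3) world points; returns (N, 2) pixel coordinates."""
    p_c = (R @ pts3d.T).T + t
    return (K @ (p_c / p_c[:, 2:]).T).T[:, :2]

def ransac_p3p(pts3d, pts2d, K, solve_p3p, iters=200, thresh=2.0):
    best_pose, best_inliers = None, 0
    rng = np.random.default_rng(0)
    for _ in range(iters):
        idx = rng.choice(len(pts3d), 3, replace=False)      # minimal sample
        for R, t in solve_p3p(pts3d[idx], pts2d[idx], K):    # up to 4 solutions
            err = np.linalg.norm(reproject(pts3d, R, t, K) - pts2d, axis=1)
            inliers = int((err < thresh).sum())
            if inliers > best_inliers:
                best_pose, best_inliers = (R, t), inliers
    return best_pose, best_inliers
```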

Download C/C++ code

Author: Laurent Kneip


Reference

L. Kneip, D. Scaramuzza, R. Siegwart. A Novel Parameterization of the Perspective-Three-Point Problem for a Direct Computation of Absolute Camera Position and Orientation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, USA, 2011. [ PDF ]


OCamCalib: Omnidirectional Camera Calibration Toolbox for Matlab


Omnidirectional Camera Calibration Toolbox for Matlab (for Windows, MacOS, and Linux) for catadioptric and fisheye cameras up to 195 degrees.

Code, tutorials, and datasets can be found here.

Author: Davide Scaramuzza


Reference

D. Scaramuzza, A. Martinelli, R. Siegwart. A Toolbox for Easily Calibrating Omnidirectional Cameras. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2006), Beijing, China, October 2006. [ PDF ]

D. Scaramuzza, A. Martinelli, R. Siegwart. A Flexible Technique for Accurate Omnidirectional Camera Calibration and Structure from Motion. IEEE International Conference on Computer Vision Systems (ICVS 2006), New York, USA, January 2006. [ PDF ]