Menelaos - Premiere H2020 Results
Monodepth2 is a self-supervised neural network that generates dense depth maps using a Structure-from-Motion (SfM) approach. Monodepth2 was initially conceived for autonomous driving. We have adapted the network to drone flight by taking advantage of the flight parameters measured by various onboard instruments.
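The report does not detail how the onboard measurements are injected into the network; as a rough illustration only, the sketch below shows one plausible way to turn two attitude/position readings into a relative camera pose that could replace or complement a learned pose estimate. The function name, the "xyz" Euler convention and the use of SciPy are assumptions for illustration, not the project's actual pipeline.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def relative_pose_from_flight_data(att1_deg, pos1_m, att2_deg, pos2_m):
    """Hypothetical helper: build a 4x4 relative pose T_1->2 from two
    attitude (roll, pitch, yaw) and position readings. Frame conventions
    are placeholders; a real drone setup would also need the IMU-to-camera
    extrinsic calibration."""
    R1 = R.from_euler("xyz", att1_deg, degrees=True).as_matrix()
    R2 = R.from_euler("xyz", att2_deg, degrees=True).as_matrix()

    # Body-to-world poses at the two time instants (convention assumed)
    T1 = np.eye(4); T1[:3, :3] = R1; T1[:3, 3] = pos1_m
    T2 = np.eye(4); T2[:3, :3] = R2; T2[:3, 3] = pos2_m

    # Relative transform mapping points expressed in frame 1 into frame 2
    return np.linalg.inv(T2) @ T1
```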
Figure 1. Monodepth2 uses a standard, fully convolutional encoder-decoder network based on the general U-Net architecture. The encoder is a ResNet18 with 11M parameters, pretrained on ImageNet. The decoder has sigmoid activation functions at the output and ELU nonlinearities elsewhere.
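A minimal PyTorch sketch of an encoder-decoder in this spirit is shown below; it assumes a recent torchvision for the pretrained ResNet18 weights, the class name is hypothetical, and the skip connections and multi-scale outputs of the real Monodepth2 decoder are omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision import models

class TinyDepthNet(nn.Module):
    """Illustrative encoder-decoder in the spirit of Figure 1: a pretrained
    ResNet18 encoder and a small upsampling decoder with ELU activations and
    a sigmoid output (interpreted as disparity in (0, 1))."""
    def __init__(self):
        super().__init__()
        # Requires torchvision >= 0.13 for the weights enum
        resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        # Drop the average pooling and fully connected head: 512-channel
        # feature map at 1/32 of the input resolution
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])
        self.decoder = nn.Sequential(
            nn.Conv2d(512, 256, 3, padding=1), nn.ELU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(256, 128, 3, padding=1), nn.ELU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(128, 64, 3, padding=1), nn.ELU(),
            nn.Upsample(scale_factor=8, mode="nearest"),  # back to input size
            nn.Conv2d(64, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```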
The network is trained by self-supervision. More precisely, two chronologically adjacent frames I1 and I2 of the video sequence are considered. As in classical supervised learning, I1 is fed forward through the depth network, which outputs an estimate D1 of its depth map. Based on D1, and with the help of the camera intrinsic matrix K and the camera pose change T1,2 between the two frames, the pixels of I1 are projected onto I2. Once projected, their values are recalculated by bilinear interpolation of the neighbouring pixels in I2. The difference between the actual pixel values and the recalculated ones is the "engine" of the learning process.
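This projection-and-resampling step can be written compactly in PyTorch. The sketch below, with a hypothetical function name, back-projects the pixels of I1 using D1, K and T1,2, samples I2 bilinearly with grid_sample, and returns a simple L1 photometric difference; Monodepth2's actual loss additionally combines SSIM, a per-pixel minimum over source frames and auto-masking.

```python
import torch
import torch.nn.functional as F

def photometric_reprojection_loss(I1, I2, D1, K, T_1to2):
    """Sketch of the self-supervised signal described above.
    I1, I2: (B,3,H,W); D1: (B,1,H,W); K: (B,3,3); T_1to2: (B,4,4)."""
    B, _, H, W = I1.shape

    # Pixel grid of I1 in homogeneous coordinates, shape (3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H, dtype=I1.dtype),
                            torch.arange(W, dtype=I1.dtype), indexing="ij")
    pix = torch.stack([xs.reshape(-1), ys.reshape(-1), torch.ones(H * W)])

    # Back-project to 3D points of frame 1, then move them into frame 2
    rays = torch.inverse(K) @ pix.unsqueeze(0)            # (B,3,H*W)
    cam1 = D1.view(B, 1, -1) * rays                       # scale rays by depth
    cam1_h = torch.cat([cam1, torch.ones(B, 1, H * W)], dim=1)
    cam2 = (T_1to2 @ cam1_h)[:, :3]                       # (B,3,H*W)

    # Project into image 2 and normalise to [-1, 1] for grid_sample
    uv = K @ cam2
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=2).view(B, H, W, 2)

    # Recalculate the pixel values of I1 by bilinear interpolation in I2
    I1_resampled = F.grid_sample(I2, grid, align_corners=True,
                                 padding_mode="border")
    return (I1 - I1_resampled).abs().mean()
```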
Figure 2. Three examples of frames and the corresponding depth maps obtained with the original Monodepth2 and with the modified network adapted for drone flight. The ground truth is in the last column.
DATASETS
The following table summarises a series of datasets useful for our application. For each dataset, a short description and the website address from which it can be downloaded are provided. Items 1–10 are videos or high-resolution images captured by a UAV; the main criterion for selecting them was the availability of ground truth for depth and for camera position and orientation. Items 11–16 are non-drone, general-purpose benchmarks.
No. | Name | Short Description | Website address |
DRONE videos and images | |||
1 | The Zurich Urban Micro Aerial Vehicle Dataset | Full HD, low-altitude drone footage shot along 2 km of urban streets in Zurich. It also contains synchronized GPS & IMU sensor data and the ground-truth position of the camera, making it well suited for evaluating and benchmarking appearance-based topological localization, monocular visual odometry, and simultaneous localization and mapping (SLAM). | |
2 | Benchmark for Multi-Platform Photogrammetry | 8/16-bit, 8176 x 6132 TIF images shot from a drone with nadir and oblique views for photogrammetry purposes. The ground-truth position of the camera is also available. | http://www2.isprs.org/commissions/comm1/icwg15b/benchmark_main.html |
3 | Mini-drone video dataset | Front-view drone images. The purpose of the dataset is the detection of suspicious and illicit behaviours. No depth information or camera pose is given. | https://www.epfl.ch/labs/mmspg/downloads/mini-drone/ |
4 | Urban Drone Dataset (UDD) | 4096 x 2160, low frame rate, PNG images taken from a drone with nadir view. The images are pixel-wise semantically labelled for vegetation, buildings, roads, vehicles, roofs and other classes. No depth information is provided. | https://github.com/MarcWong/UDD |
5 | Drone (UAV) Dataset | Various images of rotary-wing UAVs. Each image comes with a bounding-box label for the location of the UAV inside the image. The purpose of the dataset is the training of systems for avoiding mid-air collisions with other UAVs. | https://www.kaggle.com/dasmehdixtr/drone-dataset-uav |
6 | The UZH FPV Dataset | 346 x 260 PNG, 22 fps, indoor and outdoor drone images with synchronized IMU measurements. The images were collected while a drone flew various racing tracks, resulting in aggressive trajectories. The ground-truth pose of the drone is available. | https://fpv.ifi.uzh.ch/ |
7 | Sense Fly Datasets | 4608 x 3456 JPG, non-sequential drone RGB images with nadir view for topographic mapping purposes. The flight height is approx. 100 m. The 3D point cloud of the reconstructed scene is provided, as well as the flight trajectory (GPS coordinates/altitude/timestamp). | https://www.sensefly.com/education/datasets/?dataset=3121 |
8 | Semantic Drone Dataset | 6000 x 4000, 1 Hz, drone RGB images with nadir view for semantic segmentation purposes. The images are pixel-wise semantically labelled for numerous object classes. The dataset also provides fish-eye stereo images sampled at 5 Hz with synchronized IMU measurements. | https://www.tugraz.at/index.php?id=22387 |
9 | The EuRoC MAV Dataset | 752 x 480, 20 fps, stereo-pair drone images taken while flying through industrial halls. The sequences are divided according to the difficulty of the drone trajectory. The ground-truth pose of the camera is provided for each sequence. | https://projects.asl.ethz.ch/datasets/doku.php?id=kmavvisualinertialdatasets |
10 | MidAir Dataset | 1021 x 1024, 30 fps, stereo-pair, synthetically rendered RGB images. The dataset consists of low-altitude drone flights through rough terrain. The stereo camera setup faces forward, while a third camera faces nadir. The ground-truth data include depth maps and the drone pose. | https://midair.ulg.ac.be/ |
NON-DRONE general purpose benchmarks | |||
11 | KITTI Vision Benchmark Suite | 1392 x 512 images recorded from an automotive platform equipped with both RGB and grayscale stereo cameras. Environmental depth is provided via a 360° Velodyne laser scanner. The ground-truth pose of the camera is given through a GPS & IMU unit mounted on board. | http://www.cvlibs.net/datasets/kitti/index.php |
12 | 3DF Zephyr reconstruction showcase | 3648 x 5472 non-drone images of various monuments, statues and historic buildings, taken from different angles and used for their virtual 3D reconstruction. No environmental depth or camera pose information is available. | https://www.3dflow.net/3df-zephyr-reconstruction-showcase/ |
13 | Cityscapes dataset | 2048 x 1024, 30 fps, automotive stereo images with precomputed ground truth depth maps and vehicle odometry & GPS data available. | https://www.cityscapes-dataset.com/ |
14 | NYU Depth Dataset V2 | 640 x 480 BMP, non-sequential, indoor images synchronized with the Microsoft Kinect depth map and gyroscopic data (consisting of the roll, yaw and pitch angles of the camera device). | https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html |
15 | Make3D Benchmark | 1024 x 768 JPG, non-sequential, RGB stereo-pair images from outdoors. Ground-truth depth is provided through LiDAR scanning. | http://make3d.cs.cornell.edu/data.html#imagelaserstereo |
16 | Scene Flow Dataset | 960 x 540 stereo-pair RGB images rendered from various synthetic virtual sequences. Pixel-wise segmentation, optical flow maps and disparity ground-truth maps are available. | https://lmb.informatik.uni-freiburg.de/resources/datasets/SceneFlowDatasets.en.html |