Menelaos - Premiere H2020 Results

Monodepth2 is a self-supervised neural network that generates dense depth maps following the Structure from Motion (SfM) approach. Monodepth2 was initially conceived for autonomous driving. We have adapted the network to drone flight by taking advantage of the flight parameters measured by various onboard instruments.
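One way such onboard measurements can enter the pipeline is to assemble the relative camera pose between two consecutive frames directly from the measured attitude and position, instead of estimating it with a pose network. The following is a minimal sketch of this idea in Python (NumPy + SciPy); the function name, the choice of roll/pitch/yaw angles and GPS positions as inputs, and the camera-to-world convention are illustrative assumptions, not the actual implementation used in the project.

import numpy as np
from scipy.spatial.transform import Rotation as R

def relative_pose(rpy1, rpy2, pos1, pos2):
    # Hypothetical helper: build the 4x4 relative pose T1,2 from onboard data.
    # rpy1, rpy2: (roll, pitch, yaw) in radians from the IMU/AHRS for frames 1 and 2
    # pos1, pos2: camera positions in a common world frame (e.g. from GPS/barometer)
    # Assumption: the rotations below are camera-to-world rotations.
    R1 = R.from_euler("xyz", rpy1).as_matrix()
    R2 = R.from_euler("xyz", rpy2).as_matrix()
    R_rel = R2.T @ R1                                      # rotation from camera 1 to camera 2
    t_rel = R2.T @ (np.asarray(pos1) - np.asarray(pos2))   # translation expressed in camera-2 coordinates
    T = np.eye(4)
    T[:3, :3] = R_rel
    T[:3, 3] = t_rel
    return T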


Figure 1. Monodepth2 uses a standard, fully convolutional encoder-decoder network based on the general U-Net architecture. The encoder is a ResNet18 with 11M parameters, pretrained on ImageNet. The decoder uses sigmoid activation functions at the output and ELU nonlinearities elsewhere.
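The following is a minimal sketch of such an encoder-decoder, assuming PyTorch and torchvision. The layer widths, the plain sequential decoder without skip connections and the single-scale output are simplifications; the real Monodepth2 decoder has five up-sampling stages with skip connections to the matching encoder stages and multi-scale disparity outputs.

import torch.nn as nn
import torchvision.models as models

class DepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: ResNet18 trunk pretrained on ImageNet, without the classification head
        resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])
        # Decoder: up-convolutions with ELU nonlinearities and a sigmoid at the output
        self.decoder = nn.Sequential(
            nn.Conv2d(512, 256, 3, padding=1), nn.ELU(), nn.Upsample(scale_factor=2),
            nn.Conv2d(256, 128, 3, padding=1), nn.ELU(), nn.Upsample(scale_factor=2),
            nn.Conv2d(128, 64, 3, padding=1), nn.ELU(), nn.Upsample(scale_factor=2),
            nn.Conv2d(64, 32, 3, padding=1), nn.ELU(), nn.Upsample(scale_factor=2),
            nn.Conv2d(32, 16, 3, padding=1), nn.ELU(), nn.Upsample(scale_factor=2),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),  # dense disparity/depth map in (0, 1)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))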

The network is trained in a self-supervised manner. More precisely, two chronologically adjacent frames I1 and I2 of the video sequence are considered. As in classical supervised learning, I1 is fed forward through the depth network, which outputs an estimate D1 of its depth map. Based on D1, and with the help of the camera intrinsic matrix K and the camera pose change T1,2 between the two frames, the pixels of I1 are projected onto I2. Once projected, their values are recalculated by bilinear interpolation of the neighbouring pixels in I2. The difference between the actual values and the recalculated ones is the "engine" of the learning process.
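A minimal sketch of this view-synthesis step, assuming PyTorch, is given below. Here img1 and img2 correspond to I1 and I2, depth1 to D1, K is the 3x3 intrinsic matrix and T12 the 4x4 pose change between the two frames; the single-scale L1 error is a simplification of the full Monodepth2 loss, which also uses SSIM, multi-scale outputs and a per-pixel minimum reprojection term.

import torch
import torch.nn.functional as F

def photometric_loss(img1, img2, depth1, K, T12):
    # Photometric error between I1 and its version re-synthesised from I2.
    b, _, h, w = img1.shape
    # Pixel grid of frame 1 in homogeneous coordinates, shape (3, H*W)
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs.flatten(), ys.flatten(), torch.ones(h * w)])
    # Back-project to 3D points in the camera-1 frame using the predicted depth D1
    cam1 = torch.inverse(K) @ pix * depth1.view(b, 1, -1)
    cam1 = torch.cat([cam1, torch.ones(b, 1, h * w)], dim=1)
    # Transform into the camera-2 frame with T1,2 and project with the intrinsics K
    cam2 = (T12 @ cam1)[:, :3]
    proj = K @ cam2
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    # Normalise pixel coordinates to [-1, 1] and bilinearly sample I2 at those positions
    grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1,
                        2 * uv[:, 1] / (h - 1) - 1], dim=-1).view(b, h, w, 2)
    i1_resynth = F.grid_sample(img2, grid, mode="bilinear",
                               padding_mode="border", align_corners=True)
    # The difference between the actual and the recalculated values drives the training
    return (img1 - i1_resynth).abs().mean()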


Figure 2. Three examples of frames and the corresponding depth maps obtained with the original Monodepth2 and with the modified network adapted for drone flight. The ground truth is shown in the last column.

DATASETS

The following table summarises a series of datasets useful for our application. For each dataset, a short description and the website address from which it can be downloaded are provided. Items 1 – 10 are videos or high-resolution images captured by a UAV. The main criterion for selecting them was the availability of ground truth for depth and for camera position and orientation. Items 11 – 16 are non-drone general-purpose benchmarks.


No. Name Short Description Website address
DRONE videos and images
1 The Zurich Urban Micro Aerial Vehicle Dataset Full HD, low-altitude drone footage shot along 2 km of urban streets in Zurich. It also contains synchronized GPS & IMU sensor data and the ground truth position of the camera, making it ideal for evaluating and benchmarking appearance-based topological localization, monocular visual odometry and simultaneous localization and mapping (SLAM). http://rpg.ifi.uzh.ch/zurichmavdataset.html

2 Benchmark for Multi-Platform Photogrammetry 8/16-bit, 8176 x 6132 TIFF images shot from a drone with nadir and oblique views for photogrammetry purposes. The ground truth position of the camera is also available. http://www2.isprs.org/commissions/comm1/icwg15b/benchmark_main.html
3 Mini-drone video dataset Front-view drone images. The purpose of the dataset is the detection of suspicious and illicit behaviours. No depth information or camera pose is given. https://www.epfl.ch/labs/mmspg/downloads/mini-drone/
4 Urban Drone Dataset (UDD) 4096 x 2160, low frame rate, PNG images taken from a drone with nadir view. The images are pixel-wise semantically labelled as vegetation, buildings, roads, vehicles, roofs and others. No depth information is provided. https://github.com/MarcWong/UDD
5 Drone (UAV) Dataset Various images of rotary-wing UAVs. Each image comes with a bounding-box label for the location of the UAV inside the image. The purpose of the dataset is the training of a system for avoiding mid-air collisions with other UAVs. https://www.kaggle.com/dasmehdixtr/drone-dataset-uav
6 The UZH FPV Dataset 346 x 260 PNG, 22 fps, indoor and outdoor drone images with synchronized IMU measurements. The images were collected while a drone flies various racing tracks, resulting in aggressive trajectories. The ground truth pose of the drone is available. https://fpv.ifi.uzh.ch/
7 Sense Fly Datasets 4608 x 3456 JPG, non-sequential, drone RGB images with nadir view for topographic mapping purposes. The flight height is approx. 100 m. The 3D point cloud of the reconstructed scene is provided, as well as the flight trajectory (GPS coordinates/altitude/timestamp). https://www.sensefly.com/education/datasets/?dataset=3121
8 Semantic Drone Dataset 6000 x 4000, 1 Hz, drone RGB images with nadir view for semantic segmentation purposes. The images are pixel-wise semantically labelled for many object classes. The dataset also provides fish-eye stereo images sampled at 5 Hz with synchronized IMU measurements. https://www.tugraz.at/index.php?id=22387
9 The EuRoC MAV Dataset 752 x 480, 20 fps, stereo-pair drone images taken while flying through industrial halls. The sequences are divided according to the difficulty of the drone trajectory. The ground truth pose of the camera is provided for each sequence. https://projects.asl.ethz.ch/datasets/doku.php?id=kmavvisualinertialdatasets
10 MidAir Dataset 1021 x 1024, 30 fps, stereo-pair, synthetically rendered RGB images. It consists of low-altitude drone flights over rough terrain. The stereo camera setup faces forward, while a third camera faces nadir. The ground truth data include depth maps and the drone pose. https://midair.ulg.ac.be/
NON-DRONE general-purpose benchmarks
11 KITTI Vision Benchmark Suite 1392 x 512 images recorded from an automotive platform equipped with both RGB and grayscale stereo cameras. Environmental depth is provided via a 360° Velodyne laser scanner. The ground truth pose of the camera is given through a GPS & IMU unit mounted on board. http://www.cvlibs.net/datasets/kitti/index.php
12 3DF Zephyr reconstruction showcase 3648 x 5472 non-drone images taken from different angles of various monuments, statues and historic buildings, used for their virtual 3D reconstruction. No environmental depth or camera pose information is available. https://www.3dflow.net/3df-zephyr-reconstruction-showcase/
13 Cityscapes dataset 2048 x 1024, 30 fps, automotive stereo images with precomputed ground truth depth maps and vehicle odometry & GPS data available. https://www.cityscapes-dataset.com/
14 NYU Depth Dataset V2 640 x 480 BMP, non-sequential, indoor images synchronized with the Microsoft Kinect depth map and with gyroscopic data (consisting of the roll, yaw and pitch angles of the camera device). https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
15 Make3D Benchmark 1024 x 768 JPG, non-sequential, RGB stereo-pair images taken outdoors. Ground truth depth is provided through LiDAR scanning. http://make3d.cs.cornell.edu/data.html#imagelaserstereo
16 Scene Flow Dataset 960 x 540 stereo-pair RGB images rendered from various synthetic virtual sequences. Pixel-wise segmentation, optical flow maps and ground truth disparity maps are available. https://lmb.informatik.uni-freiburg.de/resources/datasets/SceneFlowDatasets.en.html