Menelaos - Premiere H2020 Results
Monodepth2 is a self-supervised neural network that generates dense depth maps using a Structure-from-Motion (SfM) approach. Monodepth2 was initially conceived for autonomous driving. We have adapted the network to drone flight by taking advantage of the flight parameters measured by various onboard instruments.
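The report does not detail how the onboard measurements are injected into the network; as a rough illustration only, the sketch below shows one plausible way to turn two attitude/position readings into a relative camera pose that could replace or complement a learned pose estimate. The function name, the "xyz" Euler convention and the use of SciPy are assumptions for illustration, not the project's actual pipeline.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def relative_pose_from_flight_data(att1_deg, pos1_m, att2_deg, pos2_m):
    """Hypothetical helper: build a 4x4 relative pose T_1->2 from two
    attitude (roll, pitch, yaw) and position readings. Frame conventions
    are placeholders; a real drone setup would also need the IMU-to-camera
    extrinsic calibration."""
    R1 = R.from_euler("xyz", att1_deg, degrees=True).as_matrix()
    R2 = R.from_euler("xyz", att2_deg, degrees=True).as_matrix()

    # Body-to-world poses at the two time instants (convention assumed)
    T1 = np.eye(4); T1[:3, :3] = R1; T1[:3, 3] = pos1_m
    T2 = np.eye(4); T2[:3, :3] = R2; T2[:3, 3] = pos2_m

    # Relative transform mapping points expressed in frame 1 into frame 2
    return np.linalg.inv(T2) @ T1
```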
Figure 1. Monodepth2 uses a standard, fully convolutional encoder-decoder network based on the general U-Net architecture. The encoder is a ResNet18 with 11M parameters, pretrained on ImageNet. The decoder has sigmoid activation functions at the output and ELU nonlinearities elsewhere.
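A minimal PyTorch sketch of an encoder-decoder in this spirit is shown below; it assumes a recent torchvision for the pretrained ResNet18 weights, the class name is hypothetical, and the skip connections and multi-scale outputs of the real Monodepth2 decoder are omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision import models

class TinyDepthNet(nn.Module):
    """Illustrative encoder-decoder in the spirit of Figure 1: a pretrained
    ResNet18 encoder and a small upsampling decoder with ELU activations and
    a sigmoid output (interpreted as disparity in (0, 1))."""
    def __init__(self):
        super().__init__()
        # Requires torchvision >= 0.13 for the weights enum
        resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        # Drop the average pooling and fully connected head: 512-channel
        # feature map at 1/32 of the input resolution
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])
        self.decoder = nn.Sequential(
            nn.Conv2d(512, 256, 3, padding=1), nn.ELU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(256, 128, 3, padding=1), nn.ELU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(128, 64, 3, padding=1), nn.ELU(),
            nn.Upsample(scale_factor=8, mode="nearest"),  # back to input size
            nn.Conv2d(64, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```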
The network is trained by self-supervision. More precisely, two chronologically adjacent frames I1 and I2 of the video sequence are considered. As in classical supervised learning, I1 is fed forward through the depth network, which outputs an estimate D1 of its depth map. Based on D1, and with the help of the camera intrinsic matrix K and the camera pose change T1,2 between the two frames, the pixels of I1 are projected onto I2. Once projected, their values are recalculated by bilinear interpolation of the neighbouring pixels in I2. The difference between the actual pixel values and the recalculated ones is the "engine" of the learning process.
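This projection-and-resampling step can be written compactly in PyTorch. The sketch below, with a hypothetical function name, back-projects the pixels of I1 using D1, K and T1,2, samples I2 bilinearly with grid_sample, and returns a simple L1 photometric difference; Monodepth2's actual loss additionally combines SSIM, a per-pixel minimum over source frames and auto-masking.

```python
import torch
import torch.nn.functional as F

def photometric_reprojection_loss(I1, I2, D1, K, T_1to2):
    """Sketch of the self-supervised signal described above.
    I1, I2: (B,3,H,W); D1: (B,1,H,W); K: (B,3,3); T_1to2: (B,4,4)."""
    B, _, H, W = I1.shape

    # Pixel grid of I1 in homogeneous coordinates, shape (3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H, dtype=I1.dtype),
                            torch.arange(W, dtype=I1.dtype), indexing="ij")
    pix = torch.stack([xs.reshape(-1), ys.reshape(-1), torch.ones(H * W)])

    # Back-project to 3D points of frame 1, then move them into frame 2
    rays = torch.inverse(K) @ pix.unsqueeze(0)            # (B,3,H*W)
    cam1 = D1.view(B, 1, -1) * rays                       # scale rays by depth
    cam1_h = torch.cat([cam1, torch.ones(B, 1, H * W)], dim=1)
    cam2 = (T_1to2 @ cam1_h)[:, :3]                       # (B,3,H*W)

    # Project into image 2 and normalise to [-1, 1] for grid_sample
    uv = K @ cam2
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=2).view(B, H, W, 2)

    # Recalculate the pixel values of I1 by bilinear interpolation in I2
    I1_resampled = F.grid_sample(I2, grid, align_corners=True,
                                 padding_mode="border")
    return (I1 - I1_resampled).abs().mean()
```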
Figure 2. Three examples of frames and the corresponding depth maps obtained with the original Monodepth2 and with the modified network adapted for drone flight. The ground truth is in the last column.
DATASETS
The following table summarises a series of datasets useful for our application. For each dataset, a short description and the website address from which it can be downloaded are provided. Items 1–10 are videos or high-resolution images captured by a UAV; the main criterion for selecting them was the availability of ground truth for depth and for camera position and orientation. Items 11–16 are non-drone, general-purpose benchmarks.
No. | Name | Short Description | Website address |
DRONE videos and images | |||
1 | The Zurich Urban Micro Aerial Vehicle Dataset | Full HD, low-altitude drone footage shot along 2 km of urban streets in Zurich. It also contains synchronized GPS & IMU sensor data and the ground-truth position of the camera, making it well suited for evaluating and benchmarking appearance-based topological localization, monocular visual odometry, and simultaneous localization and mapping (SLAM). | |
2 | Benchmark for Multi-Platform Photogrammetry | 8/16-bit, 8176 x 6132 TIF images shot from a drone with nadir and oblique views for photogrammetry purposes. The ground-truth position of the camera is also available. | http://www2.isprs.org/commissions/comm1/icwg15b/benchmark_main.html |
3 | Mini-drone video dataset | Front-view drone images. The purpose of the dataset is the detection of suspicious and illicit behaviours. No depth information or camera pose is given. | https://www.epfl.ch/labs/mmspg/downloads/mini-drone/ |
4 | Urban Drone Dataset (UDD) | 4096 x 2160, low frame rate, PNG images taken from a drone with nadir view. The images are pixel-wise semantically labelled for vegetation, buildings, roads, vehicles, roofs and other classes. No depth information is provided. | https://github.com/MarcWong/UDD |
5 | Drone (UAV) Dataset | Various images of rotary-wing UAVs. Each image comes with a bounding-box label for the location of the UAV inside the image. The purpose of the dataset is the training of systems for avoiding mid-air collisions with other UAVs. | https://www.kaggle.com/dasmehdixtr/drone-dataset-uav |
6 | The UZH FPV Dataset | 346 x 260 PNG, 22 fps, indoor and outdoor drone images with synchronized IMU measurements. The images were collected while a drone flew various racing tracks, resulting in aggressive trajectories. The ground-truth pose of the drone is available. | https://fpv.ifi.uzh.ch/ |
7 | Sense Fly Datasets | 4608 x 3456 JPG, non-sequential drone RGB images with nadir view for topographic mapping purposes. The flight height is approx. 100 m. The 3D point cloud of the reconstructed scene is provided, as well as the flight trajectory (GPS coordinates/altitude/timestamp). | https://www.sensefly.com/education/datasets/?dataset=3121 |
8 | Semantic Drone Dataset | 6000 x 4000, 1 Hz, drone RGB images with nadir view for semantic segmentation purposes. The images are pixel-wise semantically labelled for numerous object classes. The dataset also provides fish-eye stereo images sampled at 5 Hz with synchronized IMU measurements. | https://www.tugraz.at/index.php?id=22387 |
9 | The EuRoC MAV Dataset | 752 x 480, 20 fps, stereo-pair drone images taken while flying through industrial halls. The sequences are divided according to the difficulty of the drone trajectory. The ground-truth pose of the camera is provided for each sequence. | https://projects.asl.ethz.ch/datasets/doku.php?id=kmavvisualinertialdatasets |
10 | MidAir Dataset | 1021 x 1024, 30 fps, stereo-pair, synthetically rendered RGB images. The dataset consists of low-altitude drone flights through rough terrain. The stereo camera setup faces forward, while a third camera faces nadir. The ground-truth data include depth maps and the drone pose. | https://midair.ulg.ac.be/ |
NON-DRONE general purpose benchmarks | |||
11 | KITTI Vision Benchmark Suite | 1392 x 512 images recorded from an automotive platform equipped with both RGB and grayscale stereo cameras. Environmental depth is provided via a 360° Velodyne laser scanner. The ground-truth pose of the camera is given through a GPS & IMU unit mounted on board. | http://www.cvlibs.net/datasets/kitti/index.php |
12 | 3DF Zephyr reconstruction showcase | 3648 x 5472 non-drone images of various monuments, statues and historic buildings, taken from different angles and used for their virtual 3D reconstruction. No environmental depth or camera pose information is available. | https://www.3dflow.net/3df-zephyr-reconstruction-showcase/ |
13 | Cityscapes dataset | 2048 x 1024, 30 fps, automotive stereo images with precomputed ground truth depth maps and vehicle odometry & GPS data available. | https://www.cityscapes-dataset.com/ |
14 | NYU Depth Dataset V2 | 640 x 480 BMP, non-sequential, indoor images synchronized with the Microsoft Kinect depth map and gyroscopic data (consisting of the roll, yaw and pitch angles of the camera device). | https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html |
15 | Make3D Benchmark | 1024 x 768 JPG, non-sequential, RGB stereo-pair images from outdoors. Ground-truth depth is provided through LiDAR scanning. | http://make3d.cs.cornell.edu/data.html#imagelaserstereo |
16 | Scene Flow Dataset | 960 x 540 stereo-pair RGB images rendered from various synthetic virtual sequences. Pixel-wise segmentation, optical flow maps and disparity ground-truth maps are available. | https://lmb.informatik.uni-freiburg.de/resources/datasets/SceneFlowDatasets.en.html |