CityNav: Language-Goal Aerial Navigation Dataset
with Geographic Information

Jungdae Lee1*, Taiki Miyanishi2,4,5*, Shuhei Kurita3,5, Koya Sakamoto6,4,
Daichi Azuma7, Yutaka Matsuo2, Nakamasa Inoue1
(* indicates equal contribution)
1Tokyo Institute of Technology, 2The University of Tokyo, 3NII, 4ATR,
5RIKEN AIP, 6Kyoto University, 7Sony Semiconductor Solutions

TL;DR: CityNav is a dataset for vision-and-language aerial navigation that consists of human-generated trajectories paired with natural language descriptions, grounded in real-world 3D cities.

[Teaser figure]

Overview

We introduce CityNav, a new dataset for language-goal aerial navigation built on 3D point cloud representations of real-world cities. CityNav includes 32,637 natural language descriptions paired with human demonstration trajectories, collected from participants via a new web-based 3D simulator developed for this research. Each description specifies a navigation goal by referring to the names and locations of landmarks in the real-world cities. We also provide baseline navigation agents that maintain an internal 2D spatial map representing the landmarks referenced in the descriptions.
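For concreteness, each sample pairs one description with one human demonstration trajectory and its goal. The sketch below shows one way such a sample could be organized in Python; the class name and field names are illustrative assumptions, not the dataset's actual schema or file format.

# Minimal sketch of how a CityNav-style sample might be organized.
# All field names below are illustrative assumptions, not the real schema.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CityNavSample:
    description: str                                # natural language goal description
    city: str                                       # source city of the 3D point cloud
    landmarks: List[str]                            # landmark names referenced in the description
    trajectory: List[Tuple[float, float, float]]    # human demonstration waypoints (x, y, z)
    goal: Tuple[float, float, float]                # target object position

sample = CityNavSample(
    description="Fly to the red-roofed building just north of the church.",
    city="example_city",
    landmarks=["church"],
    trajectory=[(0.0, 0.0, 30.0), (12.5, 4.0, 30.0)],
    goal=(25.0, 8.0, 0.0),
)
print(sample.description)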


Aerial Navigation Task

The aerial agent is randomly spawned in the city and must locate the target object corresponding to a given linguistic description, using the agent's first-person view images and geographic information.
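A minimal sketch of this episode loop is given below. The environment and agent interface (reset, step, act) and the discrete action set are assumptions made for illustration; they are not the benchmark's actual API.

# Hedged sketch of the language-goal aerial navigation loop.
from enum import Enum

class Action(Enum):
    # Hypothetical discrete action set for an aerial agent.
    MOVE_FORWARD = 0
    TURN_LEFT = 1
    TURN_RIGHT = 2
    GO_UP = 3
    GO_DOWN = 4
    STOP = 5

def run_episode(env, agent, description, max_steps=500):
    """Roll out one episode: the agent observes first-person RGB-D views and
    geographic information, and should STOP near the described target."""
    obs = env.reset(description)              # random spawn in the city
    for _ in range(max_steps):
        action = agent.act(obs, description)
        obs, done = env.step(action)
        if done or action == Action.STOP:
            break
    return env.distance_to_goal()              # success is judged by proximity to the goal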


Web-based 3D Flight Simulator

To collect trajectory data via the web, we developed a flight simulator that allows users to operate an aerial agent within 3D environments.


Dataset Statistics

[Figure: statistics of our proposed dataset]

Dataset statistics: (a) number of scenes and trajectories in each split; (b) distribution of collected trajectory lengths; (c) distribution of description lengths for those trajectories; (d) distribution of start-to-goal distances in the evaluation splits; (e) episode lengths of the shortest-path and human demonstration trajectories; and (f) action histograms for the shortest-path and human demonstration trajectories.


Map-based Goal Predictor

[Figure: overview of our proposed method]

Map-based Goal Predictor (MGP) is our proposed model that combines state-of-the-art off-the-shelf models to perform map-based goal prediction. At each time step, it builds a navigation map through three steps: (i) extraction of target, landmark, and surroundings names with GPT-3.5 Turbo; (ii) object detection and segmentation with GroundingDINO and Mobile-SAM; and (iii) optional coordinate refinement with LLaVA-1.6-34b using set-of-mark prompting. A map encoder, which takes the navigation map (a landmark map, view & explore area maps, and target & surroundings maps), is trained alongside the RGB and depth encoders of a Cross-Modal Attention policy.
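The per-step flow can be summarized with the structural sketch below. Every helper is a placeholder for the off-the-shelf model named in its docstring; the function names, signatures, and observation keys are assumptions made for illustration, not the actual MGP implementation or the APIs of those models.

# Structural sketch of one MGP time step (placeholders only, not the real implementation).

def extract_names(description: str) -> dict:
    """Step (i): use an LLM (GPT-3.5 Turbo in the paper) to extract the
    target, landmark, and surroundings names from the instruction."""
    raise NotImplementedError("query your LLM here")

def detect_and_segment(rgb, names):
    """Step (ii): open-vocabulary detection (GroundingDINO) followed by
    mask generation (Mobile-SAM) for the extracted names in the current view."""
    raise NotImplementedError("run detection and segmentation here")

def refine_coordinates(rgb, masks):
    """Step (iii, optional): refine goal coordinates with a VLM
    (LLaVA-1.6-34b) via set-of-mark prompting over the segmented regions."""
    raise NotImplementedError("query a VLM with set-of-mark prompts here")

def update_navigation_map(nav_map, masks, pose):
    """Project detected target/surroundings masks and the viewed/explored
    area onto the top-down navigation map consumed by the map encoder."""
    return nav_map  # placeholder: real code would rasterize masks into map channels

def mgp_step(obs, description, nav_map, policy):
    """One decision step: update the navigation map, then let the policy
    (map encoder plus RGB/depth encoders with cross-modal attention) pick an action."""
    names = extract_names(description)
    masks = detect_and_segment(obs["rgb"], names)
    nav_map = update_navigation_map(nav_map, masks, obs["pose"])
    action = policy(obs["rgb"], obs["depth"], nav_map)
    return action, nav_map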


Aerial Navigation Results

BibTeX


      @misc{lee2024citynav,
        title={CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information}, 
        author={Jungdae Lee and Taiki Miyanishi and Shuhei Kurita and Koya Sakamoto and Daichi Azuma and Yutaka Matsuo and Nakamasa Inoue},
        year={2024},
        eprint={2406.14240},
        archivePrefix={arXiv}
      }