TL;DR: CityNav is a dataset for vision-and-language aerial navigation consisting of human-generated trajectories paired with natural language descriptions in real-world 3D cities.
We introduce CityNav, a new dataset for language-goal aerial navigation using 3D point cloud representations of real-world cities. CityNav includes 32,637 natural language descriptions paired with human demonstration trajectories, collected from participants via a new web-based 3D simulator developed for this research. Each description specifies a navigation goal, leveraging the names and locations of landmarks within the real-world cities. We also provide baseline navigation agents that incorporate an internal 2D spatial map representing the landmarks referenced in the descriptions.
The aerial agent is randomly spawned in the city and must locate the target object corresponding to a given linguistic description, using the agent's first-person view images and geographic information.
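The episode structure described above can be sketched as follows. This is a minimal illustration assuming a hypothetical environment interface; the class, function, and action names are placeholders, not CityNav's actual API:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    """One language-goal aerial navigation episode (illustrative structure)."""
    description: str       # natural language description of the goal
    start_pose: tuple      # (x, y, z) spawn position, chosen randomly
    goal_position: tuple   # ground-truth target location
    success_radius: float = 20.0  # assumed success threshold in metres

def render_view(pose):
    """Placeholder: a real simulator would return a first-person RGB image."""
    return None

def apply_action(pose, action):
    """Toy kinematics: move one unit along the commanded axis."""
    delta = {"forward": (1, 0, 0), "back": (-1, 0, 0),
             "left": (0, 1, 0), "right": (0, -1, 0),
             "up": (0, 0, 1), "down": (0, 0, -1)}[action]
    return tuple(p + d for p, d in zip(pose, delta))

def run_episode(episode, policy, max_steps=500):
    """Roll out a policy mapping (view, pose, description) to an action.

    At each step the agent receives its first-person observation and its
    geographic coordinates, matching the task setup described above, and
    succeeds if it stops within the success radius of the goal.
    """
    pose = episode.start_pose
    for _ in range(max_steps):
        rgb = render_view(pose)
        action = policy(rgb, pose, episode.description)
        if action == "stop":
            break
        pose = apply_action(pose, action)
    dist = sum((a - b) ** 2
               for a, b in zip(pose, episode.goal_position)) ** 0.5
    return dist <= episode.success_radius
```

A trained agent would replace `policy` with a model conditioned on the description; the loop itself only fixes the observation-action interface.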
To collect trajectory data at scale, we developed a web-based flight simulator that allows users to operate an aerial agent within 3D environments.
Dataset Statistics: (a) the number of scenes and trajectories in each split, (b) the distribution of collected trajectory lengths, (c) the distribution of description lengths for those trajectories, (d) the start-to-goal distance distribution for the evaluation splits, (e) episode lengths for the shortest-path and human demonstration trajectories, and (f) action histograms for the shortest-path and human demonstration trajectories.
Map-based Goal Predictor (MGP) is our proposed model, which combines state-of-the-art off-the-shelf models to perform map-based goal prediction. It uses a navigation map generated at each time step through three steps: (i) extraction of target, landmark, and surroundings names with GPT-3.5 Turbo; (ii) object detection and segmentation with GroundingDINO and Mobile-SAM; (iii) optional coordinate refinement with LLaVA-1.6-34b using set-of-mark prompting. A map encoder, operating on a navigation map that includes a landmark map, view & explored-area maps, and target & surroundings maps, is trained jointly with the RGB and depth encoders of a Cross-Modal Attention model.
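As a rough illustration of the map-encoder input, the per-channel outputs of the three steps above can be rasterised into a single multi-channel grid. The channel order and grid resolution below are assumptions for illustration, not MGP's exact configuration:

```python
import numpy as np

# Assumed channel layout; the actual MGP map configuration may differ.
CHANNELS = ["landmark", "view_area", "explored_area", "target", "surroundings"]
GRID = 256  # map resolution (assumption)

def make_navigation_map(detections):
    """Rasterise per-channel detections into a (C, H, W) navigation map.

    `detections` maps a channel name to a list of (row, col) grid cells:
    e.g. landmark cells would come from GroundingDINO + Mobile-SAM masks,
    and target cells from the optional VLM refinement step.
    """
    nav_map = np.zeros((len(CHANNELS), GRID, GRID), dtype=np.float32)
    for c, name in enumerate(CHANNELS):
        for row, col in detections.get(name, []):
            nav_map[c, row, col] = 1.0
    return nav_map
```

The resulting tensor can then be fed to a convolutional map encoder alongside the RGB and depth streams.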
@misc{lee2024citynav,
title={CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information},
author={Jungdae Lee and Taiki Miyanishi and Shuhei Kurita and Koya Sakamoto and Daichi Azuma and Yutaka Matsuo and Nakamasa Inoue},
year={2024},
eprint={2406.14240},
archivePrefix={arXiv}
}