TL;DR: CityNav is a dataset for vision-and-language aerial navigation consisting of human-generated trajectories paired with natural language descriptions in real-world 3D cities.
We introduce CityNav, a new dataset for language-goal aerial navigation using 3D point cloud representations of real-world cities. CityNav includes 32,637 natural language descriptions paired with human demonstration trajectories, collected from participants via a new web-based 3D simulator developed for this research. Each description specifies a navigation goal, leveraging the names and locations of landmarks within the real-world cities. We also provide baseline navigation agents that incorporate an internal 2D spatial map representing the landmarks referenced in the descriptions.
The aerial agent is randomly spawned in the city and must locate the target object corresponding to a given linguistic description, using the agent's first-person view images and geographic information.
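The episode structure described above can be sketched as follows. This is a minimal illustration assuming a hypothetical environment interface; the class, function, and action names are placeholders, not CityNav's actual API:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    """One language-goal aerial navigation episode (illustrative structure)."""
    description: str       # natural language description of the goal
    start_pose: tuple      # (x, y, z) spawn position, chosen randomly
    goal_position: tuple   # ground-truth target location
    success_radius: float = 20.0  # assumed success threshold in metres

def render_view(pose):
    """Placeholder: a real simulator would return a first-person RGB image."""
    return None

def apply_action(pose, action):
    """Toy kinematics: move one unit along the commanded axis."""
    delta = {"forward": (1, 0, 0), "back": (-1, 0, 0),
             "left": (0, 1, 0), "right": (0, -1, 0),
             "up": (0, 0, 1), "down": (0, 0, -1)}[action]
    return tuple(p + d for p, d in zip(pose, delta))

def run_episode(episode, policy, max_steps=500):
    """Roll out a policy mapping (view, pose, description) to an action.

    At each step the agent receives its first-person observation and its
    geographic coordinates, matching the task setup described above, and
    succeeds if it stops within the success radius of the goal.
    """
    pose = episode.start_pose
    for _ in range(max_steps):
        rgb = render_view(pose)
        action = policy(rgb, pose, episode.description)
        if action == "stop":
            break
        pose = apply_action(pose, action)
    dist = sum((a - b) ** 2
               for a, b in zip(pose, episode.goal_position)) ** 0.5
    return dist <= episode.success_radius
```

A trained agent would replace `policy` with a model conditioned on the description; the loop itself only fixes the observation-action interface.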
To collect trajectory data at scale, we developed a web-based flight simulator that allows users to operate an aerial agent within 3D environments.
Dataset Statistics: (a) the number of scenes and trajectories in each split, (b) the distribution of collected trajectory lengths, (c) the distribution of description lengths for those trajectories, (d) the start-to-goal distance distribution for the evaluation splits, (e) episode lengths for the shortest-path and human demonstration trajectories, and (f) action histograms for the shortest-path and human demonstration trajectories.
Map-based Goal Predictor (MGP) is our proposed model, which combines state-of-the-art off-the-shelf models to perform map-based goal prediction. It uses a navigation map generated at each time step through three steps: (i) extraction of target, landmark, and surroundings names with GPT-3.5 Turbo; (ii) object detection and segmentation with GroundingDINO and Mobile-SAM; (iii) optional coordinate refinement with LLaVA-1.6-34b using set-of-mark prompting. A map encoder, operating on a navigation map that includes a landmark map, view & explored-area maps, and target & surroundings maps, is trained jointly with the RGB and depth encoders of a Cross-Modal Attention model.
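As a rough illustration of the map-encoder input, the per-channel outputs of the three steps above can be rasterised into a single multi-channel grid. The channel order and grid resolution below are assumptions for illustration, not MGP's exact configuration:

```python
import numpy as np

# Assumed channel layout; the actual MGP map configuration may differ.
CHANNELS = ["landmark", "view_area", "explored_area", "target", "surroundings"]
GRID = 256  # map resolution (assumption)

def make_navigation_map(detections):
    """Rasterise per-channel detections into a (C, H, W) navigation map.

    `detections` maps a channel name to a list of (row, col) grid cells:
    e.g. landmark cells would come from GroundingDINO + Mobile-SAM masks,
    and target cells from the optional VLM refinement step.
    """
    nav_map = np.zeros((len(CHANNELS), GRID, GRID), dtype=np.float32)
    for c, name in enumerate(CHANNELS):
        for row, col in detections.get(name, []):
            nav_map[c, row, col] = 1.0
    return nav_map
```

The resulting tensor can then be fed to a convolutional map encoder alongside the RGB and depth streams.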
@misc{lee2024citynav,
title={CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information},
author={Jungdae Lee and Taiki Miyanishi and Shuhei Kurita and Koya Sakamoto and Daichi Azuma and Yutaka Matsuo and Nakamasa Inoue},
year={2024},
eprint={2406.14240},
archivePrefix={arXiv}
}