UrbanLLaVA-ICCV2025

UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding

^†Department of Electronic Engineering, Tsinghua University
^‡Department of Computer Science and Technology, Tsinghua University
^§School of Electronic and Information Engineering, Beijing Jiaotong University
^¶University of Helsinki

Abstract

Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process four types of urban data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from a location view to a global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of UrbanLLaVA across diverse urban tasks. Finally, we extend existing benchmarks for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that UrbanLLaVA outperforms open-source and proprietary MLLMs in both single-modal tasks and complex cross-modal tasks and shows robust generalization abilities across cities.

@inproceedings{feng2025urbanllava,
  title={UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding},
  author={Feng, Jie and Wang, Shengyuan and Liu, Tianhui and Xi, Yanxin and Li, Yong},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}

The Framework of UrbanLLaVA.

Existing works vs. our UrbanLLaVA in urban research.

The thorough composition of UData in Beijing.

The performance of three-stage tuning; the gray part is the default tuning method for MLLMs.

An example.

Another example.


More for Urban Spatial Intelligence

CityGPT: Empowering Urban Spatial Cognition of Large Language Models

CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks

TrajAgent: An LLM-based Agent Framework for Automated Trajectory Modeling via Collaboration of Large and Small Models

AgentMove: A Large Language Model based Agentic Framework for Zero-shot Next Location Prediction
