Language-Image Models
with 3D Understanding

Jang Hyun Cho^1,2, Boris Ivanovic², Yulong Cao², Edward Schmerling², Yue Wang², Xinshuo Weng², Boyi Li², Yurong You², Philipp Krähenbühl^1,*, Yan Wang^2,*, Marco Pavone^2,*

¹UT Austin, ²NVIDIA Research ^*Equal advising

arXiv Code (coming soon)

Cube-LLM can reason about both indoor and outdoor scenes in 3D from a single image.

Overview

Cube-LLM consists of a vision encoder (DINO v2-L) and a decoder-only LLM (Vicuna 7B). We avoid any 3D-specific architecture design or training objective.

Visual CoT Prompting

Just like LLM, Cube-LLM improves its prediction by chain-of-thought prompting (CoT), connecting similar reasoning steps together from 2D to 3D bounding boxes.

Specialist Model Prompting

Cube-LLM can further improve its predictions by incorporating specialist models of any modalities. Cube-LLM simply takes their predictions as additional prompt.

Outdoor Visualization

Indoor Visualization

BibTeX

@article{cho2024language,
      title={Language-Image Models with 3D Understanding},
      author={Cho, Jang Hyun and Ivanovic, Boris and Cao, Yulong and Schmerling, Edward and Wang, Yue and Weng, Xinshuo and Li, Boyi and You, Yurong and Kr{\"a}henb{\"u}hl, Philipp and Wang, Yan and others},
      journal={arXiv preprint arXiv:2405.03685},
      year={2024}
    }