Language-Image Models with 3D Understanding

¹UT Austin, ²NVIDIA Research, *Equal advising

Cube-LLM can reason about both indoor and outdoor scenes in 3D from a single image.


Cube-LLM consists of a vision encoder (DINOv2-L) and a decoder-only LLM (Vicuna-7B). We avoid any 3D-specific architecture design or training objective.
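To make the two-component design concrete, here is a minimal shape-bookkeeping sketch. It assumes a LLaVA-style layout in which patch features from the vision encoder are linearly projected to the LLM width and prepended to the text tokens; that projector, the 224-pixel input resolution, and all names (`SketchConfig`, `num_visual_tokens`, `llm_input_shape`) are illustrative assumptions, not details stated on this page. The patch size and feature widths are those of the public DINOv2 ViT-L/14 and Vicuna-7B checkpoints.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    image_size: int = 224   # assumption: input resolution, not given here
    patch_size: int = 14    # DINOv2 ViT-L/14 patch size
    vision_dim: int = 1024  # ViT-L feature width
    llm_dim: int = 4096     # Vicuna-7B hidden size


def num_visual_tokens(cfg: SketchConfig) -> int:
    """Number of patch tokens the vision encoder emits for one image."""
    side = cfg.image_size // cfg.patch_size
    return side * side


def projector_weight_shape(cfg: SketchConfig) -> tuple:
    """Shape of the (assumed) linear map from vision features to LLM width."""
    return (cfg.vision_dim, cfg.llm_dim)


def llm_input_shape(cfg: SketchConfig, num_text_tokens: int) -> tuple:
    """Sequence shape seen by the decoder: visual tokens followed by text."""
    return (num_visual_tokens(cfg) + num_text_tokens, cfg.llm_dim)
```

Under these assumptions the decoder sees one flat token sequence, which is consistent with the claim that no 3D-specific module is needed: 2D and 3D boxes are just text the LLM reads and writes.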

[Figure: overview of Cube-LLM]

Visual CoT Prompting

Just like LLMs, Cube-LLM improves its predictions with chain-of-thought (CoT) prompting: it first predicts a 2D bounding box, then uses that prediction as an intermediate reasoning step toward the 3D bounding box.
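The 2D-to-3D chaining described above can be sketched as a two-step conversation. This is only an illustration: the question wording, the box text format, and the helper names (`format_box`, `build_cot_prompt`) are hypothetical and are not the prompt templates used by Cube-LLM.

```python
def format_box(values, precision=2):
    """Render box coordinates as a bracketed, fixed-precision string."""
    return "[" + ", ".join(f"{v:.{precision}f}" for v in values) + "]"


def build_cot_prompt(object_name, box_2d=None):
    """Build the conversation turns for a 2D-then-3D query.

    Step 1 (box_2d is None): ask only for the 2D box.
    Step 2 (box_2d given): replay the 2D answer as context, then ask
    for the 3D box so the model can condition on its own 2D prediction.
    """
    question_2d = f"Provide the 2D bounding box of the {object_name}."
    question_3d = f"Provide the 3D bounding box of the {object_name}."
    if box_2d is None:
        return [("user", question_2d)]
    return [
        ("user", question_2d),
        ("assistant", format_box(box_2d)),
        ("user", question_3d),
    ]


# Usage: the second call feeds the model's first answer back in.
step1 = build_cot_prompt("black car")
step2 = build_cot_prompt("black car", box_2d=[0.1, 0.25, 0.4, 0.6])
```

The design point is that the intermediate 2D box is ordinary conversation history, so the chaining requires no change to the model.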

Specialist Model Prompting

Cube-LLM can further improve its predictions by incorporating specialist models of any modality. It simply takes their predictions as an additional prompt.
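As a rough sketch of what "predictions as an additional prompt" could look like, the snippet below serializes a specialist's detections into text and places them before the question. The LiDAR detector, the `(label, center)` tuple layout, the text format, and the function name `specialist_prompt` are all assumptions made for illustration; the page does not specify the actual format.

```python
def specialist_prompt(question, detections, source="LiDAR detector"):
    """Prepend a specialist model's predictions to the user question.

    detections: list of (label, (x, y, z)) tuples; this layout is a
    hypothetical choice for the sketch.
    """
    if not detections:
        # Nothing to add: fall back to the plain question.
        return question
    lines = [f"Predictions from {source}:"]
    for label, center in detections:
        coords = ", ".join(f"{c:.1f}" for c in center)
        lines.append(f"- {label} at ({coords})")
    lines.append(question)
    return "\n".join(lines)


# Usage with made-up detections.
prompt = specialist_prompt(
    "Where is the nearest car?",
    [("car", (12.0, -1.5, 0.8)), ("pedestrian", (7.2, 3.0, 0.9))],
)
```

Because the specialist output enters as plain text, any model whose predictions can be written down could be plugged in this way, regardless of its input modality.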

Outdoor Visualization

Indoor Visualization


      title={Language-Image Models with 3D Understanding},
      author={Cho, Jang Hyun and Ivanovic, Boris and Cao, Yulong and Schmerling, Edward and Wang, Yue and Weng, Xinshuo and Li, Boyi and You, Yurong and Kr{\"a}henb{\"u}hl, Philipp and Wang, Yan and others},
      journal={arXiv preprint arXiv:2405.03685},