Flame3D: Zero-shot Compositional Reasoning of 3D Scenes
with Agentic Language Models

Carnegie Mellon University
Flame3D answers compositional, multi-hop spatial queries about a 3D scene by composing chain-of-thought inferences over a structured visual-textual scene memory, a systematically designed set of spatial and visual tool calls, and external knowledge.
TL;DR

Flame3D achieves compositional reasoning by treating a 3D scene as an editable scene memory that is exposed to off-the-shelf MLLMs through spatio-visual tool calls. Rather than relying on a fixed set of operations, the agent synthesizes dynamic spatial programs at inference time to reason about free space, complex spatial relations, layouts, and hypothetical objects. New attributes and external knowledge can be integrated seamlessly without retraining, and no 3D-specific training or point-cloud tokens are needed.
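As a rough illustration of this design (the schemas, tool names, and dispatch helper below are hypothetical, not the released interface), each spatial or external tool can be described to an off-the-shelf tool-calling model as an ordinary JSON tool schema and routed through a small dispatch table, so a new capability is added by registering one more entry rather than by retraining:

# Hypothetical sketch: spatial and external tools exposed as plain JSON schemas.
import json

TOOL_SCHEMAS = [
    {
        "name": "search",
        "description": "Find objects in the 3D scene memory by open-vocabulary label.",
        "parameters": {"type": "object",
                       "properties": {"label": {"type": "string"}},
                       "required": ["label"]},
    },
    {
        "name": "distance",
        "description": "Euclidean distance in metres between two object ids.",
        "parameters": {"type": "object",
                       "properties": {"id_a": {"type": "integer"},
                                      "id_b": {"type": "integer"}},
                       "required": ["id_a", "id_b"]},
    },
    {
        # Example external tool (assumed for illustration, not from the paper).
        "name": "compliance_lookup",
        "description": "Query an external compliance database for a regulation.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]},
    },
]

def dispatch(tool_call, implementations):
    """Route one model-issued tool call to its Python implementation."""
    fn = implementations[tool_call["name"]]
    return fn(**tool_call["arguments"])

# The agent loop then alternates between (1) sending the query plus TOOL_SCHEMAS
# to the MLLM, (2) executing whatever calls it emits via `dispatch`, and
# (3) feeding the JSON results back until the model produces a final answer.
print(json.dumps(TOOL_SCHEMAS[0], indent=2))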

See it in action

Framework Overview

Flame3D framework overview: scene memory, tool hierarchy, and agentic reasoning loop.
When a complex natural-language query is received, an off-the-shelf tool-calling vision-language model breaks it down into a sequence of spatial and external tool calls that produce a grounded answer. The agent composes these inferences by interacting with the structured scene memory through a collection of spatial tools (search, distance, vicinity search, navigation distance, image retrieval, and code execution) and optional external tools (e.g., compliance databases, machine profiles, pricing catalogs, internet search). The scene memory is constructed from posed RGB-D frames and organized as a 3D spatial database.
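To make the interaction concrete, here is a minimal sketch of such a scene memory and a few of its spatial tools (the ObjectRecord fields, tool signatures, and toy scene are illustrative assumptions, not the released code), ending with a multi-hop query the agent could answer by chaining two tools:

# Hypothetical sketch: scene memory as a small spatial database of object records.
from dataclasses import dataclass, field
import math

@dataclass
class ObjectRecord:
    label: str                                       # open-vocabulary object label
    centroid: tuple                                  # (x, y, z) in scene coordinates, metres
    attributes: dict = field(default_factory=dict)   # e.g. colour, material

class SceneMemory:
    """Editable visual-textual scene memory built from posed RGB-D frames."""
    def __init__(self, objects):
        self.objects = list(objects)

    # --- spatial tools exposed to the agent ---------------------------------
    def search(self, label):
        """Return all objects whose label matches the query string."""
        return [o for o in self.objects if label.lower() in o.label.lower()]

    def distance(self, a, b):
        """Euclidean distance (metres) between two object centroids."""
        return math.dist(a.centroid, b.centroid)

    def vicinity_search(self, anchor, radius):
        """Objects within `radius` metres of an anchor object."""
        return [o for o in self.objects
                if o is not anchor and self.distance(anchor, o) <= radius]

# Example multi-hop query: "Which chair is closest to the door?"
memory = SceneMemory([
    ObjectRecord("door",  (0.0, 0.0, 1.0)),
    ObjectRecord("chair", (1.5, 0.2, 0.5), {"colour": "red"}),
    ObjectRecord("chair", (4.0, 3.0, 0.5), {"colour": "blue"}),
])
door = memory.search("door")[0]
closest = min(memory.search("chair"), key=lambda c: memory.distance(door, c))
print(closest.attributes["colour"], "chair")   # -> red chair

Because the tools operate on plain object records, new attributes (e.g., material or price) can be attached to the same records and queried without touching the model.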

Insights from our experiments

Genuine 3D understanding may not require training on 3D-language data.

Fixed toolsets are insufficient for open-ended 3D queries.

Visual input is often not required for spatial reasoning queries.

Sole reliance on language priors fails for multi-hop spatial reasoning queries.

The meta tool is surprisingly powerful but not sufficient on its own.

Scaling foundation models improves spatial reasoning.

Programs synthesized at inference time
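A free-space query such as "Is there room on the desk for a 30 cm x 30 cm object?" has no fixed tool, so the agent can write a small program and run it through its code-execution tool. The sketch below is hypothetical (the footprints, helper names, and grid-search strategy are illustrative assumptions, not a program the model actually produced):

# Hypothetical synthesized program: find free space on a desk for a new object.
# Axis-aligned footprints on the desk surface, (x_min, y_min, x_max, y_max) in metres.
desk = (0.0, 0.0, 1.6, 0.8)
items_on_desk = {
    "keyboard": (0.10, 0.05, 0.55, 0.25),
    "laptop":   (0.70, 0.10, 1.05, 0.40),
    "mug":      (1.20, 0.60, 1.30, 0.70),
}

def overlaps(a, b):
    """True if two axis-aligned boxes intersect."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def free_spot(size, region, obstacles, step=0.05):
    """Grid-search `region` for a placement of a (width, depth) box that avoids
    every obstacle footprint; returns the first valid footprint or None."""
    w, d = size
    x = region[0]
    while x + w <= region[2]:
        y = region[1]
        while y + d <= region[3]:
            candidate = (x, y, x + w, y + d)
            if all(not overlaps(candidate, box) for box in obstacles.values()):
                return candidate
            y += step
        x += step
    return None

spot = free_spot((0.30, 0.30), desk, items_on_desk)
if spot:
    print("Yes, there is room; one valid footprint is", spot)
else:
    print("No, the desk has no free 30 cm x 30 cm region")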

Ambiguity in the ScanQA ground truth

Our method often generates contextually valid answers that are nevertheless penalized due to ambiguity in the ScanQA ground truth.

Citation

@misc{bharadwaj2026flame3d,
      title={Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models}, 
      author={Sagar Bharadwaj and Ziyong Ma and Anurag Ghosh and Srinivasan Seshan and Anthony Rowe},
      year={2026},
      eprint={2605.09218},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.09218}, 
}