Flame3D: Zero-shot Compositional Reasoning of 3D Scenes
with Agentic Language Models

Carnegie Mellon University
Flame3D answers compositional, multi-hop spatial queries about a 3D scene by composing chain-of-thought inferences over a structured visual-textual scene memory, a systematically designed set of spatial and visual tool calls, and external knowledge.
TL;DR

Flame3D achieves compositional reasoning by treating a 3D scene as an editable scene memory that is exposed to off-the-shelf MLLMs through spatio-visual tool calls. Rather than relying on a fixed set of operations, the agent synthesizes dynamic spatial programs at inference time to reason about free space, complex spatial relations, layouts, and hypothetical objects. New attributes and external knowledge can be integrated seamlessly without retraining, and no 3D-specific training or point-cloud tokens are needed.
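As a rough illustration of this design (the schemas, tool names, and dispatch helper below are hypothetical, not the released interface), each spatial or external tool can be described to an off-the-shelf tool-calling model as an ordinary JSON tool schema and routed through a small dispatch table, so a new capability is added by registering one more entry rather than by retraining:

# Hypothetical sketch: spatial and external tools exposed as plain JSON schemas.
import json

TOOL_SCHEMAS = [
    {
        "name": "search",
        "description": "Find objects in the 3D scene memory by open-vocabulary label.",
        "parameters": {"type": "object",
                       "properties": {"label": {"type": "string"}},
                       "required": ["label"]},
    },
    {
        "name": "distance",
        "description": "Euclidean distance in metres between two object ids.",
        "parameters": {"type": "object",
                       "properties": {"id_a": {"type": "integer"},
                                      "id_b": {"type": "integer"}},
                       "required": ["id_a", "id_b"]},
    },
    {
        # Example external tool (assumed for illustration, not from the paper).
        "name": "compliance_lookup",
        "description": "Query an external compliance database for a regulation.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]},
    },
]

def dispatch(tool_call, implementations):
    """Route one model-issued tool call to its Python implementation."""
    fn = implementations[tool_call["name"]]
    return fn(**tool_call["arguments"])

# The agent loop then alternates between (1) sending the query plus TOOL_SCHEMAS
# to the MLLM, (2) executing whatever calls it emits via `dispatch`, and
# (3) feeding the JSON results back until the model produces a final answer.
print(json.dumps(TOOL_SCHEMAS[0], indent=2))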

See it in action

Framework Overview

Flame3D framework overview: scene memory, tool hierarchy, and agentic reasoning loop.
When a complex natural-language query is received, an off-the-shelf tool-calling vision-language model breaks it down into a sequence of spatial and external tool calls that produce a grounded answer. The agent composes these inferences by interacting with the structured scene memory through a collection of spatial tools (search, distance, vicinity search, navigation distance, image retrieval, and code execution) and optional external tools (e.g., compliance databases, machine profiles, pricing catalogs, internet search). The scene memory is constructed from posed RGB-D frames and organized as a 3D spatial database.
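To make the interaction concrete, here is a minimal sketch of such a scene memory and a few of its spatial tools (the ObjectRecord fields, tool signatures, and toy scene are illustrative assumptions, not the released code), ending with a multi-hop query the agent could answer by chaining two tools:

# Hypothetical sketch: scene memory as a small spatial database of object records.
from dataclasses import dataclass, field
import math

@dataclass
class ObjectRecord:
    label: str                                       # open-vocabulary object label
    centroid: tuple                                  # (x, y, z) in scene coordinates, metres
    attributes: dict = field(default_factory=dict)   # e.g. colour, material

class SceneMemory:
    """Editable visual-textual scene memory built from posed RGB-D frames."""
    def __init__(self, objects):
        self.objects = list(objects)

    # --- spatial tools exposed to the agent ---------------------------------
    def search(self, label):
        """Return all objects whose label matches the query string."""
        return [o for o in self.objects if label.lower() in o.label.lower()]

    def distance(self, a, b):
        """Euclidean distance (metres) between two object centroids."""
        return math.dist(a.centroid, b.centroid)

    def vicinity_search(self, anchor, radius):
        """Objects within `radius` metres of an anchor object."""
        return [o for o in self.objects
                if o is not anchor and self.distance(anchor, o) <= radius]

# Example multi-hop query: "Which chair is closest to the door?"
memory = SceneMemory([
    ObjectRecord("door",  (0.0, 0.0, 1.0)),
    ObjectRecord("chair", (1.5, 0.2, 0.5), {"colour": "red"}),
    ObjectRecord("chair", (4.0, 3.0, 0.5), {"colour": "blue"}),
])
door = memory.search("door")[0]
closest = min(memory.search("chair"), key=lambda c: memory.distance(door, c))
print(closest.attributes["colour"], "chair")   # -> red chair

Because the tools operate on plain object records, new attributes (e.g., material or price) can be attached to the same records and queried without touching the model.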

Insights from our experiments

Genuine 3D understanding may not require training on 3D-language data.

Fixed toolsets are insufficient for open-ended 3D queries.

Visual input is often not required for spatial reasoning queries.

Sole reliance on language priors fails for multi-hop spatial reasoning queries.

The meta tool is surprisingly powerful but not sufficient on its own.

Scaling foundation models improves spatial reasoning.

Programs synthesized at inference time
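A free-space query such as "Is there room on the desk for a 30 cm x 30 cm object?" has no fixed tool, so the agent can write a small program and run it through its code-execution tool. The sketch below is hypothetical (the footprints, helper names, and grid-search strategy are illustrative assumptions, not a program the model actually produced):

# Hypothetical synthesized program: find free space on a desk for a new object.
# Axis-aligned footprints on the desk surface, (x_min, y_min, x_max, y_max) in metres.
desk = (0.0, 0.0, 1.6, 0.8)
items_on_desk = {
    "keyboard": (0.10, 0.05, 0.55, 0.25),
    "laptop":   (0.70, 0.10, 1.05, 0.40),
    "mug":      (1.20, 0.60, 1.30, 0.70),
}

def overlaps(a, b):
    """True if two axis-aligned boxes intersect."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def free_spot(size, region, obstacles, step=0.05):
    """Grid-search `region` for a placement of a (width, depth) box that avoids
    every obstacle footprint; returns the first valid footprint or None."""
    w, d = size
    x = region[0]
    while x + w <= region[2]:
        y = region[1]
        while y + d <= region[3]:
            candidate = (x, y, x + w, y + d)
            if all(not overlaps(candidate, box) for box in obstacles.values()):
                return candidate
            y += step
        x += step
    return None

spot = free_spot((0.30, 0.30), desk, items_on_desk)
if spot:
    print("Yes, there is room; one valid footprint is", spot)
else:
    print("No, the desk has no free 30 cm x 30 cm region")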

Ambiguity in the ScanQA ground truth

Our method often generates contextually valid answers that are nevertheless penalized due to ambiguity in the ScanQA ground truth.

Citation

@misc{bharadwaj2026flame3d,
      title={Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models}, 
      author={Sagar Bharadwaj and Ziyong Ma and Anurag Ghosh and Srinivasan Seshan and Anthony Rowe},
      year={2026},
      eprint={2605.09218},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.09218}, 
}