Abstract


Spatiotemporal reasoning is a fundamental capability for artificial intelligence (AI) in soft tissue surgery, paving the way for intelligent assistive systems and autonomous robotics. While 2D vision-language models show increasing promise at understanding surgical video, the spatial complexity of surgical scenes suggests that reasoning systems may benefit from explicit 4D representations. Here, we propose a framework for equipping surgical agents with spatiotemporal tools based on an explicit 4D representation, enabling AI systems to ground their natural language reasoning in both time and 3D space. Leveraging models for point tracking, depth, and segmentation, we develop a coherent 4D model with spatiotemporally consistent tool and tissue semantics. A Multimodal Large Language Model (MLLM) then acts as an agent on tools derived from the explicit 4D representation (e.g., trajectories) without any fine-tuning. We evaluate our method on a new dataset of 134 clinically relevant questions and find that the combination of a general-purpose reasoning backbone and our 4D representation significantly improves spatiotemporal understanding and allows for 4D grounding. We demonstrate that spatiotemporal intelligence can be “assembled” from 2D MLLMs and 3D computer vision models without additional training.

Method


Given a monocular 2D video sequence, our method creates a tracked semantic 4D representation and provides spatiotemporal information as tools to a 2D MLLM Agent. Upon receiving a query, the agent iteratively requests and interprets spatiotemporal information from our 4D representation to inform its response.
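To make the agent–representation interface concrete, here is a minimal sketch of the tool loop. The scene layout, node ids, and tool names (`node_distance`, `aggregated_node_movement`) are illustrative assumptions, not the paper's actual API; the real system derives such tools from the tracked semantic 4D representation.

```python
# Toy 4D representation: per-timestep 3D centroids for semantic nodes.
# Node ids and coordinates are hypothetical placeholders.
scene = {
    5: [(0.0, 0.0, 10.0), (0.1, -0.2, 10.1)],   # e.g., a tissue node
    10: [(0.1, 0.0, 10.0), (0.4, -0.1, 10.3)],  # e.g., a tool node
}

def node_distance(a, b, t):
    """Euclidean distance between two nodes at timestep t (in scene units)."""
    pa, pb = scene[a][t], scene[b][t]
    return sum((x - y) ** 2 for x, y in zip(pa, pb)) ** 0.5

def aggregated_node_movement(a, t0, t1):
    """Displacement vector of a node between timesteps t0 and t1."""
    p0, p1 = scene[a][t0], scene[a][t1]
    return tuple(x1 - x0 for x0, x1 in zip(p0, p1))

# The MLLM agent calls tools like these and reasons over the returned numbers:
print(round(node_distance(5, 10, 0), 3))
print(aggregated_node_movement(10, 0, 1))
```

The agent never sees the raw point cloud; it only receives such numeric tool responses, which it interprets in natural language to answer the query.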
Method Overview Image

Results


We evaluate on 134 clinically motivated queries that require spatiotemporal reasoning over tool–tissue interaction. In particular, we want to understand how tissue is manipulated, since decisions about grasping points, retraction directions, and dissection locations are key indicators of surgical intelligence. Tasks are spatial (where an interaction occurs in 3D at a given timestep), temporal (when a query holds), and directional (how motion evolves).

How to Navigate the 4D Scenes Below

Each row below shows the monocular input clip (left), the RGB 4D reconstruction (center), and the same geometry with semantic instance colors (right). The 4D viewers are interactive: pan by holding left-click and moving your cursor, translate the camera by holding left- and right-click, rotate around the camera axis by holding option, and zoom by scrolling. Press space to start and stop playback, and use the arrow keys to step frame by frame. You may also use the control bar to loop or scrub the timeline. Loading the Rerun files may take a moment.

Interactive Examples of Tracked 4D Reconstructions from Unposed 2D Monocular Video

Monocular video
RGB point cloud
Semantic instance colors
Monocular video
RGB point cloud
Semantic instance colors
Monocular video
RGB point cloud
Semantic instance colors

Spatial grounding

To perform spatial grounding in 3D, we pose a query for a specific timestep. The examples below show the corresponding frame on the left, the interactive 4D visualization for the whole sequence in the middle, and the 2D projection of the model's answer on the right, overlaid as a blue dot on the frame corresponding to the query. Note that the agent does not receive that frame and operates only on the 4D representation.
To interpret the 4D visualization correctly: at the queried timestep, the structures and interaction points that the agent requested via tool calls light up in blue and red, and their 3D state at that time remains frozen and visible as the sequence progresses. You may use the arrow keys or the timeline scrollbar to find the relevant timestep. For the spatial queries, only this state matters to the agent. Interaction points are highlighted with a red dot in 3D. We also project each point into the 2D frame, as shown in the images on the right, which allows for comparison with 2D baselines.
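Projecting a predicted 3D interaction point back onto the query frame can be sketched with the standard pinhole model. The intrinsic values below are illustrative placeholders, not the calibration used for these sequences.

```python
# Minimal pinhole projection for overlaying a 3D camera-frame point on the
# 2D query frame. fx, fy, cx, cy are hypothetical intrinsics.

def project_to_2d(point_3d, fx, fy, cx, cy):
    """Project a 3D camera-frame point (x, y, z) to pixel coordinates (u, v)."""
    x, y, z = point_3d
    u = fx * x / z + cx
    v = fy * y / z + cy
    return u, v

# Example: a point 10 units in front of the camera, slightly left and up.
u, v = project_to_2d((-1.0, -0.5, 10.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(u, v)  # -> 270.0 215.0
```

The resulting pixel coordinates are what is drawn as the blue dot in the images on the right.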

Where is the grasper in the back gripping the gallbladder?

Frame at query time
GT frame video24_09676
4D Visualization
Predicted 2D point
Prediction

Where is the L-hook coagulating the gallbladder?

Frame at query time
GT frame video12_19980
4D Visualization
Predicted 2D point
Prediction

Where is the grasper behind the L-hook holding the gallbladder?

Frame at query time
GT frame video12_19500
4D Visualization
Predicted 2D point
Prediction

Where does the grasper grip the gallbladder?

Frame at query time
GT frame video17_01803
4D Visualization
Predicted 2D point
Prediction

Where is the L-hook hooked into the connective tissue of the gallbladder?

Frame at query time
GT frame video55_00588
4D Visualization
Predicted 2D point
Prediction

Temporal grounding

For temporal grounding, we ask the agent to determine either the starting point or the duration of a queried event. All temporal queries require spatial reasoning as well, as the example below shows.
To interpret the visualization, the relevant structures the agent interacted with via tool calls are again highlighted in dark blue and red. Here, they were requested over the whole sequence, and distance and overlap scores were used. Refer to the partial reasoning trace below to see how the agent used that information to answer the query.

When is the L-hook touching the gallbladder?

Video
4D Visualization
[...] Okay, so the tool response shows the distances between node 10 (L-hook) and node 5 (Connective Tissue) for all timesteps. The distances are all very small, like 0.15 to 0.3 cm, which is close to zero. Since the coordinate system is in cm, a distance of 0.1 cm would mean they are touching or very close. [...]

The minimal distance values (0.1 cm) indicate contact. The distance starts increasing at t=14, so the contact likely stops around t=13. So the range is 0 to 13 inclusive. Let's check the exact values: at t=13, the distance is 0.1143, and t=14 is 0.2096. So the contact is up to t=13. So the answer should be [0, 13].
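The thresholding logic the agent applies in this trace can be sketched as follows. The 0.15 cm contact threshold and the example distance series are illustrative assumptions; the agent reasons about the threshold in natural language rather than using a fixed rule.

```python
# Sketch of the temporal-grounding step: threshold the per-timestep distances
# returned by the tool to find when two structures are in contact.

def contact_interval(distances, threshold=0.15):
    """Return the first contiguous [start, end] timestep range below threshold."""
    in_contact = [t for t, d in enumerate(distances) if d < threshold]
    if not in_contact:
        return None
    start = in_contact[0]
    end = start
    while end + 1 < len(distances) and distances[end + 1] < threshold:
        end += 1
    return [start, end]

# Distances shaped like the trace: ~0.11 cm up to t=13, then increasing.
dists = [0.11] * 14 + [0.2096, 0.25]
print(contact_interval(dists))  # -> [0, 13]
```

On this series the sketch recovers the same answer as the trace, [0, 13].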

Directional reasoning

Applying traction and counter-traction, that is, grasping and retracting tissue to create tension on the dissection plane, is a key surgical skill. Therefore, to understand subtle tool–tissue interactions, we explore reasoning about motions unfolding over time and space. We evaluate direct manipulation, such as a tool manipulating anatomy, but also the effect on secondary, surrounding tissue.
To interpret the 4D visualizations and reasoning traces correctly, note that each query applies to a specific time window and that the agent primarily uses tools to obtain 4D trajectories for that window. The agent operates directly on trajectories, but our output format is simplified to unit 3D direction vectors to allow fair comparison with 2D baselines that do not have access to explicit 3D coordinates. The start and end 3D points that delimit a trajectory, as computed by the tool, are visualized with start and end labels; from these labels, you can infer the time window the query refers to. You may also scrub the timeline and use the arrow keys to navigate the video for a closer look.
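The reduction from an aggregated 3D displacement to the per-axis direction vector seen in the answers below can be sketched like this. The deadband for ignoring jitter is an illustrative assumption; the coordinate convention follows the traces (y increases downward, z increases away from the camera).

```python
# Sketch: map an aggregated displacement from the trajectory tool to the
# per-axis {-1, 0, 1} direction vector reported in the directional answers.

def direction_signs(displacement, deadband=0.05):
    """Reduce a displacement (x, y, z) to per-axis signs, ignoring tiny motion."""
    def sign(v):
        if abs(v) < deadband:  # treat near-zero motion as no motion
            return 0
        return 1 if v > 0 else -1
    x, y, z = displacement
    return {"x": sign(x), "y": sign(y), "z": sign(z)}

# Example from the first directional trace (gallbladder, timesteps 0-9):
print(direction_signs((2.3809, -4.3444, 1.4213)))  # -> {'x': 1, 'y': -1, 'z': 1}
```

This matches the {x: 1, y: -1, z: 1} answer in the first trace below.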

In which direction is the grasper pulling the gallbladder?

Video
4D Visualization
[...] The user asked for the direction the grasper is pulling the gallbladder in timesteps 0-9. The aggregated movement of the gallbladder (node 5) from 0 to 9 was given as x: 2.3809, y: -4.3444, z: 1.4213. [...]
So the direction vector is {x: 1, y: -1, z: 1}.

In which direction is the liver moving?

Video
4D Visualization
[...] Okay, so I need to figure out the direction the liver is moving between timesteps 2 and 8. The liver has two nodes: 3 and 4. The aggregated movement for node 3 is x: -1.1117, y: 2.04, z: -1.9683. For node 4, it's x: -0.4594, y: 1.0699, z: -0.6127.
First, check the x-direction. Both have negative x movement. For node 3, -1.1117 (left), node 4: -0.4594 (left). So both moving left, so x direction is -1.
Y-direction: both have positive y (2.04 and 1.0699), which is downward (since y increases downward). So y direction is 1.
Z-direction: both have negative z (z is movement toward camera, since z increases away from camera). Node 3: -1.9683 (toward camera), node 4: -0.6127 (toward camera). So z direction is -1.
Both liver nodes are moving in the same direction for all axes. So the liver's movement direction is x: -1, y: 1, z: -1.

In which direction is the gallbladder moving?

Video
4D Visualization
[...] The tool response for aggregated_node_movement for node 4 (gallbladder) from timesteps 7 to 13 gives a movement vector of x: 1.5637, y: -4.6287, z: 1.2824. [...]
So the direction vector should be {"x": 1, "y": -1, "z": 1}.

Miscellaneous



Acknowledgements