@inproceedings{}
1 Computer Aided Medical Procedures, TU Munich, Munich, Germany
2 TUM.ai, Munich, Germany
3 University Hospital of Ludwig Maximilian University (LMU)
Munich, Munich, Germany
4 Munich Center for Machine Learning, Munich, Germany
Each row below shows the monocular input clip (left), the RGB 4D reconstruction (center), and the same geometry with semantic instance colors (right). The 4D viewers are interactive: pan by holding left-click and moving your cursor, translate the camera by holding left- and right-click, rotate around the camera axis by holding option, and zoom by scrolling. With space, you can start and stop playback. With the arrow keys, you can step frame by frame. You may also use the control bar to loop or scrub the timeline. Loading the Rerun files may take a moment.
To perform spatial grounding in 3D, we pose a query for a specific timestep. Below examples show the corresponding frame on the left, the interactive 4D visualization for the whole sequence in the middle, and the 2D projection of the model's answer on the right, overlayed as a blue dot on the frame corresponding to the query. Note that the agent does not receive that frame and only operates on the 4D representation.
To interpret the 4D visualization correctly: at the specific timestep, structures and interaction points that the agent requested via a tool call will light up in blue and red and their 3D state at that time will remain frozen and visible as the sequence progresses. You may use the arrow keys or the timeline scrollbar to easily find the relevant timestep. For the agent, only this state matters for the spatial queries. Interaction points are highlighted with a red dot in 3D. We also project the point into the 2D frame, as shown in the images on the right, which allows for comparison with 2D baselines.
Where is the grasper in the back gripping the gallbladder?


Where is the L-hook coagulating the gallbladder?


Where is the grasper behind the L-hook holding the gallbladder?


Where does the grasper grip the gallbladder?


Where is the L-hook hooked into the connective tissue of the gallbladder?


For temporal grounding, we ask the agent to either determine the starting point or the duration of a query. All temporal queries require spatial reasoning as well, as shown in the example below.
To interpret the visualization, the relevant structures the agent interacted with via tool calls are again highlighted in dark blue and red. Here, they were requested over the whole sequence and distance and overlap scores were used. Refer to the partial reasoning trace below to understand how the agent used that information to answer the query.
When is the L-hook touching the gallbladder?
[...] Okay, so the tool response shows the distances between node 10 (L-hook) and node 5 (Connective Tissue) for all timesteps. The distances are all very small, like 0.15 to 0.3 cm, which is close to zero. Since the coordinate system is in cm, a distance of 0.1 cm would mean they are touching or very close. [...] The minimal distance values (0.1 cm) indicate contact. The distance starts increasing at t=14, so the contact likely stops around t=13. So the range is 0 to 13 inclusive. Let's check the exact values: at t=13, the distance is 0.1143, and t=14 is 0.2096. So the contact is up to t=13. So the answer should be [0, 13].
It is a key surgical skill to apply traction and counter-traction by grasping and retracting tissue to create tension on the dissection plane. Therefore, to understand subtle tool-tissue interactions, we explore reasoning on motions unfolding over time and space. We evaluate on direct manipulation such as a tool manipulating anatomy, but also on the effect on secondary, surrounding tissue.
To interpret the 4D visualizations and reasoning traces correctly, beware that the queries apply to a specific time window and the agent primarly uses tools to get 4D trajectories for the relevant window. The agent operates directly on trajectories but our output format is simplified to unit 3D direction vectors to allow fair comparison with 2D baselines that do not have access to explicit 3D coordinates. Start and end 3D points that mark a trajectory as computed by the used tool are visualized with start and end labels. By the presence of the labels, you can infer the relevant time window that the query refers to. You may also scrub the timeline and use the arrow keys to navigate the video for a closer look.
In which direction is the grasper pulling the gallbladder?
[...] The user asked for the direction the grasper is pulling the gallbladder in timesteps 0-9. The aggregated movement of the gallbladder (node 5) from 0 to 9 was given as x: 2.3809, y: -4.3444, z: 1.4213. [...]
So the direction vector is {x: 1, y: -1, z: 1}.
In which direction is the liver moving?
[...] Okay, so I need to figure out the direction the liver is moving between timesteps 2 and 8. The liver has two nodes: 3 and 4. The aggregated movement for node 3 is x: -1.1117, y: 2.04, z: -1.9683. For node 4, it's x: -0.4594, y: 1.0699, z: -0.6127. First, check the x-direction. Both have negative x movement. For node 3, -1.1117 (left), node 4: -0.4594 (left). So both moving left, so x direction is -1. Y-direction: both have positive y (2.04 and 1.0699), which is downward (since y increases downward). So y direction is 1. Z-direction: both have negative z (z is movement toward camera, since z increases away from camera). Node 3: -1.9683 (toward camera), node 4: -0.6127 (toward camera). So z direction is -1. Both liver nodes are moving in the same direction for all axes. So the liver's movement direction is x: -1, y: 1, z: -1.
In which direction is the gallbladder moving?
[...] The tool response for aggregated_node_movement for node 4 (gallbladder) from timesteps 7 to 13 gives a movement vector of x: 1.5637, y: -4.6287, z: 1.2824. [...]
So the direction vector should be {"x": 1, "y": -1, "z": 1}.