1 Computer Aided Medical Procedures, TU Munich, Munich, Germany
2 TUM.ai, Munich, Germany
3 University Hospital of Ludwig Maximilian University (LMU) Munich, Munich, Germany
4 Munich Center for Machine Learning, Munich, Germany
How to Navigate the 4D Scenes Below
To perform spatial grounding in 3D, we pose a query for a specific timestep. The examples below show the corresponding frame on the left, the interactive 4D visualization of the whole sequence in the middle, and the 2D projection of the model's answer on the right, overlaid as a blue dot on the frame corresponding to the query. Note that the agent does not receive that frame and operates only on the 4D representation.
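As a rough illustration of how a 3D answer can be drawn as a dot on a 2D frame, the sketch below projects a camera-space point with a standard pinhole model. The pinhole model, the function name `project_point`, and the intrinsics (`fx`, `fy`, `cx`, `cy`) are assumptions made for illustration; the page does not state which camera model or calibration is used. The axis convention (x to the right, y downward, z away from the camera) follows the one described in the reasoning traces further down.

```python
from typing import Tuple


def project_point(
    point_cm: Tuple[float, float, float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Tuple[float, float]:
    """Project a camera-space 3D point onto the image plane (pinhole model).

    Assumed convention: x right, y down, z away from the camera.
    fx, fy are focal lengths in pixels; cx, cy is the principal point.
    All intrinsics here are hypothetical placeholders.
    """
    x, y, z = point_cm
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    u = fx * x / z + cx  # horizontal pixel coordinate
    v = fy * y / z + cy  # vertical pixel coordinate
    return u, v


# Illustrative values only, not taken from the dataset.
u, v = project_point((1.0, -2.0, 10.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
# (u, v) == (370.0, 140.0)
```

The resulting pixel coordinate is where a marker such as the blue dot would be drawn on the query frame.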
To interpret the 4D visualization correctly:
Where is the grasper in the back gripping the gallbladder?


Where is the L-hook coagulating the gallbladder?


Where is the grasper behind the L-hook holding the gallbladder?


Where does the grasper grip the gallbladder?


Where is the L-hook hooked into the connective tissue of the gallbladder?


For temporal grounding, we ask the agent to determine either the starting point or the duration of the queried event. All temporal queries require spatial reasoning as well, as shown in the example below.
To interpret the visualization:
When is the L-hook touching the gallbladder?
[...] Okay, so the tool response shows the distances between node 10 (L-hook) and node 5 (Connective Tissue) for all timesteps. The distances are all very small, like 0.15 to 0.3 cm, which is close to zero. Since the coordinate system is in cm, a distance of 0.1 cm would mean they are touching or very close. [...] The minimal distance values (0.1 cm) indicate contact. The distance starts increasing at t=14, so the contact likely stops around t=13. So the range is 0 to 13 inclusive. Let's check the exact values: at t=13, the distance is 0.1143, and t=14 is 0.2096. So the contact is up to t=13. So the answer should be [0, 13].
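The reasoning in this trace amounts to thresholding a per-timestep distance series and reading off the contiguous interval of contact. A minimal sketch of that idea follows; the function name `contact_interval`, the threshold of 0.15 cm, and the example distances are hypothetical and are not the agent's actual tool or data.

```python
from typing import Optional, Sequence, Tuple


def contact_interval(
    distances_cm: Sequence[float], threshold_cm: float
) -> Optional[Tuple[int, int]]:
    """Return the first contiguous run of timesteps (inclusive start and end)
    whose distance is at or below the threshold, or None if there is none."""
    start = None
    for t, d in enumerate(distances_cm):
        if d <= threshold_cm:
            if start is None:
                start = t  # contact begins
        elif start is not None:
            return (start, t - 1)  # contact ended at the previous timestep
    if start is not None:
        return (start, len(distances_cm) - 1)  # contact lasts until the end
    return None


# Synthetic distances for illustration (cm); the threshold is an assumption.
example = [0.11, 0.12, 0.10, 0.21, 0.25]
interval = contact_interval(example, threshold_cm=0.15)
# interval == (0, 2)
```

With the two values quoted in the trace (0.1143 cm at t=13 and 0.2096 cm at t=14), any threshold between them places the end of contact at t=13, which matches the agent's answer of [0, 13].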
Applying traction and counter-traction by grasping and retracting tissue to create tension on the dissection plane is a key surgical skill. Therefore, to understand subtle tool-tissue interactions, we explore reasoning about motions unfolding over time and space. We evaluate both direct manipulation, where a tool acts on anatomy, and its effect on secondary, surrounding tissue.
To interpret the 4D visualizations and reasoning traces correctly:
In which direction is the grasper pulling the gallbladder?
[...] The user asked for the direction the grasper is pulling the gallbladder in timesteps 0-9. The aggregated movement of the gallbladder (node 5) from 0 to 9 was given as x: 2.3809, y: -4.3444, z: 1.4213. [...]
So the direction vector is {x: 1, y: -1, z: 1}.
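The step from the aggregated movement to the direction vector is a per-axis sign. The sketch below reproduces it for the values quoted in the trace; the function name `direction_signs` and the optional dead zone `eps` are assumptions for illustration, not part of the agent's tooling.

```python
from typing import Dict


def direction_signs(movement: Dict[str, float], eps: float = 0.0) -> Dict[str, int]:
    """Map an aggregated displacement to a per-axis direction in {-1, 0, 1}.

    eps is a hypothetical dead zone: displacements with magnitude at or
    below eps are reported as 0 (no movement along that axis).
    """
    def sign(value: float) -> int:
        if value > eps:
            return 1
        if value < -eps:
            return -1
        return 0

    return {axis: sign(movement[axis]) for axis in ("x", "y", "z")}


# Aggregated gallbladder movement quoted in the trace above.
direction = direction_signs({"x": 2.3809, "y": -4.3444, "z": 1.4213})
# direction == {"x": 1, "y": -1, "z": 1}
```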
In which direction is the liver moving?
[...] Okay, so I need to figure out the direction the liver is moving between timesteps 2 and 8. The liver has two nodes: 3 and 4. The aggregated movement for node 3 is x: -1.1117, y: 2.04, z: -1.9683. For node 4, it's x: -0.4594, y: 1.0699, z: -0.6127. First, check the x-direction. Both have negative x movement. For node 3, -1.1117 (left), node 4: -0.4594 (left). So both moving left, so x direction is -1. Y-direction: both have positive y (2.04 and 1.0699), which is downward (since y increases downward). So y direction is 1. Z-direction: both have negative z (z is movement toward camera, since z increases away from camera). Node 3: -1.9683 (toward camera), node 4: -0.6127 (toward camera). So z direction is -1. Both liver nodes are moving in the same direction for all axes. So the liver's movement direction is x: -1, y: 1, z: -1.
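When an anatomy is represented by several nodes, as the liver is here, the trace checks that all nodes agree on each axis before committing to a direction. A minimal sketch of that consensus check, together with the axis naming used in the trace (negative x is left, positive y is down, negative z is toward the camera), is given below. The names `consensus_direction` and `describe`, and the choice to return 0 when nodes disagree, are assumptions for illustration.

```python
from typing import Dict, List, Sequence

# Axis naming as described in the trace; the opposite labels are the
# implied complements.
AXIS_LABELS = {
    "x": {-1: "left", 1: "right"},
    "y": {-1: "up", 1: "down"},
    "z": {-1: "toward camera", 1: "away from camera"},
}


def consensus_direction(movements: Sequence[Dict[str, float]]) -> Dict[str, int]:
    """Per-axis direction shared by all nodes; 0 if the nodes disagree
    (the disagreement rule is an assumption, not taken from the source)."""
    result = {}
    for axis in ("x", "y", "z"):
        signs = {(m[axis] > 0) - (m[axis] < 0) for m in movements}
        result[axis] = signs.pop() if len(signs) == 1 else 0
    return result


def describe(direction: Dict[str, int]) -> List[str]:
    """Translate a direction vector into human-readable labels."""
    return [AXIS_LABELS[a][s] for a, s in direction.items() if s != 0]


# Aggregated movement of the two liver nodes quoted in the trace above.
liver = [
    {"x": -1.1117, "y": 2.04, "z": -1.9683},   # node 3
    {"x": -0.4594, "y": 1.0699, "z": -0.6127},  # node 4
]
direction = consensus_direction(liver)
# direction == {"x": -1, "y": 1, "z": -1}
labels = describe(direction)
# labels == ["left", "down", "toward camera"]
```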
In which direction is the gallbladder moving?
[...] The tool response for aggregated_node_movement for node 4 (gallbladder) from timesteps 7 to 13 gives a movement vector of x: 1.5637, y: -4.6287, z: 1.2824. [...]
So the direction vector should be {"x": 1, "y": -1, "z": 1}.