StereoMamba: Real-time and Robust Intraoperative Stereo Disparity Estimation via Long-range Spatial Dependencies

Stereo disparity estimation is crucial for obtaining depth information in robot-assisted minimally invasive surgery (RAMIS). We propose the StereoMamba architecture, which is specifically designed for stereo disparity estimation in RAMIS. Our approach is based on a novel Feature Extraction Mamba (FE-Mamba) module, which enhances long-range spatial dependencies both within and across stereo images.

Xu Wang, Jialang Xu, Shuai Zhang, Baoru Huang, Danail Stoyanov, Evangelos B Mazomenos

GraspMAS: Zero-Shot Language-Driven Grasp Detection with Multi-Agent System

Language-driven grasp detection has the potential to revolutionize human-robot interaction by allowing robots to understand and execute grasping tasks based on natural language commands. In this paper, we introduce GraspMAS, a new multi-agent system framework for language-driven grasp detection. GraspMAS is designed to reason through ambiguities and improve decision-making in real-world scenarios.

Quang Nguyen, Tri Le, Huy Nguyen, Thieu Vo, Tung D Ta, Baoru Huang, Minh N Vu, Anh Nguyen

SplineFormer: An Explainable Transformer-Based Approach for Autonomous Endovascular Navigation

Endovascular navigation is central to minimally invasive procedures, where precise control of curvilinear instruments such as guidewires is critical for successful interventions. We propose SplineFormer, a new transformer-based architecture designed specifically to predict the continuous, smooth shape of the guidewire in an explainable way.

Tudor Jianu, Shayan Doust, Mengyun Li, Baoru Huang, Tuong Do, Hoan Nguyen, Karl Bates, Tung D Ta, Sebastiano Fichera, Pierre Berthet-Rayne, Anh Nguyen

Robotic-CLIP: Fine-tuning CLIP on Action Data for Robotic Applications

Vision language models have played a key role in extracting meaningful features for various robotic applications. In this paper, we introduce Robotic-CLIP to enhance robotic perception capabilities.

Nghia Nguyen, Minh Nhat Vu, Tung D Ta, Baoru Huang, Thieu Vo, Ngan Le, Anh Nguyen

Tracking everything in robotic-assisted surgery

Accurate tracking of tissues and instruments in videos is crucial for Robotic-Assisted Minimally Invasive Surgery. We introduce a new annotated surgical tracking dataset for benchmarking tracking methods for surgical scenarios, comprising real-world surgical videos with complex tissue and instrument motions.

Bohan Zhan, Wang Zhao, Yi Fang, Bo Du, Francisco Vasconcelos, Danail Stoyanov, Daniel S Elson, Baoru Huang

FedEFM: Federated Endovascular Foundation Model with Unseen Data

In endovascular surgery, the precise identification of catheters and guidewires in X-ray images is essential for reducing intervention risks. This paper proposes a new method to train a foundation model in a decentralized federated learning setting for endovascular intervention.

Tuong Do, Nghia Vu, Tudor Jianu, Baoru Huang, Minh Vu, Jionglong Su, Erman Tjiputra, Quang D Tran, Te-Chuan Chiu, Anh Nguyen

Hybrid Deep Reinforcement Learning for Radio Tracer Localisation in Robotic-assisted Radioguided Surgery

This paper presents a learning-based method for autonomous radiotracer detection in robot-assisted surgery, in which the probe is navigated to the radioactive target. Real-world evaluation on the da Vinci Research Kit (dVRK) further confirms the feasibility of the approach, achieving an 80% success rate in radiotracer detection.

Hanyi Zhang, Kaizhong Deng, Zhaoyang Jacopo Hu, Baoru Huang, Daniel S Elson

SurgicalGS: Dynamic 3D Gaussian Splatting for accurate robotic-assisted surgical scene reconstruction

Accurate 3D reconstruction of dynamic surgical scenes from endoscopic video is essential for robotic-assisted surgery. We present SurgicalGS, a dynamic 3D Gaussian Splatting framework specifically designed for surgical scene reconstruction with improved geometric accuracy.

Jialei Chen, Xin Zhang, Mobarakol Islam, Francisco Vasconcelos, Danail Stoyanov, Daniel S Elson, Baoru Huang

HabiCrowd: A High Performance Simulator for Crowd-Aware Visual Navigation

We present HabiCrowd, a benchmark for crowd-aware visual navigation, integrating diverse human dynamics into photorealistic environments. HabiCrowd achieves state-of-the-art collision avoidance and superior computational efficiency, advancing studies in human-robot interaction and navigation.

An Dinh Vuong, Toan Tien Nguyen, Minh Nhat Vu, Baoru Huang, Dzung Nguyen, Huynh Thi Thanh Binh, Thieu Vo, Anh Nguyen

Language-driven Grasp Detection with Mask-guided Attention

We propose a novel method for language-driven grasp detection using mask-guided attention and transformer mechanisms with semantic segmentation features. By integrating visual data and natural language, our approach achieves a 10% improvement in grasp detection accuracy and excels in real-world robotic experiments.

Tuan Van Vo, Minh Nhat Vu, Baoru Huang, An Vuong, Ngan Le, Thieu Vo, Anh Nguyen
