I am currently an Assistant Professor in the Department of Computer Science at the University of Liverpool. Prior to this, I worked as a Research Fellow in the Department of Computer Science at University College London, funded by the EU Horizon 2020 programme. I completed my PhD at the Hamlyn Centre, Imperial College London. I have also had the privilege of working as a Research Scientist at Reality Labs, Meta (Facebook).
I earned my BEng degree in Mechanical Engineering from the University of Birmingham, UK, in 2018, followed by an MRes degree in Medical Robotics and Image-Guided Intervention from Imperial College London, UK, in 2019.
PhD in AI, Computer Vision & Medical Robotics, 2023
Imperial College London
MRes in Medical Robotics and Image-Guided Intervention (with Distinction), 2019
Imperial College London
BEng in Mechanical Engineering (with Honours Class I), 2018
University of Birmingham
Language-driven grasp detection has the potential to revolutionize human-robot interaction by allowing robots to understand and execute grasping tasks based on natural language commands. In this paper, we introduce GraspMAS, a new multi-agent system framework for language-driven grasp detection. GraspMAS is designed to reason through ambiguities and improve decision-making in real-world scenarios.
Vision-language models have played a key role in extracting meaningful features for various robotic applications. In this paper, we introduce Robotic-CLIP to enhance robotic perception capabilities.
Accurate tracking of tissues and instruments in videos is crucial for Robotic-Assisted Minimally Invasive Surgery. We introduce a new annotated surgical tracking dataset for benchmarking tracking methods for surgical scenarios, comprising real-world surgical videos with complex tissue and instrument motions.
Accurate 3D reconstruction of dynamic surgical scenes from endoscopic video is essential for robotic-assisted surgery. We present SurgicalGS, a dynamic 3D Gaussian Splatting framework specifically designed for surgical scene reconstruction with improved geometric accuracy.
We present a novel approach for language-driven 6-DoF grasp detection in cluttered point clouds. We introduce Grasp-Anything-6D, a large-scale dataset, together with a diffusion model with negative prompt guidance that enables robots to grasp objects based on natural language commands, surpassing baselines in both benchmarks and real-world applications.
We present Grasp-Anything, a large-scale grasp detection dataset synthesized using foundation models to address the limited diversity of existing datasets. With 1M samples and over 3M objects, it enables zero-shot grasp detection and excels in vision-based and real-world robotic tasks. Code and data are available.
We propose a method for language-conditioned affordance detection and 6-DoF pose estimation in 3D point clouds, enabling robots to handle diverse affordances beyond predefined sets. Our approach features an open-vocabulary affordance detection branch and a language-guided diffusion model for pose generation. A new dataset supports the task, and experiments show significant performance improvements over baselines. The method demonstrates strong potential in real-world robotic applications.
We introduce an open-vocabulary affordance detection method for 3D point clouds, addressing the challenges of complex object shapes and diverse affordances. Using knowledge distillation and a novel text-point correlation approach, our method enhances feature extraction and semantic understanding. It outperforms baselines with a 7.96% mIoU improvement and supports real-time inference, ideal for robotic manipulation tasks.
Robot grasp detection is a complex challenge with significant industrial relevance. To address this, we present Grasp-Anything++, a new language-driven grasp detection dataset containing 1M samples, over 3M objects, and 10M grasping instructions. Leveraging foundation models, we frame grasp detection as a conditional generation task and propose a novel diffusion model-based method with a contrastive training objective to improve language-guided grasp pose detection. Our approach surpasses state-of-the-art methods, supports real-world robotic grasping, and enables zero-shot grasp detection. The dataset serves as a challenging benchmark, promoting advancements in language-driven robotic grasping research.
We propose a novel Residual Aligner-based Network (RAN) for deformable image registration, addressing challenges in capturing separate and sliding motions of organs. By introducing a Motion Separable backbone and a Residual Aligner module, RAN achieves state-of-the-art accuracy in unsupervised registration of abdominal and lung CT scans, with reduced model size and computational cost.
We propose a simple regression network to enhance intraoperative gamma activity visualization in endoscopic radio-guided cancer detection and resection. By leveraging high-dimensional image features and probe position data, our method effectively detects sensing areas, outperforming prior geometric approaches.