ARGUS - Adaptive Re-detection with Grounding for Unconstrained Robot Scenes
AI & Data
Semester programme: Enhanced AI Techniques
Research group: Sustainable Data & AI Application
Project group members: Tomov, Georgi G.T.
Iracá, Tim T.
Karaarslan, Güray M.G.
Rutjens, Tim T.G.S.
Project description
The project investigates how modern vision models can be combined into a robust object-centric perception pipeline for robot failure analysis. The main challenge is maintaining reliable object tracking in real-world robot recordings despite occlusion, tracking drift, changing object appearance, and detector uncertainty.
To address this, we designed ARGUS, a modular detection and tracking pipeline that automatically recovers from tracking failures through adaptive re-detection and tracker reinitialization.
Context
This project is situated in the field of robotics and computer vision. Robot failure analysis requires accurate understanding of object interactions during task execution. Existing frameworks such as REFLECT provide multimodal reasoning capabilities but are less suited for rapid experimentation with newer perception models.
ARGUS was developed to provide a flexible architecture where detection, tracking, validation, and scene understanding components can be independently replaced and evaluated. The pipeline processes RGB-D recordings of robot tasks and combines open-set object detection, object tracking, and validation mechanisms to maintain object awareness throughout an episode. This enables more reliable downstream analysis of robot behavior and failure causes.
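As an illustration of what independently replaceable components could look like, the following is a minimal sketch, not the ARGUS code itself. The names `Detection`, `Detector`, `Tracker`, `StubDetector`, `StubTracker`, and `run_episode` are hypothetical and chosen only for this example; the stubs stand in for real models such as Grounding DINO or a tracker.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Detection:
    """Hypothetical container for one detected or tracked object."""
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixels
    score: float


class Detector(Protocol):
    """Assumed interface: any object with this method can act as a detector."""
    def detect(self, frame: object, prompts: Sequence[str]) -> list[Detection]: ...


class Tracker(Protocol):
    """Assumed interface: any object with these methods can act as a tracker."""
    def reset(self, detections: Sequence[Detection]) -> None: ...
    def update(self, frame: object) -> list[Detection]: ...


class StubDetector:
    """Stand-in detector that returns one fixed box per text prompt."""
    def detect(self, frame, prompts):
        return [Detection(p, (0.0, 0.0, 10.0, 10.0), 0.9) for p in prompts]


class StubTracker:
    """Stand-in tracker that repeats the boxes it was last reset with."""
    def __init__(self):
        self._tracks = []

    def reset(self, detections):
        self._tracks = list(detections)

    def update(self, frame):
        return list(self._tracks)


def run_episode(frames, prompts, detector: Detector, tracker: Tracker):
    """Detect on the first frame, then track through the remaining frames."""
    results = []
    for index, frame in enumerate(frames):
        if index == 0:
            tracker.reset(detector.detect(frame, prompts))
        results.append(tracker.update(frame))
    return results


# Swapping a component means passing a different object with the same methods.
episode = run_episode([None, None, None], ["mug"], StubDetector(), StubTracker())
```

Because `run_episode` depends only on the two interfaces, a different detector or tracker can be evaluated without touching the loop.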
Results
The primary outcome of the project is a working perception pipeline that integrates Grounding DINO, YOLOE, BoT-SORT, and scene graphs into a modular framework. The system automatically detects target objects, tracks them throughout a robot task, and performs re-detection when tracking quality degrades.
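One way to decide when tracking quality has degraded enough to trigger re-detection is sketched below. This is an assumption for illustration only: the source does not state which signal or threshold ARGUS uses. The class `RedetectionPolicy`, the confidence threshold, and the patience counter are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RedetectionPolicy:
    """Hypothetical trigger: re-detect after several consecutive weak frames."""
    min_confidence: float = 0.4  # assumed threshold, not taken from ARGUS
    patience: int = 3            # assumed number of weak frames tolerated
    _bad_frames: int = field(default=0, init=False)

    def should_redetect(self, confidence: Optional[float]) -> bool:
        """Return True when the detector should be run again.

        `confidence` is the tracker's score for the current frame,
        or None when the tracker returned no box at all.
        """
        if confidence is None or confidence < self.min_confidence:
            self._bad_frames += 1
        else:
            self._bad_frames = 0
        if self._bad_frames >= self.patience:
            self._bad_frames = 0  # start counting afresh after a trigger
            return True
        return False


policy = RedetectionPolicy(min_confidence=0.4, patience=2)
scores = [0.9, 0.3, None, 0.8, 0.2, 0.9]
decisions = [policy.should_redetect(s) for s in scores]
```

The patience counter keeps a single noisy frame from forcing a costly detector call, at the price of reacting a few frames late.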
The project produced several reusable software modules, including object detection, tracking, validation, logging, and state management components. The pipeline generates annotated tracking videos, structured JSONL logs, and per-frame state snapshots that can be used by the depth and scene graph analysis stages and, further downstream, by a reasoning LLM.
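To make the JSONL log format concrete, here is a small sketch of writing and reading one record per frame. The field names (`frame`, `tracks`, `event`) and the helper functions are assumptions for illustration; the actual ARGUS log schema is not described in this summary.

```python
import io
import json


def write_frame_record(stream, frame_index, tracks, event=None):
    """Append one JSON object per line (hypothetical schema)."""
    record = {"frame": frame_index, "tracks": tracks, "event": event}
    stream.write(json.dumps(record) + "\n")


def read_records(stream):
    """Parse a JSONL stream back into a list of dictionaries."""
    return [json.loads(line) for line in stream if line.strip()]


# An in-memory buffer stands in for a log file on disk.
log = io.StringIO()
write_frame_record(log, 0, [{"label": "mug", "box": [12, 30, 80, 95]}])
write_frame_record(log, 1, [], event="redetect")
log.seek(0)
records = read_records(log)
```

Since every line is a complete JSON object, a downstream stage can process the log frame by frame without loading the whole episode.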
Validation demonstrated that the pipeline can successfully recover from common tracking failures such as object loss, bounding box drift, and large appearance changes. The modular design also enables straightforward evaluation of alternative detection and tracking approaches. The resulting system provides a practical foundation for future research and experimentation in robot failure analysis and object-centric perception.
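Bounding box drift can be checked, for example, by comparing the tracked box with a fresh detection using intersection over union (IoU). The sketch below assumes this approach; the function names and the 0.5 threshold are hypothetical and not taken from the ARGUS validation module.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def has_drifted(tracked_box, detected_box, min_iou=0.5):
    """Flag drift when the two boxes overlap less than the assumed threshold."""
    return iou(tracked_box, detected_box) < min_iou


same = has_drifted((0, 0, 10, 10), (0, 0, 10, 10))
shifted = has_drifted((0, 0, 10, 10), (5, 0, 15, 10))
```

When drift is flagged, the tracker could be reinitialized from the detected box.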