A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems

J. E. Domínguez-Vidal | April 1, 2026 | arXiv

Key Takeaway

A standardized ROS 2 wrapper lets Florence-2 be integrated into robot software stacks, enabling local vision-language inference on consumer GPUs without cloud dependencies.

Summary

This paper presents a ROS 2 software wrapper that integrates Florence-2, a vision-language model, into robotic systems for local inference.

applications, multimodal, efficiency

Key Terms

vision-language-model, ros-2, local-deployment, open-vocabulary-detection