Robotics
♻ ★ FIMD: Fast Isolated Marker Detection for UV-Based Visual Relative Localisation in Agile UAV Swarms
A novel approach for the fast onboard detection of isolated markers for
visual relative localisation of multiple teammates in agile UAV swarms is
introduced in this paper. As the detection forms a key component of real-time
localisation systems, a three-fold innovation is presented, consisting of an
optimised procedure for CPUs, a GPU shader program, and a functionally
equivalent FPGA streaming architecture. For the proposed CPU and GPU solutions,
the mean processing time per pixel of input camera frames was accelerated by
two to three orders of magnitude compared to the unoptimised
state-of-the-art approach. For the localisation task, the proposed FPGA
architecture offered the most significant overall acceleration by minimising
the total delay from camera exposure to detection results. Additionally, the
proposed solutions were evaluated on various 32-bit and 64-bit embedded
platforms to demonstrate their efficiency, as well as their feasibility for
applications using low-end UAVs and MAVs. The proposed detection pipeline thus
serves as a crucial enabling technology for agile UAV swarming.
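To make the task being accelerated concrete, the sketch below scans a grayscale frame for bright, isolated peaks. This is only a naive single-threaded Python illustration, not the paper's optimised CPU routine, GPU shader, or FPGA streaming architecture; the threshold and isolation radius are assumed values.

```python
# Naive reference for the detection task: scan a grayscale frame for
# bright pixels that are strict local maxima within an isolation
# window. Illustration only; not the paper's optimised pipelines.
import numpy as np

def detect_isolated_markers(frame, threshold=200, radius=5):
    """Return (row, col) of bright pixels that dominate their window."""
    markers = []
    h, w = frame.shape
    for y, x in zip(*np.nonzero(frame > threshold)):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        patch = frame[y0:y1, x0:x1]
        # keep strict local maxima only, so each marker is reported once
        if frame[y, x] == patch.max() and (patch == patch.max()).sum() == 1:
            markers.append((int(y), int(x)))
    return markers

# toy frame with two well-separated bright markers
frame = np.zeros((64, 64), dtype=np.uint8)
frame[10, 12] = 255
frame[40, 50] = 240
print(detect_isolated_markers(frame))  # [(10, 12), (40, 50)]
```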
♻ ★ MoTVLA: A Vision-Language-Action Model with Unified Fast-Slow Reasoning
Integrating vision-language instructions into visuomotor policies is gaining
momentum in robot learning for enhancing open-world generalization. Despite
promising advances, existing approaches face two challenges: limited language
steerability when no generated reasoning conditions the policy, and significant
inference latency when reasoning is incorporated. In this work, we introduce
MoTVLA, a mixture-of-transformers (MoT)-based vision-language-action (VLA)
model that integrates fast-slow unified reasoning with behavior policy
learning. MoTVLA preserves the general intelligence of pre-trained VLMs
(serving as the generalist) for tasks such as perception, scene understanding,
and semantic planning, while incorporating a domain expert, a second
transformer that shares knowledge with the pre-trained VLM, to generate
domain-specific fast reasoning (e.g., robot motion decomposition), thereby
improving policy execution efficiency. By conditioning the action expert on
decomposed motion instructions, MoTVLA can learn diverse behaviors and
substantially improve language steerability. Extensive evaluations across
natural language processing benchmarks, robotic simulation environments, and
real-world experiments confirm the superiority of MoTVLA in both fast-slow
reasoning and manipulation task performance.
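The fast-slow conditioning pattern the abstract describes can be sketched roughly as below: a deeper "generalist" encoder for slow reasoning, a shallow "domain expert" for fast motion decomposition, and an action head that cross-attends to the expert's features. All module sizes, depths, and the conditioning scheme are illustrative assumptions, not MoTVLA's actual architecture.

```python
# Toy PyTorch sketch of fast-slow reasoning feeding an action expert.
# Sizes and the conditioning scheme are assumptions, not MoTVLA itself.
import torch
import torch.nn as nn

class FastSlowPolicy(nn.Module):
    def __init__(self, d_model=256, n_heads=4, action_dim=7):
        super().__init__()
        slow_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        fast_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.generalist = nn.TransformerEncoder(slow_layer, num_layers=4)     # slow path
        self.domain_expert = nn.TransformerEncoder(fast_layer, num_layers=1)  # fast path
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, tokens, action_queries):
        slow_ctx = self.generalist(tokens)        # perception / semantic planning
        fast_ctx = self.domain_expert(slow_ctx)   # motion-decomposition features
        fused, _ = self.cross_attn(action_queries, fast_ctx, fast_ctx)
        return self.action_head(fused)            # one action per query

policy = FastSlowPolicy()
tokens = torch.randn(1, 32, 256)   # stand-in for fused vision+language tokens
queries = torch.randn(1, 8, 256)   # stand-in for learned action queries
print(policy(tokens, queries).shape)  # torch.Size([1, 8, 7])
```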
♻ ★ VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation
Zehao Ni, Yonghao He, Lingfeng Qian, Jilei Mao, Fa Fu, Wei Sui, Hu Su, Junran Peng, Zhipeng Wang, Bin He
In the context of imitation learning, visuomotor-based diffusion policy
learning is one of the main directions in robotic manipulation. Most of these
approaches rely on point clouds as observation inputs and construct scene
representations through point cloud feature learning, which enables them to
achieve remarkable accuracy. However, the existing literature lacks an in-depth
exploration of vision-only solutions that have significant potential. In this
paper, we propose a Vision-Only and single-view Diffusion Policy learning
method (VO-DP) that leverages pretrained visual foundation models to achieve
effective fusion of semantic and geometric features. We utilize intermediate
features from VGGT, incorporating semantic features from DINOv2 and geometric
features from Alternating Attention blocks. Features are fused via
cross-attention and spatially compressed with a CNN to form the input to the
policy head. Extensive experiments demonstrate that VO-DP not only outperforms
the vision-only baseline DP significantly but also exhibits distinct
performance trends against the point cloud-based method DP3: in simulation
tasks, VO-DP achieves an average success rate of 64.6%, on par with DP3
(64.0%) and far higher than DP (34.8%), while in real-world tasks it reaches
87.9%, outperforming both DP3 (67.5%) and DP (11.2%) by a notable margin.
Further
robustness evaluations confirm that VO-DP remains highly stable under varying
conditions including color, size, background, and lighting. Lastly, we
open-source a training library for robotic manipulation. Built on Accelerate,
this library supports multi-machine and multi-GPU parallel training, as well as
mixed precision training. It is compatible with visuomotor policies such as DP,
DP3 and VO-DP, and also supports the RoboTwin simulator.
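A toy sketch of the fusion step described above: semantic tokens cross-attend to geometric tokens, and a small CNN spatially compresses the fused map into a conditioning vector for the policy head. The dimensions, token grid, and two-layer compressor are assumptions for illustration, not VO-DP's published design.

```python
# Toy semantic-geometric fusion: cross-attention followed by CNN
# spatial compression. Shapes and layers are illustrative assumptions.
import torch
import torch.nn as nn

class SemGeoFusion(nn.Module):
    def __init__(self, d=128, heads=4, grid=16):
        super().__init__()
        self.grid = grid
        self.cross = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cnn = nn.Sequential(
            nn.Conv2d(d, d, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(d, d, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, sem, geo):
        fused, _ = self.cross(sem, geo, geo)     # semantic queries geometric
        b, n, d = fused.shape
        fmap = fused.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        return self.cnn(fmap).flatten(1)         # (B, d) condition vector

fusion = SemGeoFusion()
sem = torch.randn(2, 16 * 16, 128)  # stand-in for DINOv2-style semantic tokens
geo = torch.randn(2, 16 * 16, 128)  # stand-in for alternating-attention geometric tokens
print(fusion(sem, geo).shape)       # torch.Size([2, 128])
```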
♻ ★ Local Guidance for Configuration-Based Multi-Agent Pathfinding
Guidance is an emerging concept that improves the empirical performance of
real-time, sub-optimal multi-agent pathfinding (MAPF) methods. It offers
additional information to MAPF algorithms to mitigate congestion on a global
scale by considering the collective behavior of all agents across the entire
workspace. This global perspective helps reduce agents' waiting times, thereby
improving overall coordination efficiency. In contrast, this study explores an
alternative approach: providing local guidance in the vicinity of each agent.
While such localized methods involve recomputation as agents move and may
appear computationally demanding, we empirically demonstrate that supplying
informative spatiotemporal cues to the planner can significantly improve
solution quality without exceeding a moderate time budget. When applied to
LaCAM, a leading configuration-based solver, this form of guidance establishes
a new performance frontier for MAPF.
comment: 10 pages
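To make the idea of localized spatiotemporal cues concrete, the toy sketch below ranks a grid agent's candidate moves by goal distance plus a congestion penalty derived from nearby agents' upcoming cells. The penalty weight and the occupancy input are assumptions, and LaCAM's actual guidance mechanism differs.

```python
# Toy local guidance for grid MAPF: rank candidate moves by goal
# distance plus congestion counted only near the agent. Illustration
# only; the weight and inputs are assumed, not LaCAM's guidance.
def guided_move_order(pos, goal, nearby_occupied):
    """Return the 5 candidate moves (wait + 4-neighbours), best first.
    `nearby_occupied[cell]` counts how often agents within the local
    window occupy `cell` in upcoming timesteps (an assumed input the
    planner would maintain incrementally as agents move)."""
    y, x = pos
    candidates = [(y, x), (y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    def score(cell):
        manhattan = abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])
        congestion = nearby_occupied.get(cell, 0)
        return manhattan + 2 * congestion  # weight is an illustrative choice
    return sorted(candidates, key=score)

# toy query: the goal lies east, but the east cell is congested
order = guided_move_order(pos=(2, 2), goal=(2, 6), nearby_occupied={(2, 3): 2})
print(order[0])  # (2, 2): waiting beats stepping into the congested (2, 3)
```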