Robotics
★ Lightning Grasp: High Performance Procedural Grasp Synthesis with Contact Fields
Despite years of research, real-time diverse grasp synthesis for dexterous
hands remains an unsolved core challenge in robotics and computer graphics. We
present Lightning Grasp, a novel high-performance procedural grasp synthesis
algorithm that achieves orders-of-magnitude speedups over state-of-the-art
approaches, while enabling unsupervised grasp generation for irregular,
tool-like objects. The method avoids many limitations of prior approaches, such
as the need for carefully tuned energy functions and sensitive initialization.
This breakthrough is driven by a key insight: decoupling complex geometric
computation from the search process via a simple, efficient data structure -
the Contact Field. This abstraction collapses the problem complexity, enabling
a procedural search at unprecedented speeds. We open-source our system to
propel further innovation in robotic manipulation.
comment: Code: https://github.com/zhaohengyin/lightning-grasp
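The abstract names the Contact Field but does not define it; the sketch below
is therefore only a hypothetical illustration of the stated decoupling idea:
contact geometry is precomputed once into a queryable structure so the
procedural search never touches the raw mesh in its inner loop. All names here
are assumptions, not the authors' API.

    import numpy as np

    class ContactField:
        def __init__(self, surface_points, surface_normals):
            # Precomputed once per object: candidate contact locations
            # and their outward unit normals.
            self.points = np.asarray(surface_points)    # (N, 3)
            self.normals = np.asarray(surface_normals)  # (N, 3)

        def query(self, fingertip_pos, max_dist=0.01):
            # Cheap lookup of contactable surface near a fingertip; a real
            # implementation would use a spatial hash or KD-tree, not O(N).
            d = np.linalg.norm(self.points - fingertip_pos, axis=1)
            idx = np.flatnonzero(d < max_dist)
            return self.points[idx], self.normals[idx]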
★ Robot Learning from a Physical World Model
Jiageng Mao, Sicheng He, Hao-Ning Wu, Yang You, Shuyang Sun, Zhicheng Wang, Yanan Bao, Huizhong Chen, Leonidas Guibas, Vitor Guizilini, Howard Zhou, Yue Wang
We introduce PhysWorld, a framework that enables robot learning from video
generation through physical world modeling. Recent video generation models can
synthesize photorealistic visual demonstrations from language commands and
images, offering a powerful yet underexplored source of training signals for
robotics. However, directly retargeting pixel motions from generated videos to
robots neglects physics, often resulting in inaccurate manipulations. PhysWorld
addresses this limitation by coupling video generation with physical world
reconstruction. Given a single image and a task command, our method generates
task-conditioned videos, reconstructs the underlying physical world from
them, and grounds the generated video motions into physically accurate
actions through object-centric residual reinforcement learning with the
physical world model. This synergy transforms implicit visual guidance into
physically executable robotic trajectories, eliminating the need for real robot
data collection and enabling zero-shot generalizable robotic manipulation.
Experiments on diverse real-world tasks demonstrate that PhysWorld
substantially improves manipulation accuracy compared to previous approaches.
Visit the project webpage (https://pointscoder.github.io/PhysWorld_Web/) for
details.
comment: Project page: https://pointscoder.github.io/PhysWorld_Web/
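The abstract's object-centric residual RL admits a compact reading: the
retargeted video motion supplies a base action, and a learned residual
corrects it inside the reconstructed physics world. A minimal sketch under
that reading, with a gym-style step() and all names (policy, world,
video_traj) assumed for illustration:

    def residual_rollout(policy, world, video_traj):
        obs = world.reset()
        for t in range(len(video_traj)):
            a_base = video_traj[t]          # motion retargeted from the video
            a_res = policy(obs)             # learned physics-aware correction
            obs, reward, done, info = world.step(a_base + a_res)
            if done:
                break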
★ TwinOR: Photorealistic Digital Twins of Dynamic Operating Rooms for Embodied AI Research
Han Zhang, Yiqing Shen, Roger D. Soberanis-Mukul, Ankita Ghosh, Hao Ding, Lalithkumar Seenivasan, Jose L. Porras, Zhekai Mao, Chenjia Li, Wenjie Xiao, Lonny Yarmus, Angela Christine Argento, Masaru Ishii, Mathias Unberath
Developing embodied AI for intelligent surgical systems requires safe,
controllable environments for continual learning and evaluation. However,
safety regulations and operational constraints in operating rooms (ORs) limit
embodied agents from freely perceiving and interacting in realistic settings.
Digital twins provide high-fidelity, risk-free environments for exploration and
training. However, how to create photorealistic and dynamic digital
representations of ORs that capture the relevant spatial, visual, and
behavioral complexity remains unclear. We introduce TwinOR, a framework for
constructing photorealistic,
dynamic digital twins of ORs for embodied AI research. The system reconstructs
static geometry from pre-scan videos and continuously models human and
equipment motion through multi-view perception of OR activities. The static and
dynamic components are fused into an immersive 3D environment that supports
controllable simulation and embodied exploration. The proposed framework
reconstructs complete OR geometry with centimeter-level accuracy while
preserving dynamic interaction across surgical workflows, enabling realistic
renderings and a virtual playground for embodied AI systems. In our
experiments, TwinOR simulates stereo and monocular sensor streams for geometry
understanding and visual localization tasks. Models such as FoundationStereo
and ORB-SLAM3 on TwinOR-synthesized data achieve performance within their
reported accuracy on real indoor datasets, demonstrating that TwinOR provides
sensor-level realism sufficient for perception and localization challenges. By
establishing a real-to-sim pipeline for constructing dynamic, photorealistic
digital twins of OR environments, TwinOR enables the safe, scalable, and
data-efficient development and benchmarking of embodied AI, ultimately
accelerating the transfer of embodied AI from simulation to the real world.
★ Using Vision Language Models as Closed-Loop Symbolic Planners for Robotic Applications: A Control-Theoretic Perspective
Large Language Models (LLMs) and Vision Language Models (VLMs) have been
widely used for embodied symbolic planning. Yet, how to effectively use these
models for closed-loop symbolic planning remains largely unexplored. Because
they operate as black boxes, LLMs and VLMs can produce unpredictable or costly
errors, making their use in high-level robotic planning especially challenging.
In this work, we investigate how to use VLMs as closed-loop symbolic planners
for robotic applications from a control-theoretic perspective. Concretely, we
study how the control horizon and warm-starting impact the performance of VLM
symbolic planners. We design and conduct controlled experiments to gain
insights that are broadly applicable to utilizing VLMs as closed-loop symbolic
planners, and we discuss recommendations that can help improve the performance
of VLM symbolic planners.
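The two knobs studied, control horizon and warm-starting, fit in a few lines
of receding-horizon pseudocode. vlm_plan, execute, and observe are
hypothetical stand-ins for illustration, not a specific model API:

    def closed_loop(vlm_plan, execute, observe, goal, H=3, warm_start=True):
        prev_plan = None
        while True:
            obs = observe()
            plan = vlm_plan(obs, goal, hint=prev_plan if warm_start else None)
            if not plan:
                return False
            for action in plan[:H]:        # execute only the first H steps
                if execute(action):        # True once the goal is reached
                    return True
            prev_plan = plan[H:]           # unexecuted suffix seeds the next query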
★ Unified Humanoid Fall-Safety Policy from a Few Demonstrations
Falling is an inherent risk of humanoid mobility. Maintaining stability is
thus a primary safety focus in robot control and learning, yet no existing
approach fully averts loss of balance. When instability does occur, prior work
addresses only isolated aspects of falling: avoiding falls, choreographing a
controlled descent, or standing up afterward. Consequently, humanoid robots
lack integrated strategies for impact mitigation and prompt recovery when real
falls defy these scripts. We aim to go beyond keeping balance to make the
entire fall-and-recovery process safe and autonomous: prevent falls when
possible, reduce impact when unavoidable, and stand up when fallen. By fusing
sparse human demonstrations with reinforcement learning and an adaptive
diffusion-based memory of safe reactions, we learn adaptive whole-body
behaviors that unify fall prevention, impact mitigation, and rapid recovery in
one policy. Experiments in simulation and on a Unitree G1 demonstrate robust
sim-to-real transfer, lower impact forces, and consistently fast recovery
across diverse disturbances, pointing towards safer, more resilient humanoids
in real environments. Videos are available at https://firm2025.github.io/.
★ Residual Rotation Correction using Tactile Equivariance
Visuotactile policy learning augments vision-only policies with tactile
input, facilitating contact-rich manipulation. However, the high cost of
tactile data collection makes sample efficiency the key requirement for
developing visuotactile policies. We present EquiTac, a framework that exploits
the inherent SO(2) symmetry of in-hand object rotation to improve sample
efficiency and generalization for visuotactile policy learning. EquiTac first
reconstructs surface normals from raw RGB inputs of vision-based tactile
sensors, so rotations of the normal vector field correspond to in-hand object
rotations. An SO(2)-equivariant network then predicts a residual rotation
action that augments a base visuomotor policy at test time, enabling real-time
rotation correction without additional reorientation demonstrations. On a real
robot, EquiTac achieves robust zero-shot generalization to unseen in-hand
orientations from very few training samples, where baselines fail even
with more training data. To our knowledge, this is the first tactile learning
method to explicitly encode tactile equivariance for policy learning, yielding
a lightweight, symmetry-aware module that improves reliability in contact-rich
tasks.
comment: 8 pages
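The symmetry being exploited is easy to state: an in-hand rotation of the
object rotates the tangential part of the tactile normal field by the same
SO(2) angle. The toy closed-form estimator below illustrates that invariance
only; the actual method uses an SO(2)-equivariant network rather than this
hand-written average:

    import numpy as np

    def mean_orientation(normals_xy):
        # normals_xy: (H, W, 2) tangential components of surface normals.
        # Doubling the angle makes the mean robust to 180-degree flips.
        ang = 2.0 * np.arctan2(normals_xy[..., 1], normals_xy[..., 0])
        return 0.5 * np.arctan2(np.sin(ang).mean(), np.cos(ang).mean())

    def residual_rotation(normals_now, normals_ref):
        # Residual action: rotate the gripper to undo the estimated slip.
        return mean_orientation(normals_now) - mean_orientation(normals_ref)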
★ Real-Time LiDAR Super-Resolution via Frequency-Aware Multi-Scale Fusion
LiDAR super-resolution addresses the challenge of achieving high-quality 3D
perception from cost-effective, low-resolution sensors. While recent
transformer-based approaches like TULIP show promise, they remain limited to
spatial-domain processing with restricted receptive fields. We introduce FLASH
(Frequency-aware LiDAR Adaptive Super-resolution with Hierarchical fusion), a
novel framework that overcomes these limitations through dual-domain
processing. FLASH integrates two key innovations: (i) Frequency-Aware Window
Attention that combines local spatial attention with global frequency-domain
analysis via FFT, capturing both fine-grained geometry and periodic scanning
patterns at log-linear complexity. (ii) Adaptive Multi-Scale Fusion that
replaces conventional skip connections with learned position-specific feature
aggregation, enhanced by CBAM attention for dynamic feature selection.
Extensive experiments on KITTI demonstrate that FLASH achieves state-of-the-art
performance across all evaluation metrics, surpassing even uncertainty-enhanced
baselines that require multiple forward passes. Notably, FLASH outperforms
TULIP with Monte Carlo Dropout while maintaining single-pass efficiency, which
enables real-time deployment. The consistent superiority across all distance
ranges validates that our dual-domain approach effectively handles uncertainty
through architectural design rather than computationally expensive stochastic
inference, making it practical for autonomous systems.
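The dual-domain recipe (a local spatial branch plus a global FFT branch) can
be sketched generically. The block below is a GFNet-style global filter
paired with a convolution: an assumption about the flavor of the design, not
FLASH's actual module, and the window-attention branch is reduced to a conv
for brevity:

    import torch
    import torch.nn as nn

    class DualDomainBlock(nn.Module):
        def __init__(self, channels, height, width):
            super().__init__()
            self.local = nn.Conv2d(channels, channels, 3, padding=1)
            # Learnable complex filter over the rfft2 half-spectrum.
            self.freq_w = nn.Parameter(
                torch.randn(channels, height, width // 2 + 1, 2) * 0.02)

        def forward(self, x):                        # x: (B, C, H, W)
            local = self.local(x)                    # fine-grained geometry
            spec = torch.fft.rfft2(x, norm="ortho")  # global receptive field
            spec = spec * torch.view_as_complex(self.freq_w)
            glob = torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
            return x + local + glob                  # residual fusion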
★ Exact Smooth Reformulations for Trajectory Optimization Under Signal Temporal Logic Specifications
We study motion planning under Signal Temporal Logic (STL), a useful
formalism for specifying spatial-temporal requirements. We pose STL synthesis
as a trajectory optimization problem leveraging the STL robustness semantics.
To obtain a differentiable problem without approximation error, we introduce an
exact reformulation of the max and min operators. The resulting method is
exact, smooth, and sound. We validate it in numerical simulations,
demonstrating its practical performance.
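The abstract does not spell the reformulation out, so for context: the usual
smooth surrogate for the STL max is log-sum-exp, which is differentiable but
only approximate, and its gap is exactly what an error-free reformulation
removes (min follows from $\min_i a_i = -\max_i(-a_i)$):

    % standard smooth *approximation*, shown for contrast:
    \max_i a_i \;\le\; \frac{1}{\beta}\log\sum_{i=1}^{m} e^{\beta a_i}
    \;\le\; \max_i a_i + \frac{\log m}{\beta}, \qquad \beta > 0.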
★ PlanT 2.0: Exposing Biases and Structural Flaws in Closed-Loop Driving
Most recent work in autonomous driving has prioritized benchmark performance
and methodological innovation over in-depth analysis of model failures, biases,
and shortcut learning. This has led to incremental improvements without a deep
understanding of the current failures. While it is straightforward to look at
situations where the model fails, it is hard to understand the underlying
reason. This motivates us to conduct a systematic study, where inputs to the
model are perturbed and the predictions observed. We introduce PlanT 2.0, a
lightweight, object-centric planning transformer designed for autonomous
driving research in CARLA. The object-level representation enables controlled
analysis, as the input can be easily perturbed (e.g., by changing the location
or adding or removing certain objects), in contrast to sensor-based models. To
tackle the scenarios newly introduced by the challenging CARLA Leaderboard 2.0,
we introduce multiple upgrades to PlanT, achieving state-of-the-art performance
on Longest6 v2, Bench2Drive, and the CARLA validation routes. Our analysis
exposes revealing failure modes, such as a lack of scene understanding caused by
low obstacle diversity, rigid expert behaviors leading to exploitable
shortcuts, and overfitting to a fixed set of expert trajectories. Based on
these findings, we argue for a shift toward data-centric development, with a
focus on richer, more robust, and less biased datasets. We open-source our code
and model at https://github.com/autonomousvision/plant2.
★ Robotic versus Human Teleoperation for Remote Ultrasound
Diagnostic medical ultrasound is widely used, safe, and relatively low cost
but requires a high degree of expertise to acquire and interpret the images.
Personnel with this expertise are often not available outside of larger cities,
leading to difficult, costly travel and long wait times for rural populations.
To address this issue, tele-ultrasound techniques are being developed,
including robotic teleoperation and recently human teleoperation, in which a
novice user is remotely guided in a hand-over-hand manner through mixed reality
to perform an ultrasound exam. These methods have not been compared, and their
relative strengths are unknown. Human teleoperation may be more practical than
robotics for small communities due to its lower cost and complexity, but this
is only relevant if the performance is comparable. This paper therefore
evaluates the differences between human and robotic teleoperation, examining
practical aspects such as setup time and flexibility and experimentally
comparing performance metrics such as completion time, position tracking, and
force consistency. It is found that human teleoperation does not lead to
statistically significant differences in completion time or position accuracy,
with mean differences of 1.8% and 0.5%, respectively, and provides more
consistent force application, while also being substantially more practical
and accessible.
comment: Under review at IEEE TMRB. Extended version of a paper presented at
the Hamlyn Symposium for Medical Robotics, 2025
★ Automated Generation of Continuous-Space Roadmaps for Routing Mobile Robot Fleets
Efficient routing of mobile robot fleets is crucial in intralogistics, where
delays and deadlocks can substantially reduce system throughput. Roadmap
design, specifying feasible transport routes, directly affects fleet
coordination and computational performance. Existing approaches are either
grid-based, compromising geometric precision, or continuous-space methods
that disregard practical constraints. This paper presents an automated roadmap
generation approach that bridges this gap by operating in continuous space,
integrating station-to-station transport demand and enforcing minimum distance
constraints for nodes and edges. By combining free space discretization,
transport demand-driven $K$-shortest-path optimization, and path smoothing, the
approach produces roadmaps tailored to intralogistics applications. Evaluation
across multiple intralogistics use cases demonstrates that the proposed
approach consistently outperforms established baselines (4-connected grid,
8-connected grid, and random sampling), achieving lower structural complexity,
higher redundancy, and near-optimal path lengths, enabling efficient and robust
routing of mobile robot fleets.
comment: submitted to the IEEE for possible publication; 8 pages, 6 figures, 2
tables
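The demand-driven $K$-shortest-path step maps directly onto Yen's algorithm;
a minimal sketch using networkx's shortest_simple_paths (which enumerates
simple paths in order of increasing weight), with the candidate roadmap graph
and demand pairs as assumed inputs:

    from itertools import islice
    import networkx as nx

    def demand_routes(G, demands, K=3, weight="weight"):
        # demands: iterable of (station_a, station_b) transport-demand pairs.
        routes = {}
        for a, b in demands:
            paths = nx.shortest_simple_paths(G, a, b, weight=weight)
            routes[(a, b)] = list(islice(paths, K))  # K alternatives per pair
        return routes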
★ Dynamics-Decoupled Trajectory Alignment for Sim-to-Real Transfer in Reinforcement Learning for Autonomous Driving
Reinforcement learning (RL) has shown promise in robotics, but deploying RL
on real vehicles remains challenging due to the complexity of vehicle dynamics
and the mismatch between simulation and reality. Factors such as tire
characteristics, road surface conditions, aerodynamic disturbances, and vehicle
load make it infeasible to model real-world dynamics accurately, which hinders
direct transfer of RL agents trained in simulation. In this paper, we present a
framework that decouples motion planning from vehicle control through a spatial
and temporal alignment strategy between a virtual vehicle and the real system.
An RL agent is first trained in simulation using a kinematic bicycle model to
output continuous control actions. Its behavior is then distilled into a
trajectory-predicting agent that generates finite-horizon ego-vehicle
trajectories, enabling synchronization between virtual and real vehicles. At
deployment, a Stanley controller governs lateral dynamics, while longitudinal
alignment is maintained through adaptive update mechanisms that compensate for
deviations between virtual and real trajectories. We validate our approach on a
real vehicle and demonstrate that the proposed alignment strategy enables
robust zero-shot transfer of RL-based motion planning from simulation to
reality, successfully decoupling high-level trajectory generation from
low-level vehicle control.
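Both vehicle-side components named here are textbook objects. A minimal
sketch of the Stanley lateral law and the kinematic bicycle model used for
training, with gain, wheelbase, and step size as illustrative values rather
than the paper's:

    import numpy as np

    def stanley_steer(heading_err, cross_track_err, speed, k=0.5, eps=1e-3):
        # heading_err: path yaw minus vehicle yaw (rad);
        # cross_track_err: signed lateral offset to the reference path (m).
        return heading_err + np.arctan2(k * cross_track_err, speed + eps)

    def bicycle_step(x, y, yaw, v, steer, accel, L=2.7, dt=0.05):
        # Kinematic bicycle model with wheelbase L, Euler-integrated.
        x += v * np.cos(yaw) * dt
        y += v * np.sin(yaw) * dt
        yaw += v / L * np.tan(steer) * dt
        v += accel * dt
        return x, y, yaw, v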
★ HDCNet: A Hybrid Depth Completion Network for Grasping Transparent and Reflective Objects
Depth perception of transparent and reflective objects has long been a
critical challenge in robotic manipulation. Conventional depth sensors often
fail to provide reliable measurements on such surfaces, limiting the
performance of robots in perception and grasping tasks. To address this issue,
we propose a novel depth completion network, HDCNet, which integrates the
complementary strengths of Transformer, CNN, and Mamba architectures.
Specifically, the encoder is designed as a dual-branch Transformer-CNN
framework to extract modality-specific features. At the shallow layers of the
encoder, we introduce a lightweight multimodal fusion module to effectively
integrate low-level features. At the network bottleneck, a Transformer-Mamba
hybrid fusion module is developed to achieve deep integration of high-level
semantic and global contextual information, significantly enhancing depth
completion accuracy and robustness. Extensive evaluations on multiple public
datasets demonstrate that HDCNet achieves state-of-the-art (SOTA) performance
in depth completion tasks. Furthermore, robotic grasping experiments show that
HDCNet substantially improves grasp success rates for transparent and
reflective objects, achieving up to a 60% increase.
★ Multi-Agent Reinforcement Learning for Deadlock Handling among Autonomous Mobile Robots
This dissertation explores the application of multi-agent reinforcement
learning (MARL) for handling deadlocks in intralogistics systems that rely on
autonomous mobile robots (AMRs). AMRs enhance operational flexibility but also
increase the risk of deadlocks, which degrade system throughput and
reliability. Existing approaches often neglect deadlock handling in the
planning phase and rely on rigid control rules that cannot adapt to dynamic
operational conditions.
To address these shortcomings, this work develops a structured methodology
for integrating MARL into logistics planning and operational control. It
introduces reference models that explicitly consider deadlock-capable
multi-agent pathfinding (MAPF) problems, enabling systematic evaluation of MARL
strategies. Using grid-based environments and an external simulation software,
the study compares traditional deadlock handling strategies with MARL-based
solutions, focusing on PPO and IMPALA algorithms under different training and
execution modes.
Findings reveal that MARL-based strategies, particularly when combined with
centralized training and decentralized execution (CTDE), outperform rule-based
methods in complex, congested environments. In simpler environments or those
with ample spatial freedom, rule-based methods remain competitive due to their
lower computational demands. These results highlight that MARL provides a
flexible and scalable solution for deadlock handling in dynamic intralogistics
scenarios, but requires careful tailoring to the operational context.
comment: for associated repositories, see
https://github.com/Nerozud/dl_reference_models and
https://github.com/Nerozud/FTS_simpel
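The CTDE pattern the findings hinge on is compact: actors run on local
observations at execution time, while a centralized critic sees the joint
state during training only. Minimal illustrative torch modules, not the exact
PPO/IMPALA configurations evaluated:

    import torch.nn as nn

    class Actor(nn.Module):                   # decentralized at execution
        def __init__(self, local_obs_dim, n_actions):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(local_obs_dim, 64), nn.Tanh(),
                                     nn.Linear(64, n_actions))
        def forward(self, local_obs):
            return self.net(local_obs)        # per-robot action logits

    class CentralCritic(nn.Module):           # used during training only
        def __init__(self, joint_obs_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(joint_obs_dim, 128), nn.Tanh(),
                                     nn.Linear(128, 1))
        def forward(self, joint_obs):
            return self.net(joint_obs)        # value of the joint state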
★ Raspi$^2$USBL: An open-source Raspberry Pi-Based Passive Inverted Ultra-Short Baseline Positioning System for Underwater Robotics
Precise underwater positioning remains a fundamental challenge for underwater
robotics since global navigation satellite system (GNSS) signals cannot
penetrate the sea surface. This paper presents Raspi$^2$USBL, an open-source,
Raspberry Pi-based passive inverted ultra-short baseline (piUSBL) positioning
system designed to provide a low-cost and accessible solution for underwater
robotic research. The system comprises a passive acoustic receiver and an
active beacon. The receiver adopts a modular hardware architecture that
integrates a hydrophone array, a multichannel preamplifier, an oven-controlled
crystal oscillator (OCXO), a Raspberry Pi 5, and an MCC-series data acquisition
(DAQ) board. In addition to the Pi 5, OCXO, and MCC board, the beacon comprises an
impedance-matching network, a power amplifier, and a transmitting transducer.
An open-source C++ software framework provides high-precision clock
synchronization and triggering for one-way travel-time (OWTT) messaging, while
performing real-time signal processing, including matched filtering, array
beamforming, and adaptive gain control, to estimate the time of flight (TOF)
and direction of arrival (DOA) of received signals. The Raspi$^2$USBL system
was experimentally validated in an anechoic tank, freshwater lake, and open-sea
trials. Results demonstrate a slant-range accuracy better than 0.1%, a bearing
accuracy within 0.1$^\circ$, and stable performance over operational distances
up to 1.3 km. These findings confirm that low-cost, reproducible hardware can
deliver research-grade underwater positioning accuracy. By releasing both the
hardware and software as open-source, Raspi$^2$USBL provides a unified
reference platform that lowers the entry barrier for underwater robotics
laboratories, fosters reproducibility, and promotes collaborative innovation in
underwater acoustic navigation and swarm robotics.
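At the heart of OWTT ranging, the matched-filter TOF estimate reduces to the
lag of a cross-correlation peak against the known transmit waveform; a numpy
sketch with sampling rate and sound speed as illustrative values, omitting
the beamforming and adaptive gain control the real pipeline performs:

    import numpy as np

    def tof_matched_filter(rx, tx_replica, fs):
        # rx: received samples; tx_replica: transmitted waveform; fs in Hz.
        corr = np.correlate(rx, tx_replica, mode="valid")
        lag = int(np.argmax(np.abs(corr)))   # sample index of the best match
        return lag / fs                      # one-way travel time in seconds

    # slant_range_m = tof_matched_filter(rx, chirp, fs=192_000) * 1500.0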
★ Integration of Visual SLAM into Consumer-Grade Automotive Localization
Accurate ego-motion estimation in consumer-grade vehicles currently relies on
proprioceptive sensors, i.e. wheel odometry and IMUs, whose performance is
limited by systematic errors and imperfect calibration. While visual-inertial SLAM has
become a standard in robotics, its integration into automotive ego-motion
estimation remains largely unexplored. This paper investigates how visual SLAM
can be integrated into consumer-grade vehicle localization systems to improve
performance. We propose a framework that fuses visual SLAM with a lateral
vehicle dynamics model to achieve online gyroscope calibration under realistic
driving conditions. Experimental results demonstrate that vision-based
integration significantly improves gyroscope calibration accuracy and thus
enhances overall localization performance, highlighting a promising path toward
higher automotive localization accuracy. We provide results on both proprietary
and public datasets, showing improved performance and superior localization
accuracy on a public benchmark compared to state-of-the-art methods.
comment: This manuscript has been submitted to the IEEE for possible
publication
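One way to picture the calibration step: regress the gyro yaw rate against a
yaw-rate reference recovered from visual SLAM. The batch least-squares
version below is only a sketch; the paper performs this online, fused with a
lateral vehicle-dynamics model:

    import numpy as np

    def calibrate_gyro(gyro_rate, slam_rate):
        # Solve slam_rate ~ scale * gyro_rate + bias in least squares.
        A = np.column_stack([gyro_rate, np.ones_like(gyro_rate)])
        (scale, bias), *_ = np.linalg.lstsq(A, slam_rate, rcond=None)
        return scale, bias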
★ Multi-Agent AI Framework for Road Situation Detection and C-ITS Message Generation
Conventional road-situation detection methods achieve strong performance in
predefined scenarios but fail in unseen cases and lack semantic interpretation,
which is crucial for reliable traffic recommendations. This work introduces a
multi-agent AI framework that combines multimodal large language models (MLLMs)
with vision-based perception for road-situation monitoring. The framework
processes camera feeds and coordinates dedicated agents for situation
detection, distance estimation, decision-making, and Cooperative Intelligent
Transport System (C-ITS) message generation. Evaluation is conducted on a
custom dataset of 103 images extracted from 20 videos of the TAD dataset. Both
Gemini-2.0-Flash and Gemini-2.5-Flash were evaluated. The results show 100%
recall in situation detection and perfect message schema correctness; however,
both models suffer from false-positive detections and show reduced performance
on the number of lanes, driving lane status, and cause code fields. Surprisingly,
Gemini-2.5-Flash, though more capable in general tasks, underperforms
Gemini-2.0-Flash in detection accuracy and semantic understanding and incurs
higher latency (Table II). These findings motivate further work on fine-tuning
specialized LLMs or MLLMs tailored for intelligent transportation applications.
comment: submitted to TRA 2026
★ Aerial Image Stitching Using IMU Data from a UAV
Unmanned Aerial Vehicles (UAVs) are widely used for aerial photography and
remote sensing applications. One of the main challenges is to stitch together
multiple images into a single high-resolution image that covers a large area.
Feature-based image stitching algorithms are commonly used but can suffer from
errors and ambiguities in feature detection and matching. To address this,
several approaches have been proposed, including using bundle adjustment
techniques or direct image alignment. In this paper, we present a novel method
that uses a combination of IMU data and computer vision techniques for
stitching images captured by a UAV. Our method involves several steps such as
estimating the displacement and rotation of the UAV between consecutive images,
correcting for perspective distortion, and computing a homography matrix. We
then use a standard image stitching algorithm to align and blend the images
together. Our proposed method leverages the additional information provided by
the IMU data, corrects for various sources of distortion, and can be easily
integrated into existing UAV workflows. Our experiments demonstrate the
effectiveness and robustness of our method, outperforming some of the existing
feature-based image stitching algorithms in terms of accuracy and reliability,
particularly in challenging scenarios such as large displacements, rotations,
and variations in camera pose.
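The geometric core of IMU-aided stitching is standard: for a pure rotation R
between two camera poses and intrinsics K, the induced homography is
H = K R K^{-1} (translation additionally requires the plane-induced form).
A minimal sketch; the cv2 call in the comment is the usual way to apply it:

    import numpy as np

    def rotation_homography(K, R):
        # Maps pixels of view 2 into view 1 for rotation-only camera motion.
        return K @ R @ np.linalg.inv(K)

    # warped = cv2.warpPerspective(img2, rotation_homography(K, R_1_2), size)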
★ PanoNav: Mapless Zero-Shot Object Navigation with Panoramic Scene Parsing and Dynamic Memory AAAI 2026
Zero-shot object navigation (ZSON) in unseen environments remains a
challenging problem for household robots, requiring strong perceptual
understanding and decision-making capabilities. While recent methods leverage
metric maps and Large Language Models (LLMs), they often depend on depth
sensors or prebuilt maps, limiting the spatial reasoning ability of Multimodal
Large Language Models (MLLMs). Mapless ZSON approaches have emerged to address
this, but they typically make short-sighted decisions, leading to local
deadlocks due to a lack of historical context. We propose PanoNav, a fully
RGB-only, mapless ZSON framework that integrates a Panoramic Scene Parsing
module to unlock the spatial parsing potential of MLLMs from panoramic RGB
inputs, and a Memory-guided Decision-Making mechanism enhanced by a Dynamic
Bounded Memory Queue to incorporate exploration history and avoid local
deadlocks. Experiments on the public navigation benchmark show that PanoNav
significantly outperforms representative baselines in both SR and SPL metrics.
comment: Accepted as a poster in AAAI 2026
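The Dynamic Bounded Memory Queue is easy to picture as a capped FIFO of
recent exploration summaries fed back into the MLLM prompt, so context stays
bounded while enough history survives to escape local deadlocks. A plain
deque illustration under that assumption, not the paper's exact design:

    from collections import deque

    memory = deque(maxlen=8)           # bound of 8 steps is an assumed value

    def remember(step_summary):
        memory.append(step_summary)    # oldest entry drops automatically

    def build_prompt(panorama_desc, goal):
        history = " | ".join(memory)
        return (f"Goal: {goal}\nHistory: {history}\n"
                f"Scene: {panorama_desc}\nWhich direction next?")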
★ Vision-Based System Identification of a Quadrotor
This paper explores the application of vision-based system identification
techniques in quadrotor modeling and control. Through experiments and analysis,
we address the complexities and limitations of quadrotor modeling, particularly
in relation to thrust and drag coefficients. Grey-box modeling is employed to
mitigate uncertainties, and the effectiveness of an onboard vision system is
evaluated. An LQR controller is designed based on a system identification model
using data from the onboard vision system. The results demonstrate consistent
performance between the models, validating the efficacy of vision-based system
identification. This study highlights the potential of vision-based techniques
in enhancing quadrotor modeling and control, contributing to improved
performance and operational capabilities. Our findings provide insights into
the usability and consistency of these techniques, paving the way for future
research in quadrotor performance enhancement, fault detection, and
decision-making processes.
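Given an identified linear model x_dot = A x + B u from the grey-box stage,
the LQR synthesis itself is a few lines; A and B come from system
identification, and the Q, R weights below are design choices, not values
from the paper:

    import numpy as np
    from scipy.linalg import solve_continuous_are

    def lqr_gain(A, B, Q, R):
        # Solve the continuous-time algebraic Riccati equation, then
        # form the state-feedback gain for u = -K x.
        P = solve_continuous_are(A, B, Q, R)
        return np.linalg.solve(R, B.T @ P)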
★ Vision-Aided Online A* Path Planning for Efficient and Safe Navigation of Service Robots
The deployment of autonomous service robots in human-centric environments is
hindered by a critical gap in perception and planning. Traditional navigation
systems rely on expensive LiDARs that, while geometrically precise, are
semantically unaware: they cannot distinguish an important document on an
office floor from a harmless piece of litter, treating both as physically
traversable. While advanced semantic segmentation exists, no prior work has
successfully integrated this visual intelligence into a real-time path planner
that is efficient enough for low-cost, embedded hardware. This paper presents
a framework to bridge this gap, delivering context-aware navigation on an
affordable
robotic platform. Our approach centers on a novel, tight integration of a
lightweight perception module with an online A* planner. The perception system
employs a semantic segmentation model to identify user-defined visual
constraints, enabling the robot to navigate based on contextual importance
rather than physical size alone. This adaptability allows an operator to define
what is critical for a given task, be it sensitive papers in an office or
safety lines in a factory, thus resolving the ambiguity of what to avoid. This
semantic perception is seamlessly fused with geometric data. The identified
visual constraints are projected as non-geometric obstacles onto a global map
that is continuously updated from sensor data, enabling robust navigation
through both partially known and unknown environments. We validate our
framework through extensive experiments in high-fidelity simulations and on a
real-world robotic platform. The results demonstrate robust, real-time
performance, proving that a cost-effective robot can safely navigate complex
environments while respecting critical visual cues invisible to traditional
planners.
comment: 10 pages
★ Human-Level Actuation for Humanoids
Claims that humanoid robots achieve "human-level" actuation are common but
rarely quantified. Peak torque or speed specifications tell us little about
whether a joint can deliver the right combination of torque, power, and
endurance at task-relevant postures and rates. We introduce a comprehensive
framework that makes "human-level" measurable and comparable across systems.
Our approach has three components. First, a kinematic DoF atlas standardizes
joint coordinate systems and ranges of motion using ISB-based conventions,
ensuring that human and robot joints are compared in the same reference
frames. Second, Human-Equivalence Envelopes (HEE) define per-joint
requirements by measuring whether a robot meets human torque and power
simultaneously at the same joint angle and rate $(q,\omega)$, weighted by
positive mechanical work in task-specific bands (walking, stairs, lifting,
reaching, and hand actions). Third, the Human-Level Actuation Score (HLAS)
aggregates six physically grounded factors: workspace coverage
(ROM and DoF), HEE coverage, torque-mode bandwidth, efficiency, and thermal
sustainability. We provide detailed measurement protocols using dynamometry,
electrical power monitoring, and thermal testing that yield every HLAS input
from reproducible experiments. A worked example demonstrates HLAS computation
for a multi-joint humanoid, showing how the score exposes actuator trade-offs
(gearing ratio versus bandwidth and efficiency) that peak-torque specifications
obscure. The framework serves as both a design specification for humanoid
development and a benchmarking standard for comparing actuation systems, with
all components grounded in published human biomechanics data.
comment: 61 pages, 8 figures, 7 tables, and 12 numbered equations
★ SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation
Taisei Hanyu, Nhat Chung, Huy Le, Toan Nguyen, Yuki Ikebe, Anthony Gunderman, Duy Nguyen Ho Minh, Khoa Vo, Tung Kieu, Kashu Yamazaki, Chase Rainwater, Anh Nguyen, Ngan Le
Inspired by how humans reason over discrete objects and their relationships,
we explore whether compact object-centric and object-relation representations
can form a foundation for multitask robotic manipulation. Most existing robotic
multitask models rely on dense embeddings that entangle both object and
background cues, raising concerns about both efficiency and interpretability.
In contrast, we study object-relation-centric representations as a pathway to
more structured, efficient, and explainable visuomotor control. Our
contributions are two-fold. First, we introduce LIBERO+, a fine-grained
benchmark dataset designed to enable and evaluate object-relation reasoning in
robotic manipulation. Unlike prior datasets, LIBERO+ provides object-centric
annotations that enrich demonstrations with box- and mask-level labels as well
as instance-level temporal tracking, supporting compact and interpretable
visuomotor representations. Second, we propose SlotVLA, a slot-attention-based
framework that captures both objects and their relations for action decoding.
It uses a slot-based visual tokenizer to maintain consistent temporal object
representations, a relation-centric decoder to produce task-relevant
embeddings, and an LLM-driven module that translates these embeddings into
executable actions. Experiments on LIBERO+ demonstrate that object-centric slot
and object-relation slot representations drastically reduce the number of
required visual tokens, while providing competitive generalization. Together,
LIBERO+ and SlotVLA provide a compact, interpretable, and effective foundation
for advancing object-relation-centric robotic manipulation.
comment: under review
★ Semi-distributed Cross-modal Air-Ground Relative Localization IROS 2025
Weining Lu, Deer Bin, Lian Ma, Ming Ma, Zhihao Ma, Xiangyang Chen, Longfei Wang, Yixiao Feng, Zhouxian Jiang, Yongliang Shi, Bin Liang
Efficient, accurate, and flexible relative localization is crucial in
air-ground collaborative tasks. However, current approaches for robot relative
localization are primarily realized in the form of distributed multi-robot SLAM
systems with the same sensor configuration, which are tightly coupled with the
state estimation of all robots, limiting both flexibility and accuracy. To this
end, we fully leverage the high capacity of the Unmanned Ground Vehicle (UGV) to
integrate multiple sensors, enabling a semi-distributed cross-modal air-ground
relative localization framework. In this work, both the UGV and the Unmanned
Aerial Vehicle (UAV) independently perform SLAM while extracting deep
learning-based keypoints and global descriptors, which decouples the relative
localization from the state estimation of all agents. The UGV employs a local
Bundle Adjustment (BA) with LiDAR, camera, and an IMU to rapidly obtain
accurate relative pose estimates. The BA process adopts sparse keypoint
optimization and is divided into two stages: First, optimizing camera poses
interpolated from LiDAR-Inertial Odometry (LIO), followed by estimating the
relative camera poses between the UGV and UAV. Additionally, we implement an
incremental loop closure detection algorithm using deep learning-based
descriptors to maintain and retrieve keyframes efficiently. Experimental
results demonstrate that our method achieves outstanding performance in both
accuracy and efficiency. Unlike traditional multi-robot SLAM approaches that
transmit images or point clouds, our method only transmits keypoint pixels and
their descriptors, effectively constraining the communication bandwidth under
0.3 Mbps. Codes and data will be publicly available on
https://github.com/Ascbpiac/cross-model-relative-localization.git.
comment: 7 pages, 3 figures. Accepted by IROS 2025
★ Physically-Grounded Goal Imagination: Physics-Informed Variational Autoencoder for Self-Supervised Reinforcement Learning
Self-supervised goal-conditioned reinforcement learning enables robots to
autonomously acquire diverse skills without human supervision. However, a
central challenge is the goal setting problem: robots must propose feasible and
diverse goals that are achievable in their current environment. Existing
methods like RIG (Visual Reinforcement Learning with Imagined Goals) use
a variational autoencoder (VAE) to generate goals in a learned latent space,
but often produce physically implausible goals that hinder learning
efficiency. We propose Physics-Informed RIG (PI-RIG), which integrates
physical constraints directly into the VAE training process through a novel
Enhanced Physics-Informed Variational Autoencoder (Enhanced p3-VAE), enabling
the generation of physically consistent and achievable goals. Our key
innovation is the explicit separation of the latent space into physics
variables governing object dynamics and environmental factors capturing visual
appearance, while enforcing physical consistency through differential equation
constraints and conservation laws. This enables the generation of physically
consistent and achievable goals that respect fundamental physical principles
such as object permanence, collision constraints, and dynamic feasibility.
Through extensive experiments, we demonstrate that this physics-informed goal
generation significantly improves the quality of proposed goals, leading to
more effective exploration and better skill acquisition in visual robotic
manipulation tasks including reaching, pushing, and pick-and-place scenarios.
★ Programmable Telescopic Soft Pneumatic Actuators for Deployable and Shape Morphing Soft Robots
Soft Robotics presents a rich canvas for free-form and continuum devices
capable of exerting forces in any direction and transforming between arbitrary
configurations. However, there is no current way to tractably and directly
exploit the design freedom due to the curse of dimensionality. Parameterisable
sets of designs offer a pathway towards tractable, modular soft robotics that
appropriately harness the behavioural freedom of soft structures to create
rich embodied behaviours. In this work, we present a parametrised class of soft
actuators, Programmable Telescopic Soft Pneumatic Actuators (PTSPAs). PTSPAs
expand axially on inflation for deployable structures and manipulation in
challenging confined spaces. We introduce a parametric geometry generator to
customise actuator models from high-level inputs, and explore the new design
space through semi-automated experimentation and systematic exploration of key
parameters. Using it we characterise the actuators' extension/bending,
expansion, and stiffness and reveal clear relationships between key design
parameters and performance. Finally we demonstrate the application of the
actuators in a deployable soft quadruped whose legs deploy to walk, enabling
automatic adaptation to confined spaces. PTSPAs present a new design paradigm
for deployable and shape-morphing structures, and for applications wherever
large length changes are required.
comment: 8 pages, 10 figures, Submitted to Robosoft 2026
★ Rapidly Learning Soft Robot Control via Implicit Time-Stepping
With the explosive growth of rigid-body simulators, policy learning in
simulation has become the de facto standard for most rigid morphologies. In
contrast, soft robotic simulation frameworks remain scarce and are seldom
adopted by the soft robotics community. This gap stems partly from the lack of
easy-to-use, general-purpose frameworks and partly from the high computational
cost of accurately simulating continuum mechanics, which often renders policy
learning infeasible. In this work, we demonstrate that rapid soft robot policy
learning is indeed achievable via implicit time-stepping. Our simulator of
choice, DisMech, is a general-purpose, fully implicit soft-body simulator
capable of handling both soft dynamics and frictional contact. We further
introduce delta natural curvature control, a method analogous to delta joint
position control in rigid manipulators, providing an intuitive and effective
means of enacting control for soft robot learning. To highlight the benefits of
implicit time-stepping and delta curvature control, we conduct extensive
comparisons across four diverse soft manipulator tasks against one of the most
widely used soft-body frameworks, Elastica. With implicit time-stepping,
parallel stepping of 500 environments achieves up to 6x faster speeds for
non-contact cases and up to 40x faster for contact-rich scenarios. Finally, a
comprehensive sim-to-sim gap evaluation--training policies in one simulator and
evaluating them in another--demonstrates that implicit time-stepping provides a
rare free lunch: dramatic speedups achieved without sacrificing accuracy.
comment: Code: https://github.com/QuantuMope/dismech-rl
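For readers outside the simulation community: implicit time-stepping puts the
unknown next state on both sides of the update and solves for it each step
(typically by Newton's method), which is what keeps large steps stable on
stiff elastic systems. A generic backward-Euler statement, not DisMech's
exact discretization:

    q_{t+1} = q_t + h\,\dot{q}_{t+1}, \qquad
    M\,\dot{q}_{t+1} = M\,\dot{q}_t + h\,F\!\left(q_{t+1}, \dot{q}_{t+1}\right)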
★ How Do VLAs Effectively Inherit from VLMs?
Vision-language-action (VLA) models hold the promise to attain generalizable
embodied control. To achieve this, a pervasive paradigm is to leverage the rich
vision-semantic priors of large vision-language models (VLMs). However, the
fundamental question persists: How do VLAs effectively inherit the prior
knowledge from VLMs? To address this critical question, we introduce a
diagnostic benchmark, GrinningFace, an emoji tabletop manipulation task where
the robot arm is asked to place objects onto printed emojis corresponding to
language instructions. This task design is particularly revealing -- knowledge
associated with emojis is ubiquitous in Internet-scale datasets used for VLM
pre-training, yet emojis themselves are largely absent from standard robotics
datasets. Consequently, they provide a clean proxy: successful task completion
indicates effective transfer of VLM priors to embodied control. We implement
this diagnostic task both in simulation and on a real robot, and
compare various promising techniques for knowledge transfer. Specifically, we
investigate the effects of parameter-efficient fine-tuning, VLM freezing,
co-training, predicting discretized actions, and predicting latent actions.
Through systematic evaluation, our work not only demonstrates the critical
importance of preserving VLM priors for the generalization of VLA but also
establishes guidelines for future research in developing truly generalizable
embodied AI systems.
★ On Accurate and Robust Estimation of 3D and 2D Circular Center: Method and Application to Camera-Lidar Calibration
Circular targets are widely used in LiDAR-camera extrinsic calibration due to
their geometric consistency and ease of detection. However, achieving accurate
3D-2D circular center correspondence remains challenging. Existing methods
often fail due to decoupled 3D fitting and erroneous 2D ellipse-center
estimation. To address this, we propose a geometrically principled framework
featuring two innovations: (i) a robust 3D circle center estimator based on
conformal geometric algebra and RANSAC; and (ii) a chord-length variance
minimization method to recover the true 2D projected center, resolving its
dual-minima ambiguity via homography validation or a quasi-RANSAC fallback.
Evaluated on synthetic and real-world datasets, our framework significantly
outperforms state-of-the-art approaches. It reduces extrinsic estimation error
and enables robust calibration across diverse sensors and target types,
including natural circular objects. Our code will be publicly released for
reproducibility.
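The chord-length variance objective can be written down directly: for a
fitted ellipse in conic form x^T A x + b^T x + c = 0 and a candidate interior
point p, intersect lines through p with the conic and score the variance of
the chord lengths; the projected circle center is a minimizer. This is a
standalone sketch of the objective only; resolving its two minima is what the
paper's homography validation and quasi-RANSAC fallback address:

    import numpy as np

    def chord_length_variance(p, A, b, c, n_dirs=180):
        lengths = []
        for th in np.linspace(0.0, np.pi, n_dirs, endpoint=False):
            d = np.array([np.cos(th), np.sin(th)])
            qa = d @ A @ d                   # quadratic term along the line
            qb = 2.0 * (p @ A @ d) + b @ d   # linear term
            qc = p @ A @ p + b @ p + c       # constant term
            disc = qb * qb - 4.0 * qa * qc
            if disc > 0:                     # line actually cuts the ellipse
                lengths.append(np.sqrt(disc) / abs(qa))  # chord length |t1-t2|
        return np.var(lengths)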
♻ ★ Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions
Kaifeng Zhang, Shuo Sha, Hanxiao Jiang, Matthew Loper, Hyunjong Song, Guangyan Cai, Zhuo Xu, Xiaochen Hu, Changxi Zheng, Yunzhu Li
Robotic manipulation policies are advancing rapidly, but their direct
evaluation in the real world remains costly, time-consuming, and difficult to
reproduce, particularly for tasks involving deformable objects. Simulation
provides a scalable and systematic alternative, yet existing simulators often
fail to capture the coupled visual and physical complexity of soft-body
interactions. We present a real-to-sim policy evaluation framework that
constructs soft-body digital twins from real-world videos and renders robots,
objects, and environments with photorealistic fidelity using 3D Gaussian
Splatting. We validate our approach on representative deformable manipulation
tasks, including plush toy packing, rope routing, and T-block pushing,
demonstrating that simulated rollouts correlate strongly with real-world
execution performance and reveal key behavioral patterns of learned policies.
Our results suggest that combining physics-informed reconstruction with
high-quality rendering enables reproducible, scalable, and accurate evaluation
of robotic manipulation policies. Website: https://real2sim-eval.github.io/
comment: The first two authors contributed equally. Website:
https://real2sim-eval.github.io/
♻ ★ IMPACT: Behavioral Intention-aware Multimodal Trajectory Prediction with Adaptive Context Trimming
Jiawei Sun, Xibin Yue, Jiahui Li, Tianle Shen, Chengran Yuan, Shuo Sun, Sheng Guo, Quanyun Zhou, Marcelo H Ang Jr
While most prior research has focused on improving the precision of
multimodal trajectory predictions, the explicit modeling of multimodal
behavioral intentions (e.g., yielding, overtaking) remains relatively
underexplored. This paper proposes a unified framework that jointly predicts
both behavioral intentions and trajectories to enhance prediction accuracy,
interpretability, and efficiency. Specifically, we employ a shared context
encoder for both intention and trajectory predictions, thereby reducing
structural redundancy and information loss. Moreover, we address the lack of
ground-truth behavioral intention labels in mainstream datasets (Waymo,
Argoverse) by auto-labeling these datasets, thus advancing the community's
efforts in this direction. We further introduce a vectorized occupancy
prediction module that infers the probability of each map polyline being
occupied by the target vehicle's future trajectory. By leveraging these
intention and occupancy prediction priors, our method conducts dynamic,
modality-dependent pruning of irrelevant agents and map polylines in the
decoding stage, effectively reducing computational overhead and mitigating
noise from non-critical elements. Our approach ranks first among LiDAR-free
methods on the Waymo Motion Dataset and achieves first place on the Waymo
Interactive Prediction Dataset. Remarkably, even without model ensembling, our
single-model framework improves the soft mean average precision (softmAP) by 10
percent compared to the second-best method in the Waymo Interactive Prediction
Leaderboard. Furthermore, the proposed framework has been successfully deployed
on real vehicles, demonstrating its practical effectiveness in real-world
applications.
comment: accepted by IEEE Robotics and Automation Letters
♻ ★ A High-Speed Time-Optimal Trajectory Generation Strategy via a Two-layer Planning Model
MPC (Model predictive control)-based motion planning and trajectory
generation are essential in applications such as unmanned aerial vehicles,
robotic manipulators, and rocket control. However, the real-time implementation
of such optimization-based planning faces significant challenges arising from
non-convex problem structures and inherent limitations of nonlinear programming
-- notably the difficulty in guaranteeing solution quality and the
unpredictability of computation time. To improve robustness and computational
efficiency, this paper introduces a two-layer motion planning algorithm for
intelligent ground vehicles based on convex optimization. The proposed
algorithm iteratively constructs discrete optimal control subproblems with
small, fixed terminal times, referred to as planning cycles. Each planning
cycle is further solved within progressively constructed convex sets generated
by utilizing customized search algorithms. The entire solution to the original
problem is obtained by incrementally composing the solutions of these
subproblems. The proposed algorithm demonstrates enhanced reliability and
significantly reduced computation time. We support our approach with
theoretical analysis under practical assumptions and numerical experiments.
Comparative results with standard sequential convex programming (SCP) methods
demonstrate the superiority of our method -- including significantly improved
computational speed in dynamic environments while maintaining a near-optimal
final time.
♻ ★ Pure Vision Language Action (VLA) Models: A Comprehensive Survey
The emergence of Vision Language Action (VLA) models marks a paradigm shift
from traditional policy-based control to generalized robotics, reframing Vision
Language Models (VLMs) from passive sequence generators into active agents for
manipulation and decision-making in complex, dynamic environments. This survey
delves into advanced VLA methods, aiming to provide a clear taxonomy and a
systematic, comprehensive review of existing research. It presents a
comprehensive analysis of VLA applications across different scenarios and
classifies VLA approaches into several paradigms: autoregression-based,
diffusion-based, reinforcement-based, hybrid, and specialized methods; while
examining their motivations, core strategies, and implementations in detail. In
addition, foundational datasets, benchmarks, and simulation platforms are
introduced. Building on the current VLA landscape, the review further proposes
perspectives on key challenges and future directions to advance research in VLA
models and generalizable robotics. By synthesizing insights from over three
hundred recent studies, this survey maps the contours of this rapidly evolving
field and highlights the opportunities and challenges that will shape the
development of scalable, general-purpose VLA methods.
♻ ★ Whole-body motion planning and safety-critical control for aerial manipulation
Aerial manipulation combines the maneuverability of multirotors with the
dexterity of robotic arms to perform complex tasks in cluttered spaces. Yet
planning safe, dynamically feasible trajectories remains difficult due to
whole-body collision avoidance and the conservativeness of common geometric
abstractions such as bounding boxes or ellipsoids. We present a whole-body
motion planning and safety-critical control framework for aerial manipulators
built on superquadrics (SQs). Using an SQ-plus-proxy representation, we model
both the vehicle and obstacles with differentiable, geometry-accurate surfaces.
Leveraging this representation, we introduce a maximum-clearance planner that
fuses Voronoi diagrams with an equilibrium-manifold formulation to generate
smooth, collision-aware trajectories. We further design a safety-critical
controller that jointly enforces thrust limits and collision avoidance via
high-order control barrier functions. In simulation, our approach outperforms
sampling-based planners in cluttered environments, producing faster, safer, and
smoother trajectories and exceeding ellipsoid-based baselines in geometric
fidelity. Real-world experiments on a physical aerial-manipulation platform confirm
feasibility and robustness, demonstrating consistent performance across
simulation and hardware settings. The video can be found at
https://youtu.be/hQYKwrWf1Ak.
comment: Submitted to 2026 IFAC World Congress with the Journal option
(MECHATRONICS)
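The abstract leans on superquadrics without restating them; the standard
inside-outside function from the superquadric literature is the
differentiable, geometry-accurate surface model meant here ($a_i$ semi-axes,
$\epsilon_1,\epsilon_2$ shape exponents; F = 1 on the surface, F < 1 inside):

    F(x,y,z) \;=\; \left(
      \left|\tfrac{x}{a_1}\right|^{2/\epsilon_2} +
      \left|\tfrac{y}{a_2}\right|^{2/\epsilon_2}
    \right)^{\epsilon_2/\epsilon_1} +
    \left|\tfrac{z}{a_3}\right|^{2/\epsilon_1}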
♻ ★ A Step Toward World Models: A Survey on Robotic Manipulation
Autonomous agents are increasingly expected to operate in complex, dynamic,
and uncertain environments, performing tasks such as manipulation, navigation,
and decision-making. Achieving these capabilities requires agents to understand
the underlying mechanisms and dynamics of the world, moving beyond reactive
control or simple replication of observed states. This motivates the
development of world models as internal representations that encode
environmental states, capture dynamics, and support prediction, planning, and
reasoning. Despite growing interest, the definition, scope, architectures, and
essential capabilities of world models remain ambiguous. In this survey, we go
beyond prescribing a fixed definition and limiting our scope to methods
explicitly labeled as world models. Instead, we examine approaches that exhibit
the core capabilities of world models through a review of methods in robotic
manipulation. We analyze their roles across perception, prediction, and
control, identify key challenges and solutions, and distill the core
components, capabilities, and functions that a fully realized world model
should possess. Building on this analysis, we aim to motivate further
development toward generalizable and practical world models for robotics.
comment: 24 pages, 5 figures
♻ ★ Scalable Offline Metrics for Autonomous Driving IROS 2025
Real-world evaluation of perception-based planning models for robotic
systems, such as autonomous vehicles, can be safely and inexpensively conducted
offline, i.e. by computing model prediction error over a pre-collected
validation dataset with ground-truth annotations. However, extrapolating from
offline model performance to online settings remains a challenge. In these
settings, seemingly minor errors can compound and result in test-time
infractions or collisions. This relationship is understudied, particularly
across diverse closed-loop metrics and complex urban maneuvers. In this work,
we revisit this undervalued question in policy evaluation through an extensive
set of experiments across diverse conditions and metrics. Based on analysis in
simulation, we find an even worse correlation between offline and online
settings than reported by prior studies, casting doubt on the validity of
current evaluation practices and metrics for driving policies. Next, we bridge
the gap between offline and online evaluation. We investigate an offline metric
based on epistemic uncertainty, which aims to capture events that are likely to
cause errors in closed-loop settings. The resulting metric achieves over 13%
improvement in correlation compared to previous offline metrics. We further
validate the generalization of our findings beyond the simulation environment
in real-world settings, where even greater gains are observed.
comment: Accepted at IROS 2025 (IEEE/RSJ International Conference on
Intelligent Robots and Systems); typos corrected
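One common way to instantiate an epistemic-uncertainty metric of the kind
investigated here is ensemble disagreement: score each validation sample by
the variance of predictions across independently trained models, then check
how the scores correlate with closed-loop infractions. The paper's estimator
may differ; this is only the generic pattern:

    import numpy as np

    def epistemic_score(models, x):
        # models: independently trained policies; x: one validation input.
        preds = np.stack([m(x) for m in models])   # (n_models, action_dim)
        return preds.var(axis=0).mean()            # high variance = risky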