Robotics
★ Shapes of Cognition for Computational Cognitive Modeling
Shapes of cognition is a new conceptual paradigm for the computational
cognitive modeling of Language-Endowed Intelligent Agents (LEIAs). Shapes are
remembered constellations of sensory, linguistic, conceptual, episodic, and
procedural knowledge that allow agents to cut through the complexity of real
life the same way as people do: by expecting things to be typical, recognizing
patterns, acting by habit, reasoning by analogy, satisficing, and generally
minimizing cognitive load to the degree situations permit. Atypical outcomes
are treated using shapes-based recovery methods, such as learning on the fly,
asking a human partner for help, or seeking an actionable, even if imperfect,
situational understanding. Although shapes is an umbrella term, it is not
vague: shapes-based modeling involves particular objectives, hypotheses,
modeling strategies, knowledge bases, and actual models of wide-ranging
phenomena, all implemented within a particular cognitive architecture. Such
specificity is needed both to vet our hypotheses and to achieve our practical
aims of building useful agent systems that are explainable, extensible, and
worthy of our trust, even in critical domains. However, although the LEIA
example of shapes-based modeling is specific, the principles can be applied
more broadly, giving new life to knowledge-based and hybrid AI.
★ HARMONIC: A Content-Centric Cognitive Robotic Architecture
Sanjay Oruganti, Sergei Nirenburg, Marjorie McShane, Jesse English, Michael K. Roberts, Christian Arndt, Carlos Gonzalez, Mingyo Seo, Luis Sentis
This paper introduces HARMONIC, a cognitive-robotic architecture designed for
robots in human-robotic teams. HARMONIC supports semantic perception
interpretation, human-like decision-making, and intentional language
communication. It addresses the issues of safety and quality of results, aims
to solve problems of data scarcity and explainability, and promotes
transparency and trust. Two proof-of-concept HARMONIC-based robotic systems are
demonstrated, each implemented in both a high-fidelity simulation environment
and on physical robotic platforms.
★ Safety Critical Model Predictive Control Using Discrete-Time Control Density Functions
This paper presents MPC-CDF, a new approach integrating control density
functions (CDFs) within a model predictive control (MPC) framework to ensure
safety-critical control in nonlinear dynamical systems. By using the dual
formulation of the navigation problem, we incorporate CDFs into the MPC
framework, ensuring both convergence and safety in a discrete-time setting.
These density functions are endowed with a physical interpretation, where the
associated measure signifies the occupancy of system trajectories. Leveraging
this occupancy-based perspective, we synthesize safety-critical controllers
using the proposed MPC-CDF framework. We illustrate the safety properties of
this framework using a unicycle model and compare it with a control barrier
function-based method. The efficacy of this approach is demonstrated in the
autonomous safe navigation of an underwater vehicle, which avoids complex and
arbitrary obstacles while achieving the desired level of safety.
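A minimal sketch of the discrete-time MPC loop the abstract above builds on, assuming a unicycle model, a 10-step horizon, and SciPy's SLSQP solver; the distance-based safety constraint here is a simple stand-in for the paper's control density functions, whose occupancy-measure construction is not reproduced.

```python
# Discrete-time MPC sketch for a unicycle with a safety constraint
# (a hypothetical stand-in for the paper's CDF-based condition).
import numpy as np
from scipy.optimize import minimize

DT, H = 0.1, 10                           # step size, horizon
GOAL = np.array([2.0, 2.0])
OBS, R_SAFE = np.array([1.0, 1.0]), 0.3   # obstacle center, radius

def rollout(x0, u):
    """Propagate unicycle dynamics over the horizon."""
    u = u.reshape(H, 2)                   # columns: speed v, turn rate w
    x, traj = x0.copy(), []
    for v, w in u:
        x = x + DT * np.array([v * np.cos(x[2]), v * np.sin(x[2]), w])
        traj.append(x.copy())
    return np.array(traj)

def cost(u, x0):
    traj = rollout(x0, u)
    return np.sum((traj[:, :2] - GOAL) ** 2) + 1e-2 * np.sum(u ** 2)

def safety(u, x0):
    # >= 0 when every predicted state stays outside the unsafe set; the
    # paper's CDF would instead bound the trajectory occupancy measure.
    traj = rollout(x0, u)
    return np.linalg.norm(traj[:, :2] - OBS, axis=1) - R_SAFE

x0 = np.zeros(3)
res = minimize(cost, np.zeros(2 * H), args=(x0,),
               constraints={"type": "ineq", "fun": safety, "args": (x0,)},
               method="SLSQP")
print("first control:", res.x[:2])
```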
★ Design and Control of a Perching Drone Inspired by the Prey-Capturing Mechanism of Venus Flytrap
The endurance and energy efficiency of drones remain critical challenges in
their design and operation. To extend mission duration, numerous studies have
explored perching mechanisms that enable drones to conserve energy by
temporarily suspending flight. This paper presents a new perching drone that
utilizes an active flexible perching mechanism inspired by the rapid predation
mechanism of the Venus flytrap, achieving perching in less than 100 ms. The
proposed system is designed for high-speed adaptability to the perching
targets. The overall drone design is outlined, followed by the development and
validation of the biomimetic perching structure. To enhance system
stability, a cascaded extended high-gain observer (EHGO)-based control method is
developed, which can estimate and compensate for the external disturbance in
real time. The experimental results demonstrate the adaptability of the
perching structure and the superiority of the cascaded EHGO in resisting wind
and perching disturbances.
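The disturbance-compensation idea above can be illustrated with a standard extended high-gain observer on a single second-order axis; the gains, time step, and toy double-integrator plant below are illustrative, not the paper's cascaded design.

```python
# Minimal extended high-gain observer (EHGO) sketch: estimates position,
# velocity, and a lumped disturbance from a position measurement.
import numpy as np

DT, EPS = 0.001, 0.02            # integration step, high-gain parameter
A1, A2, A3 = 3.0, 3.0, 1.0       # observer gains, (s+1)^3 Hurwitz choice

def ehgo_step(xhat, y, u):
    """One Euler step; xhat = [pos, vel, disturbance]."""
    e = y - xhat[0]               # output estimation error
    dxhat = np.array([
        xhat[1] + (A1 / EPS) * e,
        xhat[2] + u + (A2 / EPS**2) * e,
        (A3 / EPS**3) * e,        # extended state tracks the disturbance
    ])
    return xhat + DT * dxhat

# toy plant: double integrator with an unknown constant disturbance d = 0.5
x, xhat, d = np.zeros(2), np.zeros(3), 0.5
for _ in range(5000):
    u = -2.0 * x[0] - 2.0 * x[1] - xhat[2]   # compensate estimated disturbance
    x = x + DT * np.array([x[1], u + d])
    xhat = ehgo_step(xhat, x[0], u)
print("estimated disturbance:", round(xhat[2], 3))   # converges near 0.5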
★ Collaborative Loco-Manipulation for Pick-and-Place Tasks with Dynamic Reward Curriculum
We present a hierarchical RL pipeline for training one-armed legged robots to
perform pick-and-place (P&P) tasks end-to-end -- from approaching the payload
to releasing it at a target area -- in both single-robot and cooperative
dual-robot settings. We introduce a novel dynamic reward curriculum that
enables a single policy to efficiently learn long-horizon P&P operations by
progressively guiding the agents through payload-centered sub-objectives.
Compared to state-of-the-art approaches for long-horizon RL tasks, our method
improves training efficiency by 55% and reduces execution time by 18.6% in
simulation experiments. In the dual-robot case, we show that our policy enables
each robot to attend to different components of its observation space at
distinct task stages, promoting effective coordination via autonomous attention
shifts. We validate our method through real-world experiments using ANYmal D
platforms in both single- and dual-robot scenarios. To our knowledge, this is
the first RL pipeline that tackles the full scope of collaborative P&P with two
legged manipulators.
★ StageACT: Stage-Conditioned Imitation for Robust Humanoid Door Opening
Moonyoung Lee, Dong Ki Kim, Jai Krishna Bandi, Max Smith, Aileen Liao, Ali-akbar Agha-mohammadi, Shayegan Omidshafiei
Humanoid robots promise to operate in everyday human environments without
requiring modifications to the surroundings. Among the many skills needed,
opening doors is essential, as doors are the most common gateways in built
spaces and often limit where a robot can go. Door opening, however, poses
unique challenges: it is a long-horizon task under partial observability,
requiring reasoning about the door's unobservable latch state, which dictates
whether the robot should rotate the handle or push the door. This ambiguity
makes standard behavior cloning prone to mode collapse, yielding blended or
out-of-sequence actions. We introduce StageACT, a stage-conditioned imitation
learning framework that augments low-level policies with task-stage inputs.
This effective addition increases robustness to partial observability, leading
to higher success rates and shorter completion times. On a humanoid operating
in a real-world office environment, StageACT achieves a 55% success rate on
previously unseen doors, more than doubling the best baseline. Moreover, our
method supports intentional behavior guidance through stage prompting, enabling
recovery behaviors. These results highlight stage conditioning as a lightweight
yet powerful mechanism for long-horizon humanoid loco-manipulation.
comment: 7 pages
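Stage conditioning as described above is simple to prototype: the low-level policy consumes the observation concatenated with a one-hot stage input, and stage prompting amounts to overriding that input. The stage names, dimensions, and MLP head below are illustrative, not the paper's architecture.

```python
# Hedged sketch of a stage-conditioned low-level policy.
import torch
import torch.nn as nn

STAGES = ["approach", "turn_handle", "push", "traverse"]  # hypothetical
OBS_DIM, ACT_DIM = 64, 12

class StagedPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + len(STAGES), 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, ACT_DIM),
        )

    def forward(self, obs, stage_idx):
        stage = nn.functional.one_hot(stage_idx, len(STAGES)).float()
        return self.net(torch.cat([obs, stage], dim=-1))

policy = StagedPolicy()
obs = torch.randn(1, OBS_DIM)
action = policy(obs, torch.tensor([1]))   # prompt the "turn_handle" stage
```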
★ ROOM: A Physics-Based Continuum Robot Simulator for Photorealistic Medical Datasets Generation
Salvatore Esposito, Matías Mattamala, Daniel Rebain, Francis Xiatian Zhang, Kevin Dhaliwal, Mohsen Khadem, Subramanian Ramamoorthy
Continuum robots are advancing bronchoscopy procedures by accessing complex
lung airways and enabling targeted interventions. However, their development is
limited by the lack of realistic training and test environments: Real data is
difficult to collect due to ethical constraints and patient safety concerns,
and developing autonomy algorithms requires realistic imaging and physical
feedback. We present ROOM (Realistic Optical Observation in Medicine), a
comprehensive simulation framework designed for generating photorealistic
bronchoscopy training data. By leveraging patient CT scans, our pipeline
renders multi-modal sensor data including RGB images with realistic noise and
light specularities, metric depth maps, surface normals, optical flow and point
clouds at medically relevant scales. We validate the data generated by ROOM in
two canonical tasks for medical robotics -- multi-view pose estimation and
monocular depth estimation, demonstrating diverse challenges that
state-of-the-art methods must overcome to transfer to these medical settings.
Furthermore, we show that the data produced by ROOM can be used to fine-tune
existing depth estimation models to overcome these challenges, also enabling
other downstream applications such as navigation. We expect that ROOM will
enable large-scale data generation across diverse patient anatomies and
procedural scenarios that are challenging to capture in clinical settings. Code
and data: https://github.com/iamsalvatore/room.
★ TeraSim-World: Worldwide Safety-Critical Data Synthesis for End-to-End Autonomous Driving
Safe and scalable deployment of end-to-end (E2E) autonomous driving requires
extensive and diverse data, particularly safety-critical events. Existing data
are mostly generated from simulators with a significant sim-to-real gap or
collected from on-road testing that is costly and unsafe. This paper presents
TeraSim-World, an automated pipeline that synthesizes realistic and
geographically diverse safety-critical data for E2E autonomous driving
anywhere in the world. Starting from an arbitrary location, TeraSim-World
retrieves real-world maps and traffic demand from geospatial data sources.
Then, it simulates agent behaviors from naturalistic driving datasets, and
orchestrates diverse adversities to create corner cases. Informed by street
views of the same location, it achieves photorealistic, geographically grounded
sensor rendering via the frontier video generation model Cosmos-Drive. By
bridging agent and sensor simulations, TeraSim-World provides a scalable and
critical data synthesis framework for training and evaluation of E2E autonomous
driving systems.
comment: 8 pages, 6 figures. Codes and videos are available at
https://wjiawei.com/terasim-world-web/
★ An Uncertainty-Weighted Decision Transformer for Navigation in Dense, Complex Driving Scenarios
Autonomous driving in dense, dynamic environments requires decision-making
systems that can exploit both spatial structure and long-horizon temporal
dependencies while remaining robust to uncertainty. This work presents a novel
framework that integrates multi-channel bird's-eye-view occupancy grids with
transformer-based sequence modeling for tactical driving in complex roundabout
scenarios. To address the imbalance between frequent low-risk states and rare
safety-critical decisions, we propose the Uncertainty-Weighted Decision
Transformer (UWDT). UWDT employs a frozen teacher transformer to estimate
per-token predictive entropy, which is then used as a weight in the student
model's loss function. This mechanism amplifies learning from uncertain,
high-impact states while maintaining stability across common low-risk
transitions. Experiments in a roundabout simulator, across varying traffic
densities, show that UWDT consistently outperforms other baselines in terms of
reward, collision rate, and behavioral stability. The results demonstrate that
uncertainty-aware, spatial-temporal transformers can deliver safer and more
efficient decision-making for autonomous driving in complex traffic
environments.
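The core weighting mechanism described above is compact enough to sketch directly: a frozen teacher's per-token predictive entropy scales the student's cross-entropy loss so that rare, uncertain decisions contribute more. The shapes and the 1 + normalized-entropy weighting rule below are plausible assumptions, not the paper's exact formulation.

```python
# Sketch of uncertainty-weighted training for a decision transformer.
import torch
import torch.nn.functional as F

def uwdt_loss(student_logits, teacher_logits, targets):
    """student/teacher logits: (B, T, V); targets: (B, T) action tokens."""
    with torch.no_grad():
        p = F.softmax(teacher_logits, dim=-1)
        entropy = -(p * torch.log(p.clamp_min(1e-9))).sum(-1)    # (B, T)
        weights = 1.0 + entropy / entropy.mean().clamp_min(1e-9)
    ce = F.cross_entropy(student_logits.flatten(0, 1), targets.flatten(),
                         reduction="none").view_as(targets)
    return (weights * ce).mean()                # up-weight uncertain tokens

loss = uwdt_loss(torch.randn(2, 8, 16), torch.randn(2, 8, 16),
                 torch.randint(16, (2, 8)))
```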
★ Hydrosoft: Non-Holonomic Hydroelastic Models for Compliant Tactile Manipulation
Tactile sensors have long been valued for their perceptual capabilities,
offering rich insights into the otherwise hidden interface between the robot
and grasped objects. Yet their inherent compliance -- a key driver of
force-rich interactions -- remains underexplored. The central challenge is to
capture the complex, nonlinear dynamics introduced by these passive-compliant
elements. Here, we present a computationally efficient non-holonomic
hydroelastic model that accurately models path-dependent contact force
distributions and dynamic surface area variations. Our insight is to extend the
object's state space, explicitly incorporating the distributed forces generated
by the compliant sensor. Our differentiable formulation not only accounts for
path-dependent behavior but also enables gradient-based trajectory
optimization, seamlessly integrating with high-resolution tactile feedback. We
demonstrate the effectiveness of our approach across a range of simulated and
real-world experiments and highlight the importance of modeling the path
dependence of sensor dynamics.
★ Model Predictive Control with Reference Learning for Soft Robotic Intracranial Pressure Waveform Modulation
This paper introduces a learning-based control framework for a soft robotic
actuator system designed to modulate intracranial pressure (ICP) waveforms,
which is essential for studying cerebrospinal fluid dynamics and pathological
processes underlying neurological disorders. A two-layer framework is proposed
to safely achieve a desired ICP waveform modulation. First, a model predictive
controller (MPC) with a disturbance observer is used for offset-free tracking
of the system's motor position reference trajectory under safety constraints.
Second, to address the unknown nonlinear dependence of ICP on the motor
position, we employ a Bayesian optimization (BO) algorithm used for online
learning of a motor position reference trajectory that yields the desired ICP
modulation. The framework is experimentally validated using a test bench with a
brain phantom that replicates realistic ICP dynamics in vitro. Compared to a
previously employed proportional-integral-derivative controller, the MPC
reduces mean and maximum motor position reference tracking errors by 83% and
73%, respectively. In less than 20 iterations, the BO algorithm learns a motor
position reference trajectory that yields an ICP waveform with the desired mean
and amplitude.
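The outer reference-learning loop above can be sketched with an off-the-shelf Bayesian optimizer; here scikit-optimize's gp_minimize searches two hypothetical parameters of a sinusoidal motor reference, and run_experiment is a placeholder for one MPC-tracked test-bench run, not the real hardware interface.

```python
# Hedged sketch of BO over motor-reference parameters to hit ICP targets.
import numpy as np
from skopt import gp_minimize

ICP_MEAN_DES, ICP_AMP_DES = 15.0, 5.0          # mmHg targets (illustrative)

def run_experiment(amp, offset):
    """Placeholder for a tracked bench run; returns measured ICP mean and
    amplitude. Replace with the actual phantom/sensor interface."""
    icp = offset * 0.8 + amp * 0.6 * np.sin(np.linspace(0, 6.28, 100))
    return icp.mean(), (icp.max() - icp.min()) / 2

def objective(params):
    amp, offset = params
    mean, ampl = run_experiment(amp, offset)
    return (mean - ICP_MEAN_DES) ** 2 + (ampl - ICP_AMP_DES) ** 2

res = gp_minimize(objective, dimensions=[(0.0, 20.0), (0.0, 30.0)],
                  n_calls=20, random_state=0)
print("learned reference params:", res.x)
```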
★ Empowering Multi-Robot Cooperation via Sequential World Models
Model-based reinforcement learning (MBRL) has shown significant potential in
robotics due to its high sample efficiency and planning capability. However,
extending MBRL to multi-robot cooperation remains challenging due to the
complexity of joint dynamics. To address this, we propose the Sequential World
Model (SeqWM), a novel framework that integrates the sequential paradigm into
model-based multi-agent reinforcement learning. SeqWM employs independent,
sequentially structured agent-wise world models to decompose complex joint
dynamics. Latent rollouts and decision-making are performed through sequential
communication, where each agent generates its future trajectory and plans its
actions based on the predictions of its predecessors. This design enables
explicit intention sharing, enhancing cooperative performance, and reduces
communication overhead to linear complexity. Results in challenging simulated
environments (Bi-DexHands and Multi-Quad) show that SeqWM outperforms existing
state-of-the-art model-free and model-based baselines in both overall
performance and sample efficiency, while exhibiting advanced cooperative
behaviors such as predictive adaptation and role division. Furthermore, SeqWM
has been successfully deployed on physical quadruped robots, demonstrating its
effectiveness in real-world multi-robot systems. Demos and code are available
at: https://github.com/zhaozijie2022/seqwm-marl
★ A Synthetic Data Pipeline for Supporting Manufacturing SMEs in Visual Assembly Control
Quality control of assembly processes is essential in manufacturing to ensure
not only the quality of individual components but also their proper integration
into the final product. To assist in this matter, automated assembly control
using computer vision methods has been widely implemented. However, the costs
associated with image acquisition, annotation, and training of computer vision
algorithms pose challenges for integration, especially for small- and
medium-sized enterprises (SMEs), which often lack the resources for extensive
training, data collection, and manual image annotation. Synthetic data offers
the potential to reduce manual data collection and labeling. Nevertheless, its
practical application in the context of assembly quality remains limited. In
this work, we present a novel approach for easily integrable and data-efficient
visual assembly control. Our approach leverages simulated scene generation
based on computer-aided design (CAD) data and object detection algorithms. The
results demonstrate a time-saving pipeline for generating image data in
manufacturing environments, achieving a mean Average Precision (mAP@0.5:0.95)
up to 99.5% for correctly identifying instances of synthetic planetary gear
system components within our simulated training data, and up to 93% when
transferred to real-world camera-captured testing data. This research
highlights the effectiveness of synthetic data generation within an adaptable
pipeline and underscores its potential to support SMEs in implementing
resource-efficient visual assembly control solutions.
★ A Design Co-Pilot for Task-Tailored Manipulators
Although robotic manipulators are used in an ever-growing range of
applications, robot manufacturers typically follow a ``one-fits-all''
philosophy, employing identical manipulators in various settings. This often
leads to suboptimal performance, as general-purpose designs fail to exploit
particularities of tasks. The development of custom, task-tailored robots is
hindered by long, cost-intensive development cycles and the high cost of
customized hardware. Recently, various computational design methods have been
devised to overcome the bottleneck of human engineering. In addition, a surge
of modular robots allows quick and economical adaptation to changing industrial
settings. This work proposes an approach to automatically designing and
optimizing robot morphologies tailored to a specific environment. To this end,
we learn the inverse kinematics for a wide range of different manipulators. A
fully differentiable framework realizes gradient-based fine-tuning of designed
robots and inverse kinematics solutions. Our generative approach accelerates
the generation of specialized designs from hours with optimization-based
methods to seconds, serving as a design co-pilot that enables instant
adaptation and effective human-AI collaboration. Numerical experiments show
that our approach finds robots that can navigate cluttered environments and
manipulators that perform well across a specified workspace, and that designs
can be adapted to different hardware constraints. Finally, we demonstrate the real-world
applicability of our method by setting up a modular robot designed in
simulation that successfully moves through an obstacle course.
★ Beyond Anthropomorphism: Enhancing Grasping and Eliminating a Degree of Freedom by Fusing the Abduction of Digits Four and Five
Simon Fritsch, Liam Achenbach, Riccardo Bianco, Nicola Irmiger, Gawain Marti, Samuel Visca, Chenyu Yang, Davide Liconti, Barnabas Gavin Cangan, Robert Jomar Malate, Ronan J. Hinchet, Robert K. Katzschmann
This paper presents the SABD hand, a 16-degree-of-freedom (DoF) robotic hand
that departs from purely anthropomorphic designs to achieve an expanded grasp
envelope, enable manipulation poses beyond human capability, and reduce the
required number of actuators. This is achieved by combining the
adduction/abduction (Add/Abd) joint of digits four and five into a single joint
with a large range of motion. The combined joint increases the workspace of the
digits by 400% and reduces the required DoFs while retaining dexterity.
Experimental results demonstrate that the combined Add/Abd joint enables the
hand to grasp objects with a side distance of up to 200 mm. Reinforcement
learning-based investigations show that the design enables grasping policies
that are effective not only for handling larger objects but also for achieving
enhanced grasp stability. In teleoperated trials, the hand successfully
performed 86% of attempted grasps on suitable YCB objects, including
challenging non-anthropomorphic configurations. These findings validate the
design's ability to enhance grasp stability, flexibility, and dexterous
manipulation without added complexity, making it well-suited for a wide range
of applications.
comment: First five listed authors have equal contribution
★ Practical Handling of Dynamic Environments in Decentralised Multi-Robot Patrol
Persistent monitoring using robot teams is of interest in fields such as
security, environmental monitoring, and disaster recovery. Performing such
monitoring in a fully on-line decentralised fashion has significant potential
advantages for robustness, adaptability, and scalability of monitoring
solutions, including, in principle, the capacity to effectively adapt in
real-time to a changing environment. We examine this through the lens of
multi-robot patrol, in which teams of patrol robots must persistently minimise
time between visits to points of interest, within environments where
traversability of routes is highly dynamic. These dynamics must be observed by
patrol agents and accounted for in a fully decentralised on-line manner. In
this work, we present a new method of monitoring and adjusting for environment
dynamics in a decentralised multi-robot patrol team. We demonstrate that our
method significantly outperforms realistic baselines in highly dynamic
scenarios, and also investigate dynamic scenarios in which explicitly
accounting for environment dynamics may be unnecessary or impractical.
★ DVDP: An End-to-End Policy for Mobile Robot Visual Docking with RGB-D Perception
Automatic docking has long been a significant challenge in the field of
mobile robotics. Compared to other automatic docking methods, visual docking
methods offer higher precision and lower deployment costs, making them an
efficient and promising choice for this task. However, visual docking methods
impose strict requirements on the robot's initial position at the start of the
docking process. To overcome the limitations of current vision-based methods,
we propose an innovative end-to-end visual docking method named DVDP (direct
visual docking policy). This approach requires only a binocular RGB-D camera
installed on the mobile robot to directly output the robot's docking path,
achieving end-to-end automatic docking. Furthermore, we have collected a
large-scale mobile robot visual automatic docking dataset through a
combination of virtual and real environments using the Unity 3D platform and
actual mobile robot setups. We developed a series of evaluation metrics to
quantify the performance of the end-to-end visual docking method. Extensive
experiments, including benchmarks against leading perception backbones adapted
into our framework, demonstrate that our method achieves superior performance.
Finally, real-world deployment on the SCOUT Mini confirmed DVDP's efficacy,
with our model generating smooth, feasible docking trajectories that meet
physical constraints and reach the target pose.
★ Out of Distribution Detection in Self-adaptive Robots with AI-powered Digital Twins
Erblin Isaku, Hassan Sartaj, Shaukat Ali, Beatriz Sanguino, Tongtong Wang, Guoyuan Li, Houxiang Zhang, Thomas Peyrucain
Self-adaptive robots (SARs) in complex, uncertain environments must
proactively detect and address abnormal behaviors, including
out-of-distribution (OOD) cases. To this end, digital twins offer a valuable
solution for OOD detection. Thus, we present a digital twin-based approach for
OOD detection (ODiSAR) in SARs. ODiSAR uses a Transformer-based digital twin to
forecast SAR states and employs reconstruction error and Monte Carlo dropout
for uncertainty quantification. By combining reconstruction error with
predictive variance, the digital twin effectively detects OOD behaviors, even
in previously unseen conditions. The digital twin also includes an
explainability layer that links potential OOD to specific SAR states, offering
insights for self-adaptation. We evaluated ODiSAR by creating digital twins of
two industrial robots: one navigating an office environment, and another
performing maritime ship navigation. In both cases, ODiSAR forecasts SAR
behaviors (i.e., robot trajectories and vessel motion) and proactively detects
OOD events. Our results showed that ODiSAR achieved high detection performance
-- up to 98% AUROC, 96% TNR@TPR95, and 95% F1-score -- while providing
interpretable insights to support self-adaptation.
comment: 15 pages, 4 figures, 3 tables
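The detection score described above combines two quantities that are easy to sketch: forecaster reconstruction error and Monte Carlo dropout variance. The tiny stand-in forecaster and the additive combination below are illustrative assumptions; the paper's digital twin is Transformer-based and its exact score may differ.

```python
# Hedged sketch of an MC-dropout OOD score for a state forecaster.
import torch
import torch.nn as nn

class TwinForecaster(nn.Module):
    """Stand-in for the Transformer digital twin (illustrative)."""
    def __init__(self, dim=6):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(),
                                 nn.Dropout(0.1), nn.Linear(64, dim))

    def forward(self, x):              # x: (B, T, D) state windows
        return self.net(x)

def ood_score(model, window, n_samples=20):
    model.train()                      # keep dropout active for MC sampling
    with torch.no_grad():
        preds = torch.stack([model(window) for _ in range(n_samples)])
    recon_err = (preds.mean(0) - window).pow(2).mean()
    mc_var = preds.var(0).mean()       # predictive (epistemic) variance
    return (recon_err + mc_var).item() # flag OOD above a calibrated bound

score = ood_score(TwinForecaster(), torch.randn(1, 30, 6))
```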
★ Tendon-Based Proprioception in an Anthropomorphic Underactuated Robotic Hand with Series Elastic Actuators
Anthropomorphic underactuated hands are widely employed for their versatility
and structural simplicity. In such systems, compact sensing integration and
proper interpretation aligned with underactuation are crucial for realizing
practical grasp functionalities. This study proposes an anthropomorphic
underactuated hand that achieves comprehensive situational awareness of
hand-object interaction, utilizing tendon-based proprioception provided by
series elastic actuators (SEAs). We developed a compact SEA with high accuracy
and reliability that can be seamlessly integrated into sensorless fingers. By
coupling proprioceptive sensing with potential energy-based modeling, the
system estimates key grasp-related variables, including contact timing, joint
angles, relative object stiffness, and finger configuration changes indicating
external disturbances. These estimated variables enable grasp posture
reconstruction, safe handling of deformable objects, and blind grasping with
proprioceptive-only recognition of objects with varying geometry and stiffness.
Finger-level experiments and hand-level demonstrations confirmed the
effectiveness of the proposed approach. The results demonstrate that
tendon-based proprioception serves as a compact and robust sensing modality for
practical manipulation without reliance on vision or tactile feedback.
comment: 8 pages, 10 figures, Supplementary video, Submitted to IEEE Robotics
and Automation Letters (RA-L)
★ Spatiotemporal Calibration for Laser Vision Sensor in Hand-eye System Based on Straight-line Constraint
Laser vision sensors (LVS) are critical perception modules for industrial
robots, facilitating real-time acquisition of workpiece geometric data in
welding applications. However, the camera communication delay will lead to a
temporal desynchronization between captured images and the robot motions.
Additionally, hand-eye extrinsic parameters may vary during prolonged
measurement. To address these issues, we introduce a measurement model of LVS
considering the effect of the camera's time-offset and propose a teaching-free
spatiotemporal calibration method utilizing line constraints. This method
involves a robot equipped with an LVS repeatedly scanning straight-line fillet
welds using S-shaped trajectories. Regardless of the robot's orientation
changes, all measured welding positions are constrained to a straight-line,
represented by Plücker coordinates. Moreover, a nonlinear optimization model
based on straight-line constraints is established. Subsequently, the
Levenberg-Marquardt algorithm (LMA) is employed to optimize parameters,
including time-offset, hand-eye extrinsic parameters, and straight-line
parameters. The feasibility and accuracy of the proposed approach are
quantitatively validated through experiments on curved weld scanning. We
open-sourced the code, dataset, and simulation report at
https://anonymous.4open.science/r/LVS_ST_CALIB-015F/README.md.
comment: Submitted to IEEE RAL
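The straight-line-constrained refinement above reduces to a nonlinear least-squares problem. The sketch below jointly optimizes a camera time offset and a 3-D line (point plus unit direction) so that measured weld points mapped into the base frame lie on that line; hand-eye extrinsics, rotations, and the full Plücker algebra are omitted for brevity, and the toy log data is invented.

```python
# Hedged sketch: Levenberg-Marquardt fit of time offset + line parameters.
import numpy as np
from scipy.optimize import least_squares

def pose_at(t, stamps, positions):
    """Linearly interpolate the logged robot TCP position at time t."""
    return np.array([np.interp(t, stamps, positions[:, i]) for i in range(3)])

def residuals(params, t_img, pts_cam, stamps, positions):
    # params: [time offset dt, line point p0 (3), line direction d (3)]
    dt, p0, d = params[0], params[1:4], params[4:7]
    d = d / np.linalg.norm(d)
    res = []
    for t, p in zip(t_img, pts_cam):
        base_pt = pose_at(t + dt, stamps, positions) + p   # simplified hand-eye
        v = base_pt - p0
        res.extend(v - np.dot(v, d) * d)   # component orthogonal to the line
    return np.asarray(res)

# toy data; in practice S-shaped scans with varying speed/orientation make
# the time offset observable
stamps = np.linspace(0, 10, 200)
positions = np.outer(stamps, [0.1, 0.02, 0.0])   # straight TCP sweep
t_img = np.linspace(0.5, 9.5, 40)
pts_cam = np.tile([0.0, 0.0, 0.3], (40, 1))
x0 = np.concatenate([[0.0], [0.0, 0.0, 0.3], [1.0, 0.2, 0.0]])
sol = least_squares(residuals, x0, method="lm",
                    args=(t_img, pts_cam, stamps, positions))
print("estimated time offset:", sol.x[0])
```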
★ Spotting the Unfriendly Robot - Towards better Metrics for Interactions ICRA
Establishing standardized metrics for Social Robot Navigation (SRN)
algorithms for assessing the quality and social compliance of robot behavior
around humans is essential for SRN research. Currently, commonly used
evaluation metrics lack the ability to quantify how cooperative an agent
behaves in interaction with humans. Concretely, in a simple frontal approach
scenario, no metric specifically captures whether both agents cooperate or one
agent stays on a collision course and the other is forced to evade. To
address this limitation, we propose two new metrics, a conflict intensity
metric and the responsibility metric. Together, these metrics are capable of
evaluating the quality of human-robot interactions by showing how much a given
algorithm has contributed to reducing a conflict and which agent actually took
responsibility for the resolution. This work aims to contribute to the
development of a comprehensive and standardized evaluation methodology for SRN,
ultimately enhancing the safety, efficiency, and social acceptance of robots in
human-centric environments.
comment: Presented at 2025 IEEE Conference on Robotics and Automation (ICRA)
Workshop: Advances in Social Navigation: Planning, HRI and Beyond
★ Responsibility and Engagement - Evaluating Interactions in Social Robot Navigation ICRA
In Social Robot Navigation (SRN), the availability of meaningful metrics is
crucial for evaluating trajectories from human-robot interactions. In the SRN
context, such interactions often relate to resolving conflicts between two or
more agents. Correspondingly, the shares to which agents contribute to the
resolution of such conflicts are important. This paper builds on recent work,
which proposed a Responsibility metric capturing such shares. We extend this
framework in two directions: First, we model the conflict buildup phase by
introducing a time normalization. Second, we propose the related Engagement
metric, which captures how the agents' actions intensify a conflict. In a
comprehensive series of simulated scenarios with dyadic, group and crowd
interactions, we show that the metrics carry meaningful information about the
cooperative resolution of conflicts in interactions. They can be used to assess
behavior quality and foresightedness. We extensively discuss applicability,
design choices and limitations of the proposed metrics.
comment: under review for 2026 IEEE International Conference on Robotics &
Automation (ICRA)
★ Towards Context-Aware Human-like Pointing Gestures with RL Motion Imitation
Pointing is a key mode of interaction with robots, yet most prior work has
focused on recognition rather than generation. We present a motion capture
dataset of human pointing gestures covering diverse styles, handedness, and
spatial targets. Using reinforcement learning with motion imitation, we train
policies that reproduce human-like pointing while maximizing precision. Results
show our approach enables context-aware pointing behaviors in simulation,
balancing task performance with natural dynamics.
comment: Presented at the Context-Awareness in HRI (CONAWA) Workshop, ACM/IEEE
International Conference on Human-Robot Interaction (HRI 2022), March 7, 2022
★ GRATE: a Graph transformer-based deep Reinforcement learning Approach for Time-efficient autonomous robot Exploration
Autonomous robot exploration (ARE) is the process of a robot autonomously
navigating and mapping an unknown environment. Recent Reinforcement Learning
(RL)-based approaches typically formulate ARE as a sequential decision-making
problem defined on a collision-free informative graph. However, these methods
often demonstrate limited reasoning ability over graph-structured data.
Moreover, due to the insufficient consideration of robot motion, the resulting
RL policies are generally optimized to minimize travel distance, while
neglecting time efficiency. To overcome these limitations, we propose GRATE, a
Deep Reinforcement Learning (DRL)-based approach that leverages a Graph
Transformer to effectively capture both local structure patterns and global
contextual dependencies of the informative graph, thereby enhancing the model's
reasoning capability across the entire environment. In addition, we deploy a
Kalman filter to smooth the waypoint outputs, ensuring that the resulting path
is kinodynamically feasible for the robot to follow. Experimental results
demonstrate that our method exhibits better exploration efficiency (up to 21.5%
in distance and 21.3% in time to complete exploration) than state-of-the-art
conventional and learning-based baselines in various simulation benchmarks. We
also validate our planner in real-world scenarios.
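The waypoint-smoothing step mentioned above is a textbook application of a Kalman filter; the constant-velocity model and noise covariances below are illustrative assumptions standing in for whatever motion model GRATE actually uses.

```python
# Sketch of constant-velocity Kalman filtering over 2-D policy waypoints.
import numpy as np

DT = 1.0
F = np.block([[np.eye(2), DT * np.eye(2)], [np.zeros((2, 2)), np.eye(2)]])
H = np.hstack([np.eye(2), np.zeros((2, 2))])
Q, R = 0.01 * np.eye(4), 0.25 * np.eye(2)    # process / measurement noise

def kf_smooth(waypoints):
    x, P, out = np.zeros(4), np.eye(4), []
    x[:2] = waypoints[0]
    for z in waypoints:
        x, P = F @ x, F @ P @ F.T + Q                  # predict
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
        x = x + K @ (z - H @ x)                        # update
        P = (np.eye(4) - K @ H) @ P
        out.append(x[:2].copy())
    return np.array(out)

smoothed = kf_smooth(np.random.rand(10, 2) * 5)        # noisy RL waypoints
```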
★ Contrastive Representation Learning for Robust Sim-to-Real Transfer of Adaptive Humanoid Locomotion
Reinforcement learning has produced remarkable advances in humanoid
locomotion, yet a fundamental dilemma persists for real-world deployment:
policies must choose between the robustness of reactive proprioceptive control
or the proactivity of complex, fragile perception-driven systems. This paper
resolves this dilemma by introducing a paradigm that imbues a purely
proprioceptive policy with proactive capabilities, achieving the foresight of
perception without its deployment-time costs. Our core contribution is a
contrastive learning framework that compels the actor's latent state to encode
privileged environmental information from simulation. Crucially, this
"distilled awareness" empowers an adaptive gait clock, allowing the policy to
proactively adjust its rhythm based on an inferred understanding of the
terrain. This synergy resolves the classic trade-off between rigid, clocked
gaits and unstable clock-free policies. We validate our approach with zero-shot
sim-to-real transfer to a full-sized humanoid, demonstrating highly robust
locomotion over challenging terrains, including 30 cm high steps and 26.5°
slopes, proving the effectiveness of our method. Website:
https://lu-yidan.github.io/cra-loco.
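The contrastive framework above can be read as aligning two encodings of the same timestep: the actor's proprioceptive latent and a privileged terrain encoding available only in simulation. A minimal InfoNCE sketch of that alignment follows; dimensions, temperature, and the batch-as-negatives scheme are assumptions, not the paper's exact objective.

```python
# Sketch of InfoNCE alignment between actor latent and privileged encoding.
import torch
import torch.nn.functional as F

def info_nce(z_actor, z_priv, tau=0.1):
    """z_actor, z_priv: (B, D); row i of each is the same timestep."""
    za = F.normalize(z_actor, dim=-1)
    zp = F.normalize(z_priv, dim=-1)
    logits = za @ zp.t() / tau                  # (B, B) similarity matrix
    labels = torch.arange(za.size(0))           # positives on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(32, 64), torch.randn(32, 64))
```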
★ A Novel Skill Modeling Approach: Integrating Vergnaud's Scheme with Cognitive Architectures
Human-machine interaction is increasingly important in industry, and this
trend will only intensify with the rise of Industry 5.0. Human operators have
skills that need to be adapted when using machines to achieve the best results.
It is crucial to highlight the operator's skills and understand how they use
and adapt them [18]. A rigorous description of these skills is necessary to
compare performance with and without robot assistance. Predicate logic, used by
Vergnaud within Piaget's scheme concept, offers a promising approach. However,
this theory doesn't account for cognitive system constraints, such as the
timing of actions, the limitation of cognitive resources, the parallelization
of tasks, or the activation of automatic gestures contrary to optimal
knowledge. Integrating these constraints is essential for representing agent
skills and understanding skill transfer between biological and mechanical
structures. Cognitive architecture models [2] address these needs by
describing cognitive structure and can be combined with the scheme for mutual
benefit. Welding provides a relevant case study, as it highlights the
challenges faced by operators, even highly skilled ones. Welding's complexity
stems from the need for constant skill adaptation to variable parameters like
part position and process. This adaptation is crucial, as weld quality, a key
factor, is only assessed afterward via destructive testing. Thus, the welder is
confronted with a complex perception-decision-action cycle, where the
evaluation of the impact of his actions is delayed and where errors are
definitive. This dynamic underscores the importance of understanding and
modeling the skills of operators.
★ Unleashing the Power of Discrete-Time State Representation: Ultrafast Target-based IMU-Camera Spatial-Temporal Calibration
Visual-inertial fusion is crucial for a wide range of intelligent and
autonomous applications, such as robot navigation and augmented reality. To
bootstrap and achieve optimal state estimation, the spatial-temporal
displacements between IMU and cameras must be calibrated in advance. Most
existing calibration methods adopt continuous-time state representation, more
specifically the B-spline. Despite these methods achieve precise
spatial-temporal calibration, they suffer from high computational cost caused
by continuous-time state representation. To this end, we propose a novel and
extremely efficient calibration method that unleashes the power of
discrete-time state representation. Moreover, the weakness of discrete-time
state representation in temporal calibration is tackled in this paper. With the
increasing production of drones, cellphones and other visual-inertial
platforms, if one million devices need calibration around the world, saving one
minute for the calibration of each device means saving 2083 work days in total.
To benefit both the research and industry communities, our code will be
open-source.
★ Multi-Robot Task Planning for Multi-Object Retrieval Tasks with Distributed On-Site Knowledge via Large Language Models
Kento Murata, Shoichi Hasegawa, Tomochika Ishikawa, Yoshinobu Hagiwara, Akira Taniguchi, Lotfi El Hafi, Tadahiro Taniguchi
It is crucial to efficiently execute instructions such as "Find an apple and
a banana" or "Get ready for a field trip," which require searching for multiple
objects or understanding context-dependent commands. This study addresses the
challenging problem of determining which robot should be assigned to which part
of a task when each robot possesses different situational on-site
knowledge -- specifically, spatial concepts learned from the area designated to it
by the user. We propose a task planning framework that leverages large language
models (LLMs) and spatial concepts to decompose natural language instructions
into subtasks and allocate them to multiple robots. We designed a novel
few-shot prompting strategy that enables LLMs to infer required objects from
ambiguous commands and decompose them into appropriate subtasks. In our
experiments, the proposed method achieved 47/50 successful assignments,
outperforming random (28/50) and commonsense-based assignment (26/50).
Furthermore, we conducted qualitative evaluations using two actual mobile
manipulators. The results demonstrated that our framework could handle
instructions, including those involving ad hoc categories such as "Get ready
for a field trip," by successfully performing task decomposition, assignment,
sequential planning, and execution.
comment: Submitted to AROB-ISBC 2026 (Journal Track option)
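The few-shot prompting strategy above lends itself to a simple template: demonstrations pair ambiguous instructions with per-robot area knowledge and a decomposed plan, and a new instruction is appended for completion. The prompt wording, robot names, and areas below are invented for illustration; no specific LLM API is assumed.

```python
# Hedged sketch of few-shot prompting for subtask decomposition/assignment.
FEW_SHOT = """\
Instruction: Find an apple and a banana.
Robots: robot1 knows kitchen; robot2 knows living room.
Plan: robot1 -> fetch apple (kitchen); robot2 -> fetch banana (living room).

Instruction: Get ready for a field trip.
Robots: robot1 knows kitchen; robot2 knows entryway.
Plan: robot1 -> fetch water bottle, lunch box (kitchen); \
robot2 -> fetch backpack, shoes (entryway).
"""

def build_prompt(instruction, robot_knowledge):
    robots = "; ".join(f"{r} knows {a}" for r, a in robot_knowledge.items())
    return f"{FEW_SHOT}\nInstruction: {instruction}\nRobots: {robots}\nPlan:"

prompt = build_prompt("Prepare the table for dinner",
                      {"robot1": "kitchen", "robot2": "dining room"})
# send `prompt` to an LLM completion API and parse the returned plan
```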
★ Bridging Perception and Planning: Towards End-to-End Planning for Signal Temporal Logic Tasks
We investigate the task and motion planning problem for Signal Temporal Logic
(STL) specifications in robotics. Existing STL methods rely on pre-defined maps
or mobility representations, which are ineffective in unstructured real-world
environments. We propose the \emph{Structured-MoE STL Planner}
(\textbf{S-MSP}), a differentiable framework that maps synchronized multi-view
camera observations and an STL specification directly to a feasible trajectory.
S-MSP integrates STL constraints within a unified pipeline, trained with a
composite loss that combines trajectory reconstruction and STL robustness. A
structure-aware Mixture-of-Experts (MoE) model enables horizon-aware
specialization by projecting sub-tasks into temporally anchored embeddings. We
evaluate S-MSP using a high-fidelity simulation of factory-logistics scenarios
with temporally constrained tasks. Experiments show that S-MSP outperforms
single-expert baselines in STL satisfaction and trajectory feasibility. A
rule-based safety filter at inference improves physical executability
without compromising logical correctness, showcasing the practicality of the
approach.
★ Integrating Trajectory Optimization and Reinforcement Learning for Quadrupedal Jumping with Terrain-Adaptive Landing IROS 2025
Jumping constitutes an essential component of quadruped robots' locomotion
capabilities, which includes dynamic take-off and adaptive landing. Existing
quadrupedal jumping studies mainly focused on the stance and flight phase by
assuming a flat landing ground, which is impractical in many real world cases.
This work proposes a safe landing framework that achieves adaptive landing on
rough terrains by combining Trajectory Optimization (TO) and Reinforcement
Learning (RL) together. The RL agent learns to track the reference motion
generated by TO in the environments with rough terrains. To enable the learning
of compliant landing skills on challenging terrains, a reward relaxation
strategy is synthesized to encourage exploration during the landing recovery
period. Extensive experiments validate the accurate tracking and safe landing
skills benefiting from our proposed method in various scenarios.
comment: Accepted by IROS 2025
★ Toward Ownership Understanding of Objects: Active Question Generation with Large Language Model and Probabilistic Generative Model
Saki Hashimoto, Shoichi Hasegawa, Tomochika Ishikawa, Akira Taniguchi, Yoshinobu Hagiwara, Lotfi El Hafi, Tadahiro Taniguchi
Robots operating in domestic and office environments must understand object
ownership to correctly execute instructions such as "Bring me my cup."
However, ownership cannot be reliably inferred from visual features alone. To
address this gap, we propose Active Ownership Learning (ActOwL), a framework
that enables robots to actively generate and ask ownership-related questions to
users. ActOwL employs a probabilistic generative model to select questions that
maximize information gain, thereby acquiring ownership knowledge
efficiently. Additionally, by leveraging commonsense knowledge
from Large Language Models (LLM), objects are pre-classified as either shared
or owned, and only owned objects are targeted for questioning. Through
experiments in a simulated home environment and a real-world laboratory
setting, ActOwL achieved significantly higher ownership clustering accuracy
with fewer questions than baseline methods. These findings demonstrate the
effectiveness of combining active inference with LLM-guided commonsense
reasoning, advancing the capability of robots to acquire ownership knowledge
for practical and socially appropriate task execution.
comment: Submitted to AROB-ISBC 2026 (Journal Track option)
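The question-selection criterion above can be sketched with a simple proxy: when an answer fully resolves an object's ownership, the expected information gain of asking about it equals the entropy of its current ownership posterior, so the agent asks about the most uncertain LLM-flagged "owned" object. The data structures below are illustrative, not ActOwL's probabilistic generative model.

```python
# Hedged sketch of entropy-based active question selection.
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum()

def next_question(posteriors, owned_mask):
    """posteriors: dict obj -> ownership distribution over users;
    owned_mask: LLM pre-classification, True if the object is 'owned'."""
    scores = {o: entropy(p) for o, p in posteriors.items() if owned_mask[o]}
    return max(scores, key=scores.get)   # most uncertain owned object

posteriors = {"cup": np.array([0.5, 0.5]), "stapler": np.array([0.9, 0.1])}
owned = {"cup": True, "stapler": True}
print("ask about:", next_question(posteriors, owned))   # -> cup
```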
★ NavMoE: Hybrid Model- and Learning-based Traversability Estimation for Local Navigation via Mixture of Experts
Botao He, Amir Hossein Shahidzadeh, Yu Chen, Jiayi Wu, Tianrui Guan, Guofei Chen, Howie Choset, Dinesh Manocha, Glen Chou, Cornelia Fermuller, Yiannis Aloimonos
This paper explores traversability estimation for robot navigation. A key
bottleneck in traversability estimation lies in efficiently achieving reliable
and robust predictions while accurately encoding both geometric and semantic
information across diverse environments. We introduce Navigation via Mixture of
Experts (NAVMOE), a hierarchical and modular approach for traversability
estimation and local navigation. NAVMOE combines multiple specialized models,
each of which can be either a classical model-based or a learning-based
approach that predicts traversability for a specific terrain type. NAVMOE
dynamically weights the contributions of different models based
on the input environment through a gating network. Overall, our approach offers
three advantages: First, NAVMOE enables traversability estimation to adaptively
leverage specialized approaches for different terrains, which enhances
generalization across diverse and unseen environments. Second, our approach
significantly improves efficiency with negligible cost of solution quality by
introducing a training-free lazy gating mechanism, which is designed to
minimize the number of activated experts during inference. Third, our approach
uses a two-stage training strategy that enables the training for the gating
networks within the hybrid MoE method that contains nondifferentiable modules.
Extensive experiments show that NAVMOE delivers a better efficiency and
performance balance than any individual expert or full ensemble across
different domains, improving cross- domain generalization and reducing average
computational cost by 81.2% via lazy gating, with less than a 2% loss in path
quality.
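The lazy gating mechanism above admits a compact sketch: score all experts with the gating network, then evaluate them greedily in order of weight until a cumulative-mass threshold is reached, so easy inputs activate few experts. The architecture, threshold, and expert heads below are illustrative assumptions.

```python
# Hedged sketch of training-free lazy gating over a mixture of experts.
import torch
import torch.nn as nn

class LazyMoE(nn.Module):
    def __init__(self, in_dim, n_experts, thresh=0.9):
        super().__init__()
        self.gate = nn.Linear(in_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 1))
            for _ in range(n_experts))
        self.thresh = thresh

    def forward(self, x):                      # x: (in_dim,)
        w = torch.softmax(self.gate(x), dim=-1)
        order = torch.argsort(w, descending=True)
        out, used = 0.0, 0.0
        for i in order:                        # lazily evaluate experts
            i = int(i)
            out = out + w[i] * self.experts[i](x)
            used += w[i].item()
            if used >= self.thresh:            # stop once enough mass covered
                break
        return out / used                      # renormalize the mixture

model = LazyMoE(in_dim=16, n_experts=4)
traversability = model(torch.randn(16))
```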
★ Force-Modulated Visual Policy for Robot-Assisted Dressing with Arm Motions
Robot-assisted dressing has the potential to significantly improve the lives
of individuals with mobility impairments. To ensure an effective and
comfortable dressing experience, the robot must be able to handle challenging
deformable garments, apply appropriate forces, and adapt to limb movements
throughout the dressing process. Prior work often makes simplifying assumptions
-- such as static human limbs during dressing -- which limits real-world
applicability. In this work, we develop a robot-assisted dressing system
capable of handling partial observations with visual occlusions, as well as
robustly adapting to arm motions during the dressing process. Given a policy
trained in simulation with partial observations, we propose a method to
fine-tune it in the real world using a small amount of data and multi-modal
feedback from vision and force sensing, to further improve the policy's
adaptability to arm motions and enhance safety. We evaluate our method in
simulation with simplified articulated human meshes and in a real world human
study with 12 participants across 264 dressing trials. Our policy successfully
dresses two long-sleeve everyday garments onto the participants while being
adaptive to various kinds of arm motions, and greatly outperforms prior
baselines in terms of task completion and user feedback. Videos are available at
https://dressing-motion.github.io/.
comment: CoRL 2025
★ Deep Generative and Discriminative Digital Twin endowed with Variational Autoencoder for Unsupervised Predictive Thermal Condition Monitoring of Physical Robots in Industry 6.0 and Society 6.0
Robots are unrelentingly used to achieve operational efficiency in Industry
4.0 along with symbiotic and sustainable assistance for the work-force in
Industry 5.0. As resilience, robustness, and well-being are required in
anti-fragile manufacturing and human-centric societal tasks, an autonomous
anticipation and adaption to thermal saturation and burns due to motors
overheating become instrumental for human safety and robot availability. Robots
are thereby expected to self-sustain their performance and deliver user
experience, in addition to communicating their capability to other agents in
advance to ensure fully automated thermally feasible tasks, and prolong their
lifetime without human intervention. However, the traditional robot shutdown,
when facing an imminent thermal saturation, inhibits productivity in factories
and comfort in the society, while cooling strategies are hard to implement
after the robot acquisition. In this work, smart digital twins endowed with
generative AI, i.e., variational autoencoders, are leveraged to manage
thermally anomalous robot states and generate uncritical ones. The notion of thermal
difficulty is derived from the reconstruction error of variational
autoencoders. A robot can use this score to predict, anticipate, and share the
thermal feasibility of desired motion profiles to meet requirements from
emerging applications in Industry 6.0 and Society 6.0.
comment: © 2025 the authors. This work has been accepted to the 10th IFAC
Symposium on Mechatronic Systems & 14th IFAC Symposium on Robotics, July
15-18, 2025, Paris, France, for publication under a Creative Commons Licence
CC-BY-NC-ND
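The "thermal difficulty" notion above is just a VAE reconstruction error computed on temperature windows: a model trained only on uncritical profiles reconstructs anomalous ones poorly. Window size, latent size, and the MSE score below are illustrative assumptions.

```python
# Hedged sketch of a VAE-based thermal difficulty score.
import torch
import torch.nn as nn

class ThermalVAE(nn.Module):
    def __init__(self, dim=50, latent=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 64), nn.ReLU())
        self.mu, self.logvar = nn.Linear(64, latent), nn.Linear(64, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(),
                                 nn.Linear(64, dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def thermal_difficulty(vae, window):
    """Reconstruction error of a temperature window; higher = harder."""
    recon, _, _ = vae(window)
    return (recon - window).pow(2).mean().item()

score = thermal_difficulty(ThermalVAE(), torch.randn(1, 50))
```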
★ Deep Learning for Model-Free Prediction of Thermal States of Robot Joint Motors
In this work, deep neural networks made up of multiple hidden Long Short-Term
Memory (LSTM) and Feedforward layers are trained to predict the thermal
behavior of the joint motors of robot manipulators. A model-free and scalable
approach is adopted. It accommodates complexity and uncertainty challenges
stemming from the derivation, identification, and validation of a large number
of parameters of an approximation model that is hardly available. To this end,
sensed joint torques are collected and processed to foresee the thermal
behavior of joint motors. Promising prediction results of the machine learning
based capture of the temperature dynamics of joint motors of a redundant robot
with seven joints are presented.
comment: © 2025 the authors. This work has been accepted to the 10th IFAC
Symposium on Mechatronic Systems & 14th IFAC Symposium on Robotics, July
15-18, 2025, Paris, France, for publication under a Creative Commons Licence
CC-BY-NC-ND
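The model-free predictor described above maps a window of sensed joint torques to motor temperatures with stacked LSTM and feedforward layers; the sketch below shows that shape of architecture, with layer sizes, window length, and the seven-joint setup as illustrative assumptions.

```python
# Hedged sketch of an LSTM + feedforward thermal predictor.
import torch
import torch.nn as nn

class JointThermalNet(nn.Module):
    def __init__(self, n_joints=7, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_joints, hidden, num_layers=2, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 32), nn.ReLU(),
                                  nn.Linear(32, n_joints))

    def forward(self, torques):               # (B, T, n_joints) torque windows
        out, _ = self.lstm(torques)
        return self.head(out[:, -1])          # predicted temperatures, (B, n_joints)

model = JointThermalNet()
pred_temp = model(torch.randn(4, 100, 7))     # 100-step torque windows
```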
★ NAMOUnc: Navigation Among Movable Obstacles with Decision Making on Uncertainty Interval
Navigation among movable obstacles (NAMO) is a critical task in robotics,
often challenged by real-world uncertainties such as observation noise, model
approximations, action failures, and partial observability. Existing solutions
frequently assume ideal conditions, leading to suboptimal or risky decisions.
This paper introduces NAMOUnc, a novel framework designed to address these
uncertainties by integrating them into the decision-making process. We first
estimate these uncertainties, then compare the corresponding time-cost intervals
for removing and bypassing obstacles, optimizing both success rate and time
efficiency to ensure safer and more efficient navigation. We validate our method through
extensive simulations and real-world experiments, demonstrating significant
improvements over existing NAMO frameworks. More details can be found in our
website: https://kai-zhang-er.github.io/namo-uncertainty/
comment: 11 pages, ICINCO2025
★ AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models
Heng Zhang, Haichuan Hu, Yaomin Shen, Weihao Yu, Yilei Yuan, Haochen You, Guo Cheng, Zijian Zhang, Lubin Gan, Huihui Wei, Hao Zhang, Jin Huang
Large Vision-Language Models (LVLMs) have demonstrated impressive performance
on multimodal tasks through scaled architectures and extensive training.
However, existing Mixture of Experts (MoE) approaches face challenges due to
the asymmetry between visual and linguistic processing. Visual information is
spatially complete, while language requires maintaining sequential context. As
a result, MoE models struggle to balance modality-specific features and
cross-modal interactions. Through systematic analysis, we observe that language
experts in deeper layers progressively lose contextual grounding and rely more
on parametric knowledge rather than utilizing the provided visual and
linguistic information. To address this, we propose AsyMoE, a novel
architecture that models this asymmetry using three specialized expert groups.
We design intra-modality experts for modality-specific processing, hyperbolic
inter-modality experts for hierarchical cross-modal interactions, and
evidence-priority language experts to suppress parametric biases and maintain
contextual grounding. Extensive experiments demonstrate that AsyMoE achieves
26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific
MoE respectively, with 25.45% fewer activated parameters than dense models.
★ MoiréTac: A Dual-Mode Visuotactile Sensor for Multidimensional Perception Using Moiré Pattern Amplification
Visuotactile sensors typically employ sparse marker arrays that limit spatial
resolution and lack clear analytical force-to-image relationships. To solve
this problem, we present MoiréTac, a dual-mode sensor that generates
dense interference patterns via overlapping micro-gratings within a transparent
architecture. When two gratings overlap with misalignment, they create moiré
patterns that amplify microscopic deformations. The design preserves optical
clarity for vision tasks while producing continuous moiré fields for tactile
sensing, enabling simultaneous 6-axis force/torque measurement, contact
localization, and visual perception. We combine physics-based features
(brightness, phase gradient, orientation, and period) from moiré patterns
with deep spatial features. These are mapped to 6-axis force/torque
measurements, enabling interpretable regression through end-to-end learning.
Experimental results demonstrate three capabilities: force/torque measurement
with R^2 > 0.98 across tested axes; sensitivity tuning through geometric
parameters (threefold gain adjustment); and vision functionality for object
classification despite moiré overlay. Finally, we integrate the sensor into a
robotic arm for cap removal with coordinated force and torque control,
validating its potential for dexterous manipulation.
★ UDON: Uncertainty-weighted Distributed Optimization for Multi-Robot Neural Implicit Mapping under Extreme Communication Constraints
Multi-robot mapping with neural implicit representations enables the compact
reconstruction of complex environments. However, it demands robustness against
communication challenges like packet loss and limited bandwidth. While prior
works have introduced various mechanisms to mitigate communication disruptions,
performance degradation still occurs under extremely low communication success
rates. This paper presents UDON, a real-time multi-agent neural implicit
mapping framework that introduces a novel uncertainty-weighted distributed
optimization to achieve high-quality mapping under severe communication
deterioration. The uncertainty weighting prioritizes more reliable portions of
the map, while the distributed optimization isolates and penalizes mapping
disagreement between individual pairs of communicating agents. We conduct
extensive experiments on standard benchmark datasets and real-world robot
hardware. We demonstrate that UDON significantly outperforms existing
baselines, maintaining high-fidelity reconstructions and consistent scene
representations even under extreme communication degradation (as low as 1%
success rate).
★ Safety filtering of robotic manipulation under environment uncertainty: a computational approach
Robotic manipulation in dynamic and unstructured environments requires safety
mechanisms that exploit what is known and what is uncertain about the world.
Existing safety filters often assume full observability, limiting their
applicability in real-world tasks. We propose a physics-based safety filtering
scheme that leverages high-fidelity simulation to assess control policies under
uncertainty in world parameters. The method combines dense rollout with nominal
parameters and parallelizable sparse re-evaluation at critical
state-transitions, quantified through generalized factors of safety for stable
grasping and actuator limits, and targeted uncertainty reduction through
probing actions. We demonstrate the approach in a simulated bimanual
manipulation task with uncertain object mass and friction, showing that unsafe
trajectories can be identified and filtered efficiently. Our results highlight
physics-based sparse safety evaluation as a scalable strategy for safe robotic
manipulation under uncertainty.
comment: 8 pages, 8 figures
★ PerchMobi^3: A Multi-Modal Robot with Power-Reuse Quad-Fan Mechanism for Air-Ground-Wall Locomotion
Achieving seamless integration of aerial flight, ground driving, and wall
climbing within a single robotic platform remains a major challenge, as
existing designs often rely on additional adhesion actuators that increase
complexity, reduce efficiency, and compromise reliability. To address these
limitations, we present PerchMobi^3, a quad-fan, negative-pressure,
air-ground-wall robot that implements a propulsion-adhesion power-reuse
mechanism. By repurposing four ducted fans to simultaneously provide aerial
thrust and negative-pressure adhesion, and integrating them with four actively
driven wheels, PerchMobi^3 eliminates dedicated pumps while maintaining a
lightweight and compact design. To the best of our knowledge, this is the first
quad-fan prototype to demonstrate functional power reuse for multi-modal
locomotion. A modeling and control framework enables coordinated operation
across ground, wall, and aerial domains with fan-assisted transitions. The
feasibility of the design is validated through a comprehensive set of
experiments covering ground driving, payload-assisted wall climbing, aerial
flight, and cross-mode transitions, demonstrating robust adaptability across
locomotion scenarios. These results highlight the potential of PerchMobi^3 as a
novel design paradigm for multi-modal robotic mobility, paving the way for
future extensions toward autonomous and application-oriented deployment.
comment: 7 pages, 8 figures. This work has been submitted to the IEEE for
possible publication
★ ActiveVLN: Towards Active Exploration via Multi-Turn RL in Vision-and-Language Navigation
The Vision-and-Language Navigation (VLN) task requires an agent to follow
natural language instructions and navigate through complex environments.
Existing MLLM-based VLN methods primarily rely on imitation learning (IL) and
often use DAgger for post-training to mitigate covariate shift. While
effective, these approaches incur substantial data collection and training
costs. Reinforcement learning (RL) offers a promising alternative. However,
prior VLN RL methods lack dynamic interaction with the environment and depend
on expert trajectories for reward shaping, rather than engaging in open-ended
active exploration. This restricts the agent's ability to discover diverse and
plausible navigation routes. To address these limitations, we propose
ActiveVLN, a VLN framework that explicitly enables active exploration through
multi-turn RL. In the first stage, a small fraction of expert trajectories is
used for IL to bootstrap the agent. In the second stage, the agent iteratively
predicts and executes actions, automatically collects diverse trajectories, and
optimizes multiple rollouts via the GRPO objective. To further improve RL
efficiency, we introduce a dynamic early-stopping strategy to prune long-tail
or likely failed trajectories, along with additional engineering optimizations.
Experiments show that ActiveVLN achieves the largest performance gains over IL
baselines compared to both DAgger-based and prior RL-based post-training
methods, while reaching competitive performance with state-of-the-art
approaches despite using a smaller model. Code and data will be released soon.
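For readers unfamiliar with GRPO, the core of the objective is a group-relative advantage computed over multiple rollouts of the same task, and the paper additionally prunes unpromising rollouts early; the sketch below illustrates both pieces with hypothetical thresholds, not the authors' exact criteria.

```python
import torch

def grpo_advantages(rewards):
    """Group-relative advantages, the core of the GRPO objective: each
    rollout's reward is normalized against its own group's statistics."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

def keep_rollout(steps, progress, max_steps=80, min_progress=0.1):
    """Dynamic early stopping: prune long-tail or likely-failed rollouts
    before they consume the full interaction budget (thresholds made up)."""
    return steps < max_steps and progress > min_progress

print(grpo_advantages([1.0, 0.0, 0.5, 0.0]))   # the successful rollout stands out
print(keep_rollout(steps=40, progress=0.3))    # True: keep exploring
print(keep_rollout(steps=120, progress=0.02))  # False: prune
```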
★ The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning
Titong Jiang, Xuefeng Jiang, Yuan Ma, Xin Wen, Bailin Li, Kun Zhan, Peng Jia, Yahui Liu, Sheng Sun, Xianpeng Lang
We present LightVLA, a simple yet effective differentiable token pruning
framework for vision-language-action (VLA) models. While VLA models have shown
impressive capability in executing real-world robotic tasks, their deployment
on resource-constrained platforms is often bottlenecked by the heavy
attention-based computation over large sets of visual tokens. LightVLA
addresses this challenge through adaptive, performance-driven pruning of visual
tokens: It generates dynamic queries to evaluate visual token importance, and
adopts Gumbel softmax to enable differentiable token selection. Through
fine-tuning, LightVLA learns to preserve the most informative visual tokens
while pruning tokens which do not contribute to task execution, thereby
improving efficiency and performance simultaneously. Notably, LightVLA requires
no heuristic magic numbers and introduces no additional trainable parameters,
making it compatible with modern inference frameworks. Experimental results
demonstrate that LightVLA outperforms different VLA models and existing token
pruning methods across diverse tasks on the LIBERO benchmark, achieving higher
success rates with substantially reduced computational overhead. Specifically,
LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.9%
improvement in task success rate. We also investigate LightVLA*, a learnable
query-based token pruning variant with additional trainable parameters, which
likewise achieves satisfactory performance. Our work shows that, in pursuing
optimal task performance, LightVLA spontaneously learns to prune tokens from a
performance-driven perspective. To the best of our knowledge, LightVLA is the
first work to apply adaptive visual token pruning to VLA tasks with the twin
goals of efficiency and performance, marking a significant step toward more
efficient, powerful, and practical real-time robotic systems.
comment: Under review. Project site:
https://liauto-research.github.io/LightVLA
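The differentiable selection step can be pictured as follows: dynamic queries score visual tokens, and a straight-through Gumbel-softmax turns the scores into a discrete keep-set that still passes gradients. In this sketch the queries are random placeholders (LightVLA generates them dynamically), so it shows the mechanism rather than the authors' model.

```python
import torch
import torch.nn.functional as F

def prune_tokens(tokens, queries, tau=1.0):
    """Differentiable token selection: each query picks one visual token via
    straight-through Gumbel-softmax (discrete forward pass, soft gradients).
    tokens: (N, D) visual tokens; queries: (K, D) with K << N tokens kept."""
    scores = queries @ tokens.t() / tokens.shape[-1] ** 0.5   # (K, N) importance
    onehot = F.gumbel_softmax(scores, tau=tau, hard=True)     # (K, N) one-hot picks
    return onehot @ tokens                                    # (K, D) kept tokens

tokens = torch.randn(256, 64, requires_grad=True)
queries = torch.randn(32, 64)     # placeholder; LightVLA generates these dynamically
kept = prune_tokens(tokens, queries)
kept.sum().backward()             # gradients reach the full token set
print(kept.shape, tokens.grad is not None)
```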
★ Robust Online Residual Refinement via Koopman-Guided Dynamics Modeling
Imitation learning (IL) enables efficient skill acquisition from
demonstrations but often struggles with long-horizon tasks and high-precision
control due to compounding errors. Residual policy learning offers a promising,
model-agnostic solution by refining a base policy through closed-loop
corrections. However, existing approaches primarily focus on local corrections
to the base policy, lacking a global understanding of state evolution, which
limits robustness and generalization to unseen scenarios. To address this, we
propose incorporating global dynamics modeling to guide residual policy
updates. Specifically, we leverage Koopman operator theory to impose linear
time-invariant structure in a learned latent space, enabling reliable state
transitions and improved extrapolation for long-horizon prediction and unseen
environments. We introduce KORR (Koopman-guided Online Residual Refinement), a
simple yet effective framework that conditions residual corrections on
Koopman-predicted latent states, enabling globally informed and stable action
refinement. We evaluate KORR on long-horizon, fine-grained robotic furniture
assembly tasks under various perturbations. Results demonstrate consistent
gains in performance, robustness, and generalization over strong baselines. Our
findings further highlight the potential of Koopman-based modeling to bridge
modern learning methods with classical control theory.
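The Koopman idea is that a learned encoder lifts states into a latent space where dynamics are approximately linear time-invariant, so long-horizon prediction reduces to repeated application of one matrix. A minimal sketch, with dimensions chosen arbitrarily rather than taken from KORR:

```python
import torch
import torch.nn as nn

class KoopmanLatentModel(nn.Module):
    """An encoder lifts states into a latent space where transitions are
    modeled as linear time-invariant; long-horizon prediction is then just
    repeated application of the learned Koopman operator K."""
    def __init__(self, state_dim=14, latent_dim=32):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                    nn.Linear(64, latent_dim))
        self.K = nn.Linear(latent_dim, latent_dim, bias=False)  # Koopman operator

    def forward(self, state, horizon=1):
        z = self.encode(state)
        preds = []
        for _ in range(horizon):       # z_{t+h} = K^h z_t
            z = self.K(z)
            preds.append(z)
        return torch.stack(preds)      # (horizon, B, latent_dim)

model = KoopmanLatentModel()
z_future = model(torch.randn(8, 14), horizon=10)
print(z_future.shape)   # residual corrections would be conditioned on these latents
```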
★ Pre-trained Visual Representations Generalize Where it Matters in Model-Based Reinforcement Learning
In visuomotor policy learning, the control policy for the robotic agent is
derived directly from visual inputs. The typical approach, where a policy and
vision encoder are trained jointly from scratch, generalizes poorly to novel
visual scene changes. Using pre-trained vision models (PVMs) to inform a policy
network improves robustness in model-free reinforcement learning (MFRL). Recent
developments in model-based reinforcement learning (MBRL) suggest that MBRL is
more sample-efficient than MFRL. However, counterintuitively, existing work has
found PVMs to be ineffective in MBRL. Here, we investigate the effectiveness of
PVMs in MBRL, specifically their generalization under visual domain shifts. We show
that, in scenarios with severe shifts, PVMs perform much better than a baseline
model trained from scratch. We further investigate the effects of varying
levels of fine-tuning of PVMs. Our results show that partial fine-tuning can
maintain the highest average task performance under the most extreme
distribution shifts. Our results demonstrate that PVMs are highly successful in
promoting robustness in visual policy learning, providing compelling evidence
for their wider adoption in model-based robotic learning applications.
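Partial fine-tuning can be as simple as freezing a pre-trained encoder except its last stage(s); the sketch below does this for a torchvision ResNet-18 as a stand-in PVM (the paper's encoders and split points may differ).

```python
import torch.nn as nn
from torchvision.models import resnet18

def partially_finetune(num_trainable_stages=1):
    """Freeze a vision encoder except its last stage(s): a middle ground
    between frozen features and full fine-tuning."""
    net = resnet18(weights=None)   # pass pretrained weights for a real PVM
    for p in net.parameters():
        p.requires_grad = False
    stages = [net.layer1, net.layer2, net.layer3, net.layer4]
    for stage in stages[-num_trainable_stages:]:
        for p in stage.parameters():
            p.requires_grad = True         # only late features adapt to the task
    net.fc = nn.Identity()                 # expose features to the world model
    return net

enc = partially_finetune()
n_train = sum(p.numel() for p in enc.parameters() if p.requires_grad)
print(f"trainable parameters: {n_train}")
```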
♻ ★ GBPP: Grasp-Aware Base Placement Prediction for Robots via Two-Stage Learning
GBPP is a fast learning-based scorer that selects a robot base pose for
grasping from a single RGB-D snapshot. The method uses a two-stage curriculum:
(1) a simple distance-visibility rule auto-labels a large dataset at low cost;
and (2) a smaller set of high-fidelity simulation trials refines the model to
match true grasp outcomes. A PointNet++-style point cloud encoder with an MLP
scores dense grids of candidate poses, enabling rapid online selection without
full task-and-motion optimization. In simulation and on a real mobile
manipulator, GBPP outperforms proximity- and geometry-only baselines, choosing
safer and more reachable stances and degrading gracefully when wrong. The
results offer a practical recipe for data-efficient, geometry-aware base
placement: use inexpensive heuristics for coverage, then calibrate with
targeted simulation.
comment: This paper needs major revision
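Stage (1) of the curriculum is easy to picture: label a base pose positive when the target lies within a reach band and the sight line is clear. The sketch below is our own toy version of such a distance-visibility rule, with made-up thresholds and a crude sampled visibility check.

```python
import numpy as np

def auto_label(base_xy, target_xy, obstacles, d_min=0.4, d_max=1.0):
    """Label a candidate base pose positive if the grasp target lies within
    a comfortable reach band AND the line of sight to it is unobstructed.
    Thresholds and the sampled visibility check are illustrative."""
    base, target = np.asarray(base_xy, float), np.asarray(target_xy, float)
    d = np.linalg.norm(target - base)
    if not (d_min <= d <= d_max):
        return 0.0
    t = np.linspace(0.0, 1.0, 20)[:, None]
    line = base + t * (target - base)        # sample points along the sight line
    for obs_xy, radius in obstacles:
        if np.min(np.linalg.norm(line - np.asarray(obs_xy), axis=1)) < radius:
            return 0.0                       # occluded -> negative label
    return 1.0

blocker = [((0.35, 0.0), 0.1)]
print(auto_label((0, 0), (0.7, 0.0), blocker))  # 0.0: obstacle on the sight line
print(auto_label((0, 0), (0.7, 0.4), blocker))  # 1.0: clear view, in reach band
```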
♻ ★ Towards Autonomous In-situ Soil Sampling and Mapping in Large-Scale Agricultural Environments ICRA
Thien Hoang Nguyen, Erik Muller, Michael Rubin, Xiaofei Wang, Fiorella Sibona, Alex McBratney, Salah Sukkarieh
Traditional soil sampling and analysis methods are labor-intensive,
time-consuming, and limited in spatial resolution, making them unsuitable for
large-scale precision agriculture. To address these limitations, we present a
robotic solution for real-time sampling, analysis and mapping of key soil
properties. Our system consists of two main sub-systems: a Sample Acquisition
System (SAS) for precise, automated in-field soil sampling; and a Sample
Analysis Lab (Lab) for real-time soil property analysis. The system's
performance was validated through extensive field trials at a large-scale
Australian farm. Experimental results show that the SAS can consistently
acquire soil samples with a mass of 50 g at a depth of 200 mm, while the Lab can
process each sample within 10 minutes to accurately measure pH and
macronutrients. These results demonstrate the potential of the system to
provide farmers with timely, data-driven insights for more efficient and
sustainable soil management and fertilizer application.
comment: Presented at the 2025 IEEE ICRA Workshop on Field Robotics
♻ ★ FEWT: Improving Humanoid Robot Perception with Frequency-Enhanced Wavelet-based Transformers
Jiaxin Huang, Hanyu Liu, Yunsheng Ma, Jian Shen, Yilin Zheng, Jiayi Wen, Baishu Wan, Pan Li, Zhigong Song
Embodied intelligence bridges the physical world and the information space.
As its typical physical embodiment, humanoid robots have shown great promise
through robot learning algorithms in recent years. In this study, a hardware
platform, comprising a humanoid robot and an exoskeleton-style teleoperation
cabin, was developed to realize intuitive remote manipulation and efficient
collection of anthropomorphic action data. To improve the perception
representation of the humanoid robot, an imitation learning framework, termed Frequency-Enhanced
Wavelet-based Transformer (FEWT), was proposed, which consists of two primary
modules: Frequency-Enhanced Efficient Multi-Scale Attention (FE-EMA) and
Time-Series Discrete Wavelet Transform (TS-DWT). By combining multi-scale
wavelet decomposition with a residual network, FE-EMA can dynamically fuse
cross-spatial and frequency-domain features. This fusion captures feature
information across scales effectively, thereby enhancing model robustness.
Experiments demonstrate that FEWT improves the success rate of a
state-of-the-art baseline (Action Chunking with Transformers, ACT) by up to
30% in simulation and by 6-12% in the real world.
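As a flavor of the TS-DWT module's front end, the sketch below uses the PyWavelets package to decompose a 1-D signal into multi-scale bands and summarize per-band energy; FEWT's learned fusion of these features is not reproduced here.

```python
import numpy as np
import pywt  # PyWavelets

def multiscale_dwt_features(signal, wavelet="db2", levels=3):
    """Decompose a 1-D channel into one coarse approximation band and
    per-level detail bands, then summarize each band's energy. This is only
    the DWT front end; the learned fusion happens downstream."""
    coeffs = pywt.wavedec(signal, wavelet, level=levels)  # [approx, d_L, ..., d_1]
    return np.array([np.mean(c ** 2) for c in coeffs])    # energy per band

t = np.linspace(0, 1, 256)
x = np.sin(2 * np.pi * 3 * t) + 0.2 * np.sin(2 * np.pi * 40 * t)
print(multiscale_dwt_features(x))   # the slow component dominates the coarse band
```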
♻ ★ Multi-objective task allocation for electric harvesting robots: a hierarchical route reconstruction approach
Peng Chen, Jing Liang, Hui Song, Kang-Jia Qiao, Cai-Tong Yue, Kun-Jie Yu, Ponnuthurai Nagaratnam Suganthan, Witold Pedrycz
The increasing labor costs in agriculture have accelerated the adoption of
multi-robot systems for orchard harvesting. However, efficiently coordinating
these systems is challenging due to the complex interplay between makespan and
energy consumption, particularly under practical constraints like
load-dependent speed variations and battery limitations. This paper defines the
multi-objective agricultural multi-electrical-robot task allocation (AMERTA)
problem, which systematically incorporates these often-overlooked real-world
constraints. To address this problem, we propose a hybrid hierarchical route
reconstruction algorithm (HRRA) that integrates several innovative mechanisms,
including a hierarchical encoding structure, a dual-phase initialization
method, task sequence optimizers, and specialized route reconstruction
operators. Extensive experiments on 45 test instances demonstrate HRRA's
superior performance against seven state-of-the-art algorithms. Statistical
analysis, including the Wilcoxon signed-rank and Friedman tests, empirically
validates HRRA's competitiveness and its unique ability to explore previously
inaccessible regions of the solution space. In general, this research
contributes to the theoretical understanding of multi-robot coordination by
offering a novel problem formulation and an effective algorithm, thereby also
providing practical insights for agricultural automation.
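Since the AMERTA problem is bi-objective, candidate allocations are compared by Pareto dominance over (makespan, energy); a minimal sketch with toy numbers (HRRA's encoding and operators are not reproduced here):

```python
import numpy as np

def dominates(a, b):
    """Pareto dominance for joint minimization of (makespan, energy):
    a dominates b if it is no worse in both objectives and better in one."""
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a <= b) and np.any(a < b))

def pareto_front(solutions):
    """Keep only non-dominated candidate task allocations."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]

# (makespan [min], energy [kWh]) per candidate allocation; toy values.
candidates = [(120, 8.0), (100, 9.5), (130, 7.0), (100, 8.5), (140, 9.0)]
print(pareto_front(candidates))   # [(120, 8.0), (130, 7.0), (100, 8.5)]
```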
♻ ★ Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild
Derek Ming Siang Tan, Shailesh, Boyang Liu, Alok Raj, Qi Xuan Ang, Weiheng Dai, Tanishq Duhan, Jimmy Chiun, Yuhong Cao, Florian Shkurti, Guillaume Sartoretti
To perform outdoor autonomous visual navigation and search, a robot may
leverage satellite imagery as a prior map. This can help inform high-level
search and exploration strategies, even when such images lack sufficient
resolution to allow for visual recognition of targets. However, there are
limited training datasets of satellite images with annotated targets that are
not directly visible. Furthermore, approaches which leverage large Vision
Language Models (VLMs) for generalization may yield inaccurate outputs due to
hallucination, leading to inefficient search. To address these challenges, we
introduce Search-TTA, a multimodal test-time adaptation framework with a
flexible plug-and-play interface compatible with various input modalities (e.g.
image, text, sound) and planning methods. First, we pretrain a satellite image
encoder to align with CLIP's visual encoder to output probability distributions
of target presence used for visual search. Second, our framework dynamically
refines CLIP's predictions during search using a test-time adaptation
mechanism. Through a novel feedback loop inspired by Spatial Poisson Point
Processes, uncertainty-weighted gradient updates are used to correct
potentially inaccurate predictions and improve search performance. To train and
evaluate Search-TTA, we curate AVS-Bench, a visual search dataset based on
internet-scale ecological data that contains up to 380k training and 8k
validation images (in- and out-domain). We find that Search-TTA improves
planner performance by up to 30.0%, particularly in cases with poor initial
CLIP predictions due to limited training data. It also performs comparably with
significantly larger VLMs, and achieves zero-shot generalization to unseen
modalities. Finally, we deploy Search-TTA on a real UAV via
hardware-in-the-loop testing, by simulating its operation within a large-scale
simulation that provides onboard sensing.
comment: Accepted for presentation at CoRL 2025. Paper website:
https://search-tta.github.io/
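The feedback loop can be approximated as a weighted test-time gradient update on the score map; below is a simple confidence-weighted stand-in (the paper's weights derive from a Spatial Poisson Point Process, which we do not reproduce, and all names are hypothetical).

```python
import torch
import torch.nn.functional as F

def tta_step(scores, observations, optimizer):
    """One illustrative uncertainty-weighted test-time adaptation step:
    sparse search feedback (binary target presence per map cell) corrects
    the score map, with updates from near-chance cells down-weighted."""
    probs = torch.sigmoid(scores)
    weights = ((probs - 0.5).abs() * 2.0).detach()  # 0 = uncertain, 1 = confident
    per_cell = F.binary_cross_entropy(probs, observations, reduction="none")
    loss = (weights * per_cell).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

scores = (torch.randn(64) * 2).requires_grad_()   # hypothetical CLIP-derived logits
opt = torch.optim.SGD([scores], lr=0.5)
obs = (torch.rand(64) > 0.8).float()              # sparse binary search feedback
print(tta_step(scores, obs, opt))
```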
♻ ★ Data-fused Model Predictive Control with Guarantees: Application to Flying Humanoid Robots
This paper introduces a Data-Fused Model Predictive Control (DFMPC) framework
that combines physics-based models with data-driven representations of unknown
dynamics. Leveraging Willems' Fundamental Lemma and an artificial equilibrium
formulation, the method enables tracking of changing, potentially unreachable
setpoints while explicitly handling measurement noise through slack variables
and regularization. We provide guarantees of recursive feasibility and
practical stability under input-output constraints for a specific class of
reference signals. The approach is validated on the iRonCub flying humanoid
robot, integrating analytical momentum models with data-driven turbine
dynamics. Simulations show improved tracking and robustness compared to a
purely model-based MPC, while maintaining real-time feasibility.
comment: 8 pages, 3 figures
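The data-driven half of DFMPC rests on Willems' Fundamental Lemma: with persistently exciting inputs, every trajectory of a linear system lies in the column space of block-Hankel matrices built from recorded data. A minimal construction sketch with toy data (the paper's slack variables and regularization are omitted):

```python
import numpy as np

def block_hankel(signal, depth):
    """Stack length-`depth` windows of a recorded trajectory as columns.
    Willems' Fundamental Lemma: with persistently exciting inputs, every
    length-`depth` trajectory of the underlying linear system lies in the
    column space of the stacked input/output Hankel matrices."""
    T = signal.shape[0]
    return np.hstack([signal[i:i + depth].reshape(-1, 1)
                      for i in range(T - depth + 1)])

# Toy SISO data standing in for measured input/output records.
u = np.random.randn(40, 1)
y = 0.1 * np.cumsum(u, axis=0)
H_u, H_y = block_hankel(u, 8), block_hankel(y, 8)
# Data-driven prediction: solve the known rows of [H_u; H_y] g = trajectory
# for g, then read unknown future outputs off the remaining rows of H_y @ g.
print(H_u.shape, H_y.shape)   # (8, 33) (8, 33)
```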
♻ ★ TrojanRobot: Physical-world Backdoor Attacks Against VLM-based Robotic Manipulation
Xianlong Wang, Hewen Pan, Hangtao Zhang, Minghui Li, Shengshan Hu, Ziqi Zhou, Lulu Xue, Aishan Liu, Yunpeng Jiang, Leo Yu Zhang, Xiaohua Jia
Robotic manipulation in the physical world is increasingly empowered by
large language models (LLMs) and vision-language models (VLMs), leveraging
their understanding and perception capabilities. Recently, various attacks
against such robotic policies have been proposed, with backdoor attacks
drawing considerable attention for their high stealth and strong persistence
capabilities. However, existing backdoor efforts are limited to simulators and
are difficult to realize in the physical world. To address this, we propose
TrojanRobot, a highly stealthy and broadly effective robotic backdoor attack
in the physical world. Specifically, we introduce a module-poisoning approach
that embeds a backdoor module into the modular robotic policy, enabling
backdoor control over the policy's visual perception module and thereby
backdooring the entire robotic policy. Our vanilla implementation leverages a
backdoor-finetuned VLM as the backdoor module. To enhance generalization in
physical environments, we propose a prime implementation, leveraging the
LVLM-as-a-backdoor paradigm and developing three types of prime attacks, i.e.,
permutation, stagnation, and intentional attacks, thus achieving
finer-grained backdoors.
Extensive experiments on the UR3e manipulator with 18 task instructions using
robotic policies based on four VLMs demonstrate the broad effectiveness and
physical-world stealth of TrojanRobot. Video demonstrations of our attack are
available at https://trojanrobot.github.io.
♻ ★ Evaluating the Robustness of Open-Source Vision-Language Models to Domain Shift in Object Captioning
Vision-Language Models (VLMs) have emerged as powerful tools for generating
textual descriptions from visual data. While these models excel on web-scale
datasets, their robustness to the domain shifts inherent in many real-world
applications remains under-explored. This paper presents a systematic
evaluation of VLM performance on a single-view object captioning task when
faced with a controlled, physical domain shift. We compare captioning accuracy
across two distinct object sets: a collection of multi-material, real-world
tools and a set of single-material, 3D-printed items. The 3D-printed set
introduces a significant domain shift in texture and material properties,
challenging the models' generalization capabilities. Our quantitative results
demonstrate that all tested VLMs show a marked performance degradation when
describing the 3D-printed objects compared to the real-world tools. This
underscores a critical limitation in the ability of current models to
generalize beyond surface-level features and highlights the need for more
robust architectures for real-world signal processing applications.
♻ ★ Built Different: Tactile Perception to Overcome Cross-Embodiment Capability Differences in Collaborative Manipulation ICRA 2026
Tactile sensing is a widely studied means of implicit communication between
robot and human. In this paper, we investigate how tactile sensing can help
bridge differences between robotic embodiments in the context of collaborative
manipulation. For a robot, learning and executing force-rich collaboration
require compliance with human interaction. While compliance is often achieved
with admittance control, many commercial robots lack the joint torque
monitoring needed for such control. To address this challenge, we present an
approach that uses tactile sensors and behavior cloning to transfer policies
from robots with these capabilities to those without. We train a single policy
that demonstrates positive transfer across embodiments, including robots
without torque sensing. We demonstrate this positive transfer on four different
tactile-enabled embodiments using the same policy trained on force-controlled
robot data. Across multiple proposed metrics, the best performance came from a
decomposed tactile shear-field representation combined with a pre-trained
encoder, which improved success rates over alternative representations.
comment: 8 pages including references, 8 figures, 2 tables, submitted to ICRA
2026
♻ ★ Learning Environment-Aware Affordance for 3D Articulated Object Manipulation under Occlusions NeurIPS 2023
Perceiving and manipulating 3D articulated objects in diverse environments is
essential for home-assistant robots. Recent studies have shown that point-level
affordance provides actionable priors for downstream manipulation tasks.
However, existing works primarily focus on single-object scenarios with
homogeneous agents, overlooking the realistic constraints imposed by the
environment and the agent's morphology, e.g., occlusions and physical
limitations. In this paper, we propose an environment-aware affordance
framework that incorporates both object-level actionable priors and environment
constraints. Unlike object-centric affordance approaches, learning
environment-aware affordance faces the challenge of combinatorial explosion due
to the complexity of various occlusions, characterized by their quantities,
geometries, positions and poses. To address this and enhance data efficiency,
we introduce a novel contrastive affordance learning framework capable of
training on scenes containing a single occluder and generalizing to scenes with
complex occluder combinations. Experiments demonstrate the effectiveness of our
proposed approach in learning affordance considering environment constraints.
Project page at https://chengkaiacademycity.github.io/EnvAwareAfford/
comment: In 37th Conference on Neural Information Processing Systems (NeurIPS
2023). Website at https://chengkaiacademycity.github.io/EnvAwareAfford/
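One way to picture the contrastive scheme is an InfoNCE-style loss that ties together affordance features of the same contact point across different single-occluder scenes; the sketch below is a generic version of that idea, not the paper's exact pairing strategy, and all shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_affordance_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style objective: pull together per-point affordance features
    of the same contact point under different single-occluder scenes
    (positive pair); push apart features of points whose actionability
    differs (negatives)."""
    a = F.normalize(anchor, dim=-1)
    pos = (a * F.normalize(positive, dim=-1)).sum(-1, keepdim=True) / tau
    neg = a @ F.normalize(negatives, dim=-1).t() / tau
    logits = torch.cat([pos, neg], dim=-1)
    labels = torch.zeros(len(a), dtype=torch.long)   # index 0 = the positive
    return F.cross_entropy(logits, labels)

feat = lambda: torch.randn(16, 128)                  # placeholder features
print(contrastive_affordance_loss(feat(), feat(), torch.randn(64, 128)))
```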
♻ ★ ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation
Jiawen Yu, Hairuo Liu, Qiaojun Yu, Jieji Ren, Ce Hao, Haitong Ding, Guangyu Huang, Guofan Huang, Yan Song, Panpan Cai, Cewu Lu, Wenqiang Zhang
Vision-Language-Action (VLA) models have advanced general-purpose robotic
manipulation by leveraging pretrained visual and linguistic representations.
However, they struggle with contact-rich tasks that require fine-grained
control involving force, especially under visual occlusion or dynamic
uncertainty. To address these limitations, we propose ForceVLA, a novel
end-to-end manipulation framework that treats external force sensing as a
first-class modality within VLA systems. ForceVLA introduces FVLMoE, a
force-aware Mixture-of-Experts fusion module that dynamically integrates
pretrained visual-language embeddings with real-time 6-axis force feedback
during action decoding. This enables context-aware routing across
modality-specific experts, enhancing the robot's ability to adapt to subtle
contact dynamics. We also introduce ForceVLA-Data, a new dataset
comprising synchronized vision, proprioception, and force-torque signals across
five contact-rich manipulation tasks. ForceVLA improves average task success by
23.2% over strong pi_0-based baselines, achieving up to 80% success in tasks
such as plug insertion. Our approach highlights the importance of multimodal
integration for dexterous manipulation and sets a new benchmark for physically
intelligent robotic control. Code and data will be released at
https://sites.google.com/view/forcevla2025.
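A force-aware MoE fusion block can be sketched as a router over modality-specialized experts, conditioned on the concatenated vision-language embedding and 6-axis force/torque reading; dimensions and expert count below are illustrative assumptions, not FVLMoE's actual configuration.

```python
import torch
import torch.nn as nn

class ForceAwareMoE(nn.Module):
    """Router conditioned on fused vision-language + force/torque input
    softly dispatches to modality-specialized experts (sizes illustrative)."""
    def __init__(self, vl_dim=512, ft_dim=6, hidden=256, n_experts=4):
        super().__init__()
        d = vl_dim + ft_dim
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, hidden))
            for _ in range(n_experts))

    def forward(self, vl_emb, force):
        x = torch.cat([vl_emb, force], dim=-1)
        gate = torch.softmax(self.router(x), dim=-1)            # (B, E) routing
        out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, H)
        return (gate.unsqueeze(-1) * out).sum(dim=1)            # weighted mixture

moe = ForceAwareMoE()
fused = moe(torch.randn(2, 512), torch.randn(2, 6))  # 6-axis F/T feedback
print(fused.shape)   # (2, 256) -> would feed the action decoder
```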
♻ ★ Spiking Neural Networks for Continuous Control via End-to-End Model-Based Learning
Despite recent progress in training spiking neural networks (SNNs) for
classification, their application to continuous motor control remains limited.
Here, we demonstrate that fully spiking architectures can be trained end-to-end
to control robotic arms with multiple degrees of freedom in continuous
environments. Our predictive-control framework combines Leaky
Integrate-and-Fire dynamics with surrogate gradients, jointly optimizing a
forward model for dynamics prediction and a policy network for goal-directed
action. We evaluate this approach on both a planar 2D reaching task and a
simulated 6-DOF Franka Emika Panda robot. Results show that SNNs can achieve
stable training and accurate torque control, establishing their viability for
high-dimensional motor tasks. An extensive ablation study highlights the role
of initialization, learnable time constants, and regularization in shaping
training dynamics. We conclude that while stable and effective control can be
achieved, recurrent spiking networks remain highly sensitive to hyperparameter
settings, underscoring the importance of principled design choices.
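The standard recipe the paper builds on, LIF dynamics trained with surrogate gradients, fits in a few lines: the spike is a Heaviside step in the forward pass and a smooth function in the backward pass. A minimal sketch (the surrogate shape and constants are common choices, not necessarily the authors'):

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike with a surrogate gradient (fast-sigmoid-style slope),
    the standard trick for end-to-end training of LIF networks."""
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()
    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        return grad_out / (1.0 + 10.0 * v.abs()) ** 2

def lif_step(v, x, spike_prev, beta=0.9, v_th=1.0):
    """One Leaky Integrate-and-Fire update with soft reset after a spike."""
    v = beta * v + x - spike_prev * v_th
    return v, SpikeFn.apply(v - v_th)

v, s = torch.zeros(32), torch.zeros(32)
for _ in range(50):                          # constant drive current
    v, s = lif_step(v, torch.full((32,), 0.3), s)
print(int(s.sum()))                          # neurons settle into regular spiking
```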
♻ ★ RoboMatch: A Unified Mobile-Manipulation Teleoperation Platform with Auto-Matching Network Architecture for Long-Horizon Tasks
Hanyu Liu, Yunsheng Ma, Jiaxin Huang, Keqiang Ren, Jiayi Wen, Yilin Zheng, Baishu Wan, Pan Li, Jiejun Hou, Haoru Luan, Zhihua Wang, Zhigong Song
This paper presents RoboMatch, a novel unified teleoperation platform for
mobile manipulation with an auto-matching network architecture, designed to
tackle long-horizon tasks in dynamic environments. Our system enhances
teleoperation performance, data collection efficiency, task accuracy, and
operational stability. The core of RoboMatch is a cockpit-style control
interface that enables synchronous operation of the mobile base and dual arms,
significantly improving control precision and data collection. Moreover, we
introduce the Proprioceptive-Visual Enhanced Diffusion Policy (PVE-DP), which
leverages Discrete Wavelet Transform (DWT) for multi-scale visual feature
extraction and integrates high-precision IMUs at the end-effector to enrich
proprioceptive feedback, substantially boosting fine manipulation performance.
Furthermore, we propose an Auto-Matching Network (AMN) architecture that
decomposes long-horizon tasks into logical sequences and dynamically assigns
lightweight pre-trained models for distributed inference. Experimental results
demonstrate that our approach improves data collection efficiency by over 20%,
increases task success rates by 20-30% with PVE-DP, and enhances long-horizon
inference performance by approximately 40% with AMN, offering a robust solution
for complex manipulation tasks.
♻ ★ TransDiffuser: Diverse Trajectory Generation with Decorrelated Multi-modal Representation for End-to-end Autonomous Driving
Xuefeng Jiang, Yuan Ma, Pengxiang Li, Leimeng Xu, Xin Wen, Kun Zhan, Zhongpu Xia, Peng Jia, Xianpeng Lang, Sheng Sun
In recent years, diffusion models have demonstrated remarkable potential
across diverse domains, from vision generation to language modeling.
Transferring their generative capabilities to modern end-to-end autonomous
driving systems has also emerged as a promising direction. However, existing
diffusion-based trajectory generative models often exhibit mode collapse, where
different random noises converge to similar trajectories after the denoising
process. State-of-the-art models therefore often rely on anchored trajectories
from a pre-defined trajectory vocabulary, or on scene priors from the training
set, to mitigate collapse and enrich the diversity of generated trajectories;
but such inductive biases are not available in real-world deployment, which
limits generalization to unseen scenarios. In this work, we investigate
the possibility of effectively tackling the mode collapse challenge without the
assumption of pre-defined trajectory vocabulary or pre-computed scene priors.
Specifically, we propose TransDiffuser, an encoder-decoder based generative
trajectory planning model, where the encoded scene information and motion
states serve as the multi-modal conditional input of the denoising decoder.
Different from existing approaches, we exploit a simple yet effective
multi-modal representation decorrelation optimization mechanism during the
denoising process to enrich the latent representation space which better guides
the downstream generation. Without any predefined trajectory anchors or
pre-computed scene priors, TransDiffuser achieves the PDMS of 94.85 on the
closed-loop planning-oriented benchmark NAVSIM, surpassing previous
state-of-the-art methods. Qualitative evaluation further shows that
TransDiffuser generates more diverse and plausible trajectories that explore
more of the drivable area.
comment: Under review
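A representation-decorrelation term of the kind described can be written as a penalty on off-diagonal correlations between latent dimensions; the sketch below is a generic form of such a regularizer and may differ in detail from TransDiffuser's mechanism.

```python
import torch

def decorrelation_loss(z):
    """Penalize off-diagonal entries of the batch correlation matrix so
    latent dimensions carry non-redundant information."""
    z = (z - z.mean(0)) / (z.std(0) + 1e-6)   # standardize each dimension
    c = (z.t() @ z) / z.shape[0]              # (D, D) correlation estimate
    off_diag = c - torch.diag(torch.diag(c))
    return (off_diag ** 2).sum() / z.shape[1]

latents = torch.randn(128, 64)                # a batch of denoiser latents
print(decorrelation_loss(latents))            # added to the training objective
```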
♻ ★ Plane Detection and Ranking via Model Information Optimization IROS
Plane detection from depth images is a crucial subtask with broad robotic
applications, often accomplished by iterative methods such as Random Sample
Consensus (RANSAC). While RANSAC is a robust strategy with strong probabilistic
guarantees, the ambiguity of its inlier threshold criterion makes it
susceptible to false positive plane detections. This issue is particularly
prevalent in complex real-world scenes, where the true number of planes is
unknown and multiple planes coexist. In this paper, we aim to address this
limitation by proposing a generalised framework for plane detection based on
model information optimization. Building on previous works, we treat the
observed depth readings as discrete random variables, with their probability
distributions constrained by the ground truth planes. Various models containing
different candidate plane constraints are then generated through repeated
random sub-sampling to explain our observations. By incorporating the physics
and noise model of the depth sensor, we can calculate the information for each
model, and the model with the least information is accepted as the most likely
ground truth. This information optimization process serves as an objective
mechanism for determining the true number of planes and preventing false
positive detections. Additionally, the quality of each detected plane can be
ranked by summing the information reduction of inlier points for each plane. We
validate these properties through experiments with synthetic data and find that
our algorithm estimates plane parameters more accurately compared to the
default Open3D RANSAC plane segmentation. Furthermore, we accelerate our
algorithm by partitioning the depth map using neural network segmentation,
which enhances its ability to generate more realistic plane parameters in
real-world data.
comment: Accepted as contributed paper in the 2025 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS)
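The model-information criterion has the flavor of minimum description length: each candidate model pays a fixed cost per plane plus the negative log-likelihood of the points under a sensor noise model, and the cheapest model wins. A toy sketch with a Gaussian noise stand-in (the paper's sensor-specific physics and noise model are more detailed):

```python
import numpy as np

def model_information(points, planes, sigma=0.01, param_cost=100.0):
    """Score a candidate model: a fixed cost per plane plus the negative
    log-likelihood of each point's residual to its best-fitting plane under
    Gaussian noise. Choosing the least-information model naturally punishes
    spurious extra planes."""
    info = param_cost * len(planes)
    const = np.log(sigma * np.sqrt(2 * np.pi))
    for p in points:
        r = min(abs(np.dot(n, p) + d) for n, d in planes)  # plane = (unit n, d)
        info += 0.5 * (r / sigma) ** 2 + const
    return info

pts = np.column_stack([np.random.rand(200, 2), np.zeros(200)])  # points on z = 0
one_plane = [((0.0, 0.0, 1.0), 0.0)]
two_planes = one_plane + [((1.0, 0.0, 0.0), -0.5)]              # spurious plane
print(model_information(pts, one_plane) < model_information(pts, two_planes))  # True
```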
♻ ★ Sign Language: Towards Sign Understanding for Robot Autonomy
Navigational signs are common aids for human wayfinding and scene
understanding, but are underutilized by robots. We argue that they benefit
robot navigation and scene understanding, by directly encoding privileged
information on actions, spatial regions, and relations. Interpreting signs in
open-world settings remains a challenge owing to the complexity of scenes and
signs, but recent advances in vision-language models (VLMs) make this feasible.
To advance progress in this area, we introduce the task of navigational sign
understanding which parses locations and associated directions from signs. We
offer a benchmark for this task, proposing appropriate evaluation metrics and
curating a test set capturing signs with varying complexity and design across
diverse public spaces, from hospitals to shopping malls to transport hubs. We
also provide a baseline approach using VLMs, and demonstrate their promise on
navigational sign understanding. Code and dataset are available on GitHub.
comment: This work has been submitted to the IEEE for possible publication
♻ ★ Keypoint-based Diffusion for Robotic Motion Planning on the NICOL Robot ICANN 2025
We propose a novel diffusion-based action model for robotic motion planning.
Commonly, established numerical planning approaches are used to solve general
motion planning problems, but have significant runtime requirements. By
leveraging the power of deep learning, we are able to achieve good results in a
much smaller runtime by learning from a dataset generated by these planners.
While our initial model uses point cloud embeddings in the input to predict
keypoint-based joint sequences in its output, we observed in our ablation study
that it remained challenging to condition the network on the point cloud
embeddings. We identified some biases in our dataset and refined it, which
improved the model's performance. Our model, even without the point cloud
encodings, outperforms numerical planners in runtime by an order of magnitude,
while reaching up to a 90% rate of collision-free solutions on the test set.
comment: Accepted and published at the 34th International Conference on
Artificial Neural Networks (ICANN 2025)
♻ ★ FCRF: Flexible Constructivism Reflection for Long-Horizon Robotic Task Planning with Large Language Models IROS 2025
Autonomous error correction is critical for domestic robots to achieve
reliable execution of complex long-horizon tasks. Prior work has explored
self-reflection in Large Language Models (LLMs) for task planning error
correction; however, existing methods are constrained by inflexible
self-reflection mechanisms that limit their effectiveness. Motivated by these
limitations and inspired by human cognitive adaptation, we propose the Flexible
Constructivism Reflection Framework (FCRF), a novel Mentor-Actor architecture
that enables LLMs to perform flexible self-reflection based on task difficulty,
while constructively integrating historical valuable experience with failure
lessons. We evaluated FCRF on diverse domestic tasks through simulation in
AlfWorld and physical deployment in the real-world environment. Experimental
results demonstrate that FCRF significantly improves overall performance and
self-reflection flexibility in complex long-horizon robotic tasks.
comment: 8 pages, 6 figures, IROS 2025
♻ ★ Traversing the Narrow Path: A Two-Stage Reinforcement Learning Framework for Humanoid Beam Walking
Traversing narrow paths is challenging for humanoid robots due to the sparse
and safety-critical footholds required. Purely template-based or end-to-end
reinforcement learning methods struggle on such harsh terrain. This paper
proposes a two-stage training framework for narrow-path traversal tasks,
coupling a template-based foothold planner with a low-level foothold tracker
from Stage-I training and a lightweight perception-aided foothold modifier
from Stage-II training. With a curriculum that progresses from flat ground to
narrow paths across stages, the resulting controller learns to robustly track
and safely modify foothold targets to ensure precise foot placement over
narrow paths. This framework preserves the interpretability of the
physics-based template and takes advantage of the generalization capability
from reinforcement learning, resulting in easy sim-to-real transfer. The
learned policies outperform purely template-based or reinforcement
learning-based baselines in terms of success rate, centerline adherence and
safety margins. Validation on a Unitree G1 humanoid robot yields successful
traversal of a 0.2 m wide, 3 m long beam over 20 trials without any failure.
comment: Project website:
https://huangtc233.github.io/Traversing-the-Narrow-Path/
♻ ★ Towards Bio-Inspired Robotic Trajectory Planning via Self-Supervised RNN ICANN
Trajectory planning in robotics is understood as generating a sequence of
joint configurations that will lead a robotic agent, or its manipulator, from
an initial state to the desired final state, thus completing a manipulation
task while considering constraints like robot kinematics and the environment.
Typically, this is achieved via sampling-based planners, which are
computationally intensive. Recent advances demonstrate that trajectory planning
can also be performed by supervised sequence learning of trajectories, often
requiring only a single or fixed number of passes through a neural
architecture, thus ensuring a bounded computation time. Such fully supervised
approaches, however, perform imitation learning; they do not learn based on
whether the trajectories can successfully reach a goal, but try to reproduce
observed trajectories. In our work, we build on this approach and propose a
cognitively inspired self-supervised learning scheme based on a recurrent
architecture for building a trajectory model. We evaluate the feasibility of
the proposed method on a kinematic planning task for a robotic arm. The
results suggest that the model can learn to generate trajectories using only
the given paired forward and inverse kinematics models, and indicate that
this novel method could facilitate planning for more complex manipulation
tasks requiring adaptive solutions.
comment: 12 pages, 4 figures, 2 tables. To be published in 2025 International
Conference on Artificial Neural Networks (ICANN) proceedings. This research
was funded by the Horizon Europe project TERAIS, GA no. 101079338, and in
part by the Slovak Grant Agency for Science (VEGA), project 1/0373/23. The
code can be found at https://doi.org/10.5281/zenodo.17127997
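The self-supervised idea can be condensed as follows: with no expert trajectories, the recurrent model is trained on whether forward kinematics of its generated joint sequence actually reaches the goal. Below is a toy planar 2-link version under that reading; the arm, loss terms, and training loop are our own simplifications (the paper's actual code is linked above).

```python
import torch
import torch.nn as nn

def fk_2link(q, lengths=(1.0, 1.0)):
    """Forward kinematics of a planar 2-link arm (stand-in for the paper's
    given FK model): joint angles -> end-effector position."""
    l1, l2 = lengths
    x = l1 * torch.cos(q[..., 0]) + l2 * torch.cos(q[..., 0] + q[..., 1])
    y = l1 * torch.sin(q[..., 0]) + l2 * torch.sin(q[..., 0] + q[..., 1])
    return torch.stack([x, y], dim=-1)

rnn = nn.GRU(input_size=2, hidden_size=64, batch_first=True)
head = nn.Linear(64, 2)                        # hidden state -> joint angles
opt = torch.optim.Adam([*rnn.parameters(), *head.parameters()], lr=1e-3)
goal = torch.tensor([[1.2, 0.8]])

for _ in range(200):                           # tiny self-supervised loop
    inp = goal.view(1, 1, 2).repeat(1, 10, 1)  # feed the goal at every step
    q_seq = head(rnn(inp)[0])                  # (1, 10, 2) joint trajectory
    # No expert demonstrations: the loss asks whether FK of the final
    # configuration reaches the goal, plus a smoothness regularizer.
    reach = ((fk_2link(q_seq[:, -1]) - goal) ** 2).sum()
    smooth = (q_seq[:, 1:] - q_seq[:, :-1]).pow(2).mean()
    loss = reach + 0.1 * smooth
    opt.zero_grad(); loss.backward(); opt.step()
print(float(reach))                            # shrinks toward 0 as training runs
```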